Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Brendan Gregg
On Wed, Jul 10, 2019 at 2:36 PM Kris Van Hees  wrote:
>
> On Wed, Jul 10, 2019 at 11:19:43PM +0200, Daniel Borkmann wrote:
> > On 07/10/2019 10:30 PM, Jonathan Corbet wrote:
> > > On Wed, 10 Jul 2019 21:32:25 +0200
> > > Daniel Borkmann  wrote:
> > >
> > >> Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I 
> > >> haven't
> > >> seen a strong compelling argument for why this needs to reside in the 
> > >> kernel
> > >> tree given we also have all the other tracing tools and many of which 
> > >> also
> > >> rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just 
> > >> name
> > >> a few.
> > >
> > > So I'm just watching from the sidelines here, but I do feel the need to
> > > point out that Kris appears to be trying to follow the previous feedback
> > > he got from Alexei, where creating tools/dtrace is exactly what he was
> > > told to do:
> > >
> > >   
> > > https://lwn.net/ml/netdev/20190521175617.ipry6ue7o24a2...@ast-mbp.dhcp.thefacebook.com/
> > >
> > > Now he's being told the exact opposite.  Not the best experience for
> > > somebody who is trying to make the kernel better.
> >
> > Ugh, agree, sorry for the misleading direction. Alexei is currently offgrid
> > this week, he might comment later.
> >
> > It has nothing to do with making the _kernel_ better, it's a /user space/ 
> > front
> > end for the existing kernel infrastructure like many of the other tracers 
> > out
> > there. Don't get me wrong, adding the missing /kernel parts/ for it is a 
> > totally
> > different subject [and _that_ is what is making the kernel better, not the 
> > former].
>
> I disagree.  Yes, the current patch obviously isn't making the kernel better
> because it doesn't touch the kernel.  But DTrace as a whole is not just a
> /front end/ to the existing kernel infrastructure, and I did make that point
> at LPC 2018 and in my emails.  Some of its more advanced features will lead
> to contributions to the kernel that (by virtue of being developed as part of
> this DTrace re-implementation) will more often than not be able to benefit
> other tracers as well.  I do think that aspect qualifies as working towards
> making the kenrel better.
>
> > Hypothetical question: does it make the _kernel_ better if we suddenly add 
> > a huge
> > and complex project like tools/mysql/ to the kernel tree? Nope.
> >
> > > There are still people interested in DTrace out there.  How would you
> > > recommend that Kris proceed at this point?
> >
> > My recommendation to proceed is to maintain the dtrace user space tooling in
> > its own separate project like the vast majority of all the other tracing 
> > projects
> > (see also the other advantages that Steven pointed out from his 
> > experience), and
> > extend the kernel bits whenever needed.
>
> I wish that would have been the initial recommendation because it certainly
> would have avoided me going down a path that was going to lead to rejection.
>
> Either way, I do hope that as work progresses and contributions to the kernel
> code are submitted in support of advancing tracing on Linux, those patches
> will receive a fair review and consideration.  I can appreciate that some
> people do not like DTrace or feel that it is not necessary, but personal
> opinions about tools should not be a deciding factor in whether a contribution
> has merit or not.

Hey Kris -- so you're referring to me, and I've used DTrace more than
anyone over the past 15 years, and I don't think anyone has used all
the different Linux tracers more than I have. I think my opinion has a
lot of value.


Brendan


Re: [PATCH V2 1/1 (was 0/1 by accident)] tools/dtrace: initial implementation of DTrace

2019-07-10 Thread Brendan Gregg
On Wed, Jul 10, 2019 at 1:30 PM Jonathan Corbet  wrote:
>
> On Wed, 10 Jul 2019 21:32:25 +0200
> Daniel Borkmann  wrote:
>
> > Looks like you missed Brendan Gregg's prior feedback from v1 [0]. I haven't
> > seen a strong compelling argument for why this needs to reside in the kernel
> > tree given we also have all the other tracing tools and many of which also
> > rely on BPF such as bcc, bpftrace, ply, systemtap, sysdig, lttng to just 
> > name
> > a few.
>
> So I'm just watching from the sidelines here, but I do feel the need to
> point out that Kris appears to be trying to follow the previous feedback
> he got from Alexei, where creating tools/dtrace is exactly what he was
> told to do:
>
>   
> https://lwn.net/ml/netdev/20190521175617.ipry6ue7o24a2...@ast-mbp.dhcp.thefacebook.com/

>From what I saw, the discussion was about kernel and user bits, where
the user bit was a "little tool is currently hardcoded to process a
single test case". I missed it first time around, but was going to
make the case that such tests belong in tools/testing/selftests/bpf
rather than tools/dtrace, since we have other similar test cases to
ensure bits of BPF work.

This patchset pivoted from a single test case to the entire DTrace
front end. If this was Kris's intent all along, it wasn't clear to me
(and maybe Alexei either) until now.

>
> Now he's being told the exact opposite.  Not the best experience for
> somebody who is trying to make the kernel better.
>
> There are still people interested in DTrace out there.

Yes, they can:

apt-get install bpftrace (or snap, yum, whatever)

and start solving production problems today, like we are.

> How would you
> recommend that Kris proceed at this point?

You may not be asking me, but I don't think it's best for Linux to
split our tracing expertise among 13 different tracers (SystemTap,
LTTng, ftrace, perf, dtrace4linux, OEL DTrace, ktap, sysdig, Intel
PIN, bcc, shark, ply, and bpftrace). bpftrace is already far ahead and
in use in production, and Kris is just starting building a new one. I
actually think we need to consolidate our expertise on fewer tracers,
which includes asking developers to jump the fence and work on other
projects (like I did myself). But I also recognize's Kris's need to
support legacy users, so I'll stop short of saying that it shouldn't
exist at all.

Brendan


Re: [PATCH 1/1] tools/dtrace: initial implementation of DTrace

2019-07-04 Thread Brendan Gregg
On Wed, Jul 3, 2019 at 8:17 PM Kris Van Hees  wrote:
>
> This initial implementation of a tiny subset of DTrace functionality
> provides the following options:
>
> dtrace [-lvV] [-b bufsz] -s script
> -b  set trace buffer size
> -l  list probes (only works with '-s script' for now)
> -s  enable or list probes for the specified BPF program
> -V  report DTrace API version
>
> The patch comprises quite a bit of code due to DTrace requiring a few
> crucial components, even in its most basic form.

This patchset has moved from adding tests (which actually belong in
selftests, not tools/dtrace), to the start of adding a giant tracer to
the kernel code base.

First, in some ways you're doing this for me -- I've been the number
one user of DTrace for 15 years -- and thanks, but no thanks. I don't
need this anymore. I needed this 6 years ago, and I would have helped
you build it, but in the meantime Linux has built something better,
built from the ground up for BPF: bpftrace.

Second, you argued that DTrace was needed because of speculative
tracing (a feature I never used), as you had customers who wanted it.
Those customers aren't going to be happy with this initial tiny
implementation of DTrace -- this is really the start of adding a large
and complex tracer to the kernel.

We've all been working under the assumption that these user-space
tracers did not belong in the kernel, and so far that's been working
fine for us. Is it now open season for tracers in the kernel? Let's
have:

tools/bpftrace
tools/ply
tools/systemtap
tools/LTTng
tools/sysdig
tools/ktap
etc

Yes, that's ridiculous. If there's only going to be one, let's have
the best one, bpftrace. We'll offer a version that is GPL, C, and has
no dependencies.

But it would be news to us all that we're allowed to have even one.

There are more things that don't make sense about this, but I'll stop
here for now.




Brendan


Re: [PATCH v3 net-next 2/5] net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint

2018-01-02 Thread Brendan Gregg
On Sat, Dec 30, 2017 at 7:06 PM, Yafang Shao  wrote:
> On Sun, Dec 31, 2017 at 6:33 AM, Brendan Gregg
>  wrote:
>> On Tue, Dec 19, 2017 at 7:12 PM, Yafang Shao  wrote:
>>> As sk_state is a common field for struct sock, so the state
>>> transition tracepoint should not be a TCP specific feature.
>>> Currently it traces all AF_INET state transition, so I rename this
>>> tracepoint to inet_sock_set_state tracepoint with some minor changes and 
>>> move it
>>> into trace/events/sock.h.
>>
>> The tcp:tcp_set_state probe is tcp_set_state(), so it's only going to
>> fire for TCP sessions. It's not broken, and we could add a
>> sctp:sctp_set_state as well. Replacing tcp:tcp_set_state with
>> inet_sk_set_state is feeling like we might be baking too much
>> implementation detail into the tracepoint API.
>>
>> If we must have inet_sk_set_state, then must we also delete 
>> tcp:tcp_set_state?
>>
>
> Hi Brendan,
>
> The reason we have to make this change could be got from this mail
> thread, https://patchwork.kernel.org/patch/10099243/ .
>
> The original tcp:tcp_set_state probe doesn't traced all TCP state transitions.
> There're some state transitions in inet_connection_sock.c and
> inet_hashtables.c are missed.
> So we have to place this probe into these two files to fix the issue.
> But as inet_connection_sock.c and inet_hashtables.c are common files
> for all IPv4 protocols, not only for TCP, so it is not proper to place
> a tcp_ function in these two files.
> That's why we decide to rename tcp:tcp_set_state probe to
> sock:inet_sock_set_state.

It kinda feels like we are fixing one exposing-implementation problem
(the missing state changes, which I'm happy to see fixed), by exposing
another (there's no tcp:tcp_set_state because we don't want to put tcp
functions in inet*.c files). Anyway...

If I'm to use sock:inet_sock_set_state for TCP tracing, I'd like
sk->sk_protocol exposed as a tracepoint argument so I can match on
IPPROTO_TCP. Otherwise I'll have to keep digging it out of (void
*)skaddr. (And if we're adding arguments, maybe consider sk_family as
well, to make it easier to see which address arguments to use).

Brendan


Re: [PATCH v3 net-next 2/5] net: tracepoint: replace tcp_set_state tracepoint with inet_sock_set_state tracepoint

2017-12-30 Thread Brendan Gregg
On Tue, Dec 19, 2017 at 7:12 PM, Yafang Shao  wrote:
> As sk_state is a common field for struct sock, so the state
> transition tracepoint should not be a TCP specific feature.
> Currently it traces all AF_INET state transition, so I rename this
> tracepoint to inet_sock_set_state tracepoint with some minor changes and move 
> it
> into trace/events/sock.h.

The tcp:tcp_set_state probe is tcp_set_state(), so it's only going to
fire for TCP sessions. It's not broken, and we could add a
sctp:sctp_set_state as well. Replacing tcp:tcp_set_state with
inet_sk_set_state is feeling like we might be baking too much
implementation detail into the tracepoint API.

If we must have inet_sk_set_state, then must we also delete tcp:tcp_set_state?

Brendan


> We dont need to create a file named trace/events/inet_sock.h for this one 
> single
> tracepoint.
>
> Two helpers are introduced to trace sk_state transition
> - void inet_sk_state_store(struct sock *sk, int newstate);
> - void inet_sk_set_state(struct sock *sk, int state);
> As trace header should not be included in other header files,
> so they are defined in sock.c.
>
> The protocol such as SCTP maybe compiled as a ko, hence export
> inet_sk_set_state().
>[...]


[tip:perf/core] perf sched timehist: Add --next option

2017-03-15 Thread tip-bot for Brendan Gregg
Commit-ID:  292c4a8f985b35b3738d5900fe256c4fed4cd3f5
Gitweb: http://git.kernel.org/tip/292c4a8f985b35b3738d5900fe256c4fed4cd3f5
Author: Brendan Gregg 
AuthorDate: Tue, 14 Mar 2017 01:56:29 +
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Tue, 14 Mar 2017 15:17:38 -0300

perf sched timehist: Add --next option

The --next option shows the next task for each context switch, providing
more context for the sequence of scheduler events.

  $ perf sched timehist --next | head
  Samples do not have callchains.
   time  cpu task name  waittime schdelay run time
 [tid/pid] (msec) (msec) (msec)
  -- --- -- - -- -
  374.793792 [0]  0.000  0.000 0.000 next: rngd[1524]
  374.793801 [0] rngd[1524] 0.000  0.000 0.009 next: swapper/0[0]
  374.794048 [7]  0.000  0.000 0.000 next: yes[30884]
  374.794066 [7] yes[30884] 0.000  0.000 0.018 next: swapper/7[0]
  374.794126 [2]  0.000  0.000 0.000 next: rngd[1524]
  374.794140 [2] rngd[1524] 0.325  0.006 0.013 next: swapper/2[0]
  374.794281 [3]  0.000  0.000 0.000 next: perf[31070]

Signed-off-by: Brendan Gregg 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Namhyung Kim 
Cc: Peter Zijlstra 
Link: 
http://lkml.kernel.org/r/1489456589-32555-1-git-send-email-bgr...@netflix.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/Documentation/perf-sched.txt |  4 
 tools/perf/builtin-sched.c  | 25 -
 2 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-sched.txt 
b/tools/perf/Documentation/perf-sched.txt
index d33dedd..a092a24 100644
--- a/tools/perf/Documentation/perf-sched.txt
+++ b/tools/perf/Documentation/perf-sched.txt
@@ -132,6 +132,10 @@ OPTIONS for 'perf sched timehist'
 --migrations::
Show migration events.
 
+-n::
+--next::
+   Show next task.
+
 -I::
 --idle-hist::
Show idle-related events only.
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 16170e9..b92c4d9 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -221,6 +221,7 @@ struct perf_sched {
unsigned intmax_stack;
boolshow_cpu_visual;
boolshow_wakeups;
+   boolshow_next;
boolshow_migrations;
boolshow_state;
u64 skipped_samples;
@@ -1897,14 +1898,18 @@ static char task_state_char(struct thread *thread, int 
state)
 }
 
 static void timehist_print_sample(struct perf_sched *sched,
+ struct perf_evsel *evsel,
  struct perf_sample *sample,
  struct addr_location *al,
  struct thread *thread,
  u64 t, int state)
 {
struct thread_runtime *tr = thread__priv(thread);
+   const char *next_comm = perf_evsel__strval(evsel, sample, "next_comm");
+   const u32 next_pid = perf_evsel__intval(evsel, sample, "next_pid");
u32 max_cpus = sched->max_cpu + 1;
char tstr[64];
+   char nstr[30];
u64 wait_time;
 
timestamp__scnprintf_usec(t, tstr, sizeof(tstr));
@@ -1937,7 +1942,12 @@ static void timehist_print_sample(struct perf_sched 
*sched,
if (sched->show_state)
printf(" %5c ", task_state_char(thread, state));
 
-   if (sched->show_wakeups)
+   if (sched->show_next) {
+   snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm, 
next_pid);
+   printf(" %-*s", comm_width, nstr);
+   }
+
+   if (sched->show_wakeups && !sched->show_next)
printf("  %-*s", comm_width, "");
 
if (thread->tid == 0)
@@ -2531,7 +2541,7 @@ static int timehist_sched_change_event(struct perf_tool 
*tool,
}
 
if (!sched->summary_only)
-   timehist_print_sample(sched, sample, &al, thread, t, state);
+   timehist_print_sample(sched, evsel, sample, &al, thread, t, 
state);
 
 out:
if (sched->hist_time.start == 0 && t >= ptime->start)
@@ -3341,6 +3351,7 @@ int cmd_sched(int argc, const char **argv, const char 
*prefix __maybe_unused)
OPT_BOOLEAN('S', "with-summary", &sched.summary,
"Show all syscalls and summary with statistics"),
OPT_BOOLEAN('w', "wakeups", &sched.show_wakeups, "Show wakeup events"),
+   OPT_BOOLEAN('n', "next", &sched.show_next, "Show next task"),
OPT_BOOLEAN('M', "migrations", &sched.show_migrations, "Show migration 
events"),
OPT_BOOLEAN('V&

Re: [PATCH] perf sched timehist: Add --next option

2017-03-14 Thread Brendan Gregg
On Tue, Mar 14, 2017 at 6:12 AM, Arnaldo Carvalho de Melo
 wrote:
> Em Tue, Mar 14, 2017 at 01:56:29AM +0000, Brendan Gregg escreveu:
>> The --next option shows the next task for each context switch, providing
>> more context for the sequence of scheduler events.
>>
>> $ perf sched timehist --next | head
>> Samples do not have callchains.
>>timecpu  task name   wait time  sch delay 
>>   run time
>> [tid/pid]  (msec) (msec) 
>> (msec)
>> --- --  --  -  - 
>>  -
>
> Thanks, applied.
>
> Minor request: when printing tool output in the cset comment, please add
> two spaces before all lines, my git-am scripts get confused with that
> --- thinking its a separator.

Ok, will do in the future, thanks Arnaldo!

Brendan


[PATCH] perf sched timehist: Add --next option

2017-03-13 Thread Brendan Gregg
The --next option shows the next task for each context switch, providing
more context for the sequence of scheduler events.

$ perf sched timehist --next | head
Samples do not have callchains.
   timecpu  task name   wait time  sch delay   
run time
[tid/pid]  (msec) (msec)
 (msec)
--- --  --  -  -  
-
 1136374.793792 []0.000  0.000
  0.000  next: rngd[1524]
 1136374.793801 []  rngd[1524]  0.000  0.000
  0.009  next: swapper/0[0]
 1136374.794048 [0007]0.000  0.000
  0.000  next: yes[30884]
 1136374.794066 [0007]  yes[30884]  0.000  0.000
  0.018  next: swapper/7[0]
 1136374.794126 [0002]0.000  0.000
  0.000  next: rngd[1524]
 1136374.794140 [0002]  rngd[1524]  0.325  0.006
  0.013  next: swapper/2[0]
 1136374.794281 [0003]0.000  0.000
  0.000  next: perf[31070]

Signed-off-by: Brendan Gregg 
---
 tools/perf/Documentation/perf-sched.txt |  4 
 tools/perf/builtin-sched.c  | 29 -
 2 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/tools/perf/Documentation/perf-sched.txt 
b/tools/perf/Documentation/perf-sched.txt
index d33dedd..a092a24 100644
--- a/tools/perf/Documentation/perf-sched.txt
+++ b/tools/perf/Documentation/perf-sched.txt
@@ -132,6 +132,10 @@ OPTIONS for 'perf sched timehist'
 --migrations::
Show migration events.
 
+-n::
+--next::
+   Show next task.
+
 -I::
 --idle-hist::
Show idle-related events only.
diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index b94cf0d..fedc8f2 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -221,6 +221,7 @@ struct perf_sched {
unsigned intmax_stack;
boolshow_cpu_visual;
boolshow_wakeups;
+   boolshow_next;
boolshow_migrations;
boolshow_state;
u64 skipped_samples;
@@ -1897,14 +1898,18 @@ static char task_state_char(struct thread *thread, int 
state)
 }
 
 static void timehist_print_sample(struct perf_sched *sched,
+ struct perf_evsel *evsel,
  struct perf_sample *sample,
  struct addr_location *al,
  struct thread *thread,
  u64 t, int state)
 {
struct thread_runtime *tr = thread__priv(thread);
+   const char *next_comm = perf_evsel__strval(evsel, sample, "next_comm");
+   const u32 next_pid = perf_evsel__intval(evsel, sample, "next_pid");
u32 max_cpus = sched->max_cpu + 1;
char tstr[64];
+   char nstr[30];
u64 wait_time;
 
timestamp__scnprintf_usec(t, tstr, sizeof(tstr));
@@ -1937,7 +1942,13 @@ static void timehist_print_sample(struct perf_sched 
*sched,
if (sched->show_state)
printf(" %5c ", task_state_char(thread, state));
 
-   if (sched->show_wakeups)
+   if (sched->show_next) {
+   snprintf(nstr, sizeof(nstr), "next: %s[%d]", next_comm,
+   next_pid);
+   printf(" %-*s", comm_width, nstr);
+   }
+
+   if (sched->show_wakeups && !sched->show_next)
printf("  %-*s", comm_width, "");
 
if (thread->tid == 0)
@@ -2531,7 +2542,8 @@ static int timehist_sched_change_event(struct perf_tool 
*tool,
}
 
if (!sched->summary_only)
-   timehist_print_sample(sched, sample, &al, thread, t, state);
+   timehist_print_sample(sched, evsel, sample, &al, thread, t,
+ state);
 
 out:
if (sched->hist_time.start == 0 && t >= ptime->start)
@@ -3340,6 +3352,7 @@ int cmd_sched(int argc, const char **argv, const char 
*prefix __maybe_unused)
OPT_BOOLEAN('S', "with-summary", &sched.summary,
"Show all syscalls and summary with statistics"),
OPT_BOOLEAN('w', "wakeups", &sched.show_wakeups, "Show wakeup events"),
+   OPT_BOOLEAN('n', "next", &sched.show_next, "Show next task"),
OPT_BOOLEAN('M', "migrations", &sched.show_migrations, "Show migration 
events"),
OPT_BOOLEAN('V', "cpu-visual", &sched.show_cpu_visual, "Add CPU 
visual"),
OPT_BOOLEAN('I', &

Re: [PATCH v3 2/2] perf/sdt: Directly record SDT events with 'perf record'

2017-02-28 Thread Brendan Gregg
On Tue, Feb 28, 2017 at 2:31 PM, Brendan Gregg
 wrote:
> G'Day Ravi,
>
[...]
> Now retrying perf:
>
> # ./perf record -e sdt_node:http__server__request -a
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.446 MB perf.data (3 samples) ]
> # ./perf script
> node  7646 [002]   361.012364:
> sdt_node:http__server__request: (dc2e69)
> node  7646 [002]   361.204718:
> sdt_node:http__server__request: (dc2e69)
> node  7646 [002]   361.363043:
> sdt_node:http__server__request: (dc2e69)
>
> Now perf works.
>
> If I restart the node process, it goes back to the broken state.
>

Oh sorry, I forgot about that these Node.js probes are behind an
is-enabled semaphore.

$ readelf -n `which node`
[...]
  stapsdt  0x0089NT_STAPSDT (SystemTap probe descriptors)
Provider: node
Name: http__server__request
Location: 0x00dc2e69, Base: 0x0112e064, Semaphore:
0x01470954
Arguments: 8@%r14 8@%rax 8@-4344(%rbp) -4@-4348(%rbp)
8@-4304(%rbp) 8@-4312(%rbp) -4@-4352(%rbp)
# dd if=/proc/31695/mem bs=1 count=1 skip=$(( 0x01470954 ))
2>/dev/null | xxd
: 00   .
 # printf "\x1" | dd of=/proc/31695/mem bs=1 count=1 seek=$((
0x01470954 )) 2>/dev/null
# dd if=/proc/31695/mem bs=1 count=1 skip=$(( 0x01470954 ))
2>/dev/null | xxd
: 01   .
# ./perf record -e sdt_node:http__server__request -a
Matching event(s) from uprobe_events:
   sdt_node:http__server__request  0x9c2e69@/usr/local/bin/node
Use 'perf probe -d ' to delete event(s).
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.280 MB perf.data (3 samples) ]
# ./perf script
node 31695 [003] 24947.168761:
sdt_node:http__server__request: (dc2e69)
node 31695 [003] 24947.476143:
sdt_node:http__server__request: (dc2e69)
node 31695 [003] 24947.679090:
sdt_node:http__server__request: (dc2e69)

So setting that to 1 made the probe work from perf. I guess this is
not a problem with this patch set, but rather a feature request for
the next one: is-enabled SDT support.

Were probe arguments supposed to work? I don't notice them in the perf
script output.

PS, if it's helpful, here's the commands to build node with these SDT probes:

$ sudo apt-get install systemtap-sdt-dev   # adds "dtrace", used
by node build
$ wget https://nodejs.org/dist/v4.4.1/node-v4.4.1.tar.gz
$ tar xvf node-v4.4.1.tar.gz
$ cd node-v4.4.1
$ ./configure --with-dtrace
$ make -j 8

Brendan


Re: [PATCH v3 2/2] perf/sdt: Directly record SDT events with 'perf record'

2017-02-28 Thread Brendan Gregg
G'Day Ravi,

On Thu, Feb 23, 2017 at 11:43 PM, Ravi Bangoria
 wrote:
>
> From: Hemant Kumar 
>
> Add support for directly recording SDT events which are present in
> the probe cache. Without this patch, we could probe into SDT events
> using 'perf probe' and 'perf record'. With this patch, we can probe
> the SDT events directly using 'perf record'.
>
> For example :
>
>   $ perf list sdt
> sdt_libpthread:mutex_entry [SDT event]
> sdt_libc:setjmp[SDT event]
> ...
>
>   $ perf record -a -e sdt_libc:setjmp
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.286 MB perf.data (1065 samples) ]
>
>   $ perf script
>  bash   803 [002]  6492.190311: sdt_libc:setjmp: (7f1d503b56a1)
> login   488 [001]  6496.791583: sdt_libc:setjmp: (7ff3013d56a1)
>   fprintd 11038 [003]  6496.808032: sdt_libc:setjmp: (7fdedf5936a1)
> [...]


Thanks, I like the usage. I ran into trouble testing on Node.js:

# ./perf buildid-cache --add `which node`
# ./perf list | grep sdt_node
  sdt_node:gc__done  [SDT event]
  sdt_node:gc__start [SDT event]
  sdt_node:http__client__request [SDT event]
  sdt_node:http__client__response[SDT event]
  sdt_node:http__server__request [SDT event]
  sdt_node:http__server__response[SDT event]
  sdt_node:net__server__connection   [SDT event]
  sdt_node:net__stream__end  [SDT event]
# ./perf record -e sdt_node:http__server__request -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.308 MB perf.data ]
# ./perf script
#

No events. I can see it had set it up:

# cat /sys/kernel/debug/tracing/uprobe_events
p:sdt_node/http__server__request /usr/local/bin/node:0x009c2e69

Ok. Am I sure my workload is working? Trying from bcc/eBPF:

# /mnt/src/bcc/tools/trace.py 'u:node:http__server__request'
failed to enable probe 'http__server__request'; a possible cause can
be that the probe requires a pid to enable
# /mnt/src/bcc/tools/trace.py -p `pgrep node` 'u:node:http__server__request'
In file included from /virtual/main.c:41:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/blkdev.h:9:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/genhd.h:64:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/device.h:24:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/pinctrl/devinfo.h:21:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/pinctrl/consumer.h:17:
In file included from
/lib/modules/4.10.0-rc8-virtual/build/include/linux/seq_file.h:10:
/lib/modules/4.10.0-rc8-virtual/build/include/linux/fs.h:2648:9:
warning: comparison of unsigned enum expression < 0 is always false
[-Wtautological-compare]
if (id < 0 || id >= READING_MAX_ID)
~~ ^ ~
1 warning generated.
PIDTIDCOMM FUNC
7646   7646   node http__server__request
7646   7646   node http__server__request
7646   7646   node http__server__request
^C

(ignore the warning; I just asked lkml about it). So that works. It
instrumented:

# cat /sys/kernel/debug/tracing/uprobe_events
p:uprobes/p__usr_local_bin_node_0x9c2e69_bcc_25410
/usr/local/bin/node:0x009c2e69

Now retrying perf:

# ./perf record -e sdt_node:http__server__request -a
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.446 MB perf.data (3 samples) ]
# ./perf script
node  7646 [002]   361.012364:
sdt_node:http__server__request: (dc2e69)
node  7646 [002]   361.204718:
sdt_node:http__server__request: (dc2e69)
node  7646 [002]   361.363043:
sdt_node:http__server__request: (dc2e69)

Now perf works.

If I restart the node process, it goes back to the broken state.

This doesn't fix it either:

# ./perf probe sdt_node:http__server__request
Added new event:
  sdt_node:http__server__request (on %http__server__request in
/usr/local/bin/node)

You can now use it in all perf tools, such as:

perf record -e sdt_node:http__server__request -aR sleep 1

Hint: SDT event can be directly recorded with 'perf record'. No need
to create probe manually.

Brendan


Re: [PATCH 11/13] fs: fix unsigned enum warning with gcc-4.2

2017-02-28 Thread Brendan Gregg
On Tue, Jan 3, 2017 at 2:47 PM, Brendan Gregg  wrote:
>
> On Fri, Dec 16, 2016 at 2:56 AM, Arnd Bergmann  wrote:
> >
> > With arm-linux-gcc-4.2, almost every file we build in the kernel ends
> > up with this warning:
> >
> > include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is 
> > always false
> >
>
> Thanks, I'd like to see this fixed as a similar warning gets printed
> whenever running many of the bcc/BPF tools, which gets annoying and is
> user-visible. eg:
>
> # /usr/share/bcc/tools/xfsslower 1
> In file included from /virtual/main.c:3:
> /lib/modules/4.8.6-300.fc25.x86_64/build/include/linux/fs.h:2677:9:
> warning: comparison of unsigned enum expression < 0 is always false
> [-Wtautological-compare]
> if (id < 0 || id >= READING_MAX_ID)
> ~~ ^ ~
> 1 warning generated.
> Tracing XFS operations slower than 1 ms
> TIME COMM   PIDT BYTES   OFF_KB   LAT(ms) FILENAME
> 14:44:27 cksum  4414   R 65536   0   1.02 chcon
> 14:44:27 cksum  4414   R 65536   0   1.20 cpio
> 14:44:27 cksum  4414   R 65536   0   1.01 diff
> 14:44:27 cksum  4414   R 65536   0   1.15 dir
> [...]
>
> This patch fixes the warning.
>

What's the status of this patch? I still get these warnings on latest. Thanks,

Brendan


Re: [PATCH 11/13] fs: fix unsigned enum warning with gcc-4.2

2017-01-03 Thread Brendan Gregg
On Fri, Dec 16, 2016 at 2:56 AM, Arnd Bergmann  wrote:
>
> With arm-linux-gcc-4.2, almost every file we build in the kernel ends
> up with this warning:
>
> include/linux/fs.h:2648: warning: comparison of unsigned expression < 0 is 
> always false
>

Thanks, I'd like to see this fixed as a similar warning gets printed
whenever running many of the bcc/BPF tools, which gets annoying and is
user-visible. eg:

# /usr/share/bcc/tools/xfsslower 1
In file included from /virtual/main.c:3:
/lib/modules/4.8.6-300.fc25.x86_64/build/include/linux/fs.h:2677:9:
warning: comparison of unsigned enum expression < 0 is always false
[-Wtautological-compare]
if (id < 0 || id >= READING_MAX_ID)
~~ ^ ~
1 warning generated.
Tracing XFS operations slower than 1 ms
TIME COMM   PIDT BYTES   OFF_KB   LAT(ms) FILENAME
14:44:27 cksum  4414   R 65536   0   1.02 chcon
14:44:27 cksum  4414   R 65536   0   1.20 cpio
14:44:27 cksum  4414   R 65536   0   1.01 diff
14:44:27 cksum  4414   R 65536   0   1.15 dir
[...]

This patch fixes the warning.

Brendan

>
> Later versions don't have this problem, but it's easy enough to
> work around.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 398cf20a706d..782c2a292fd7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2645,7 +2645,7 @@ static const char * const kernel_read_file_str[] = {
>
>  static inline const char *kernel_read_file_id_str(enum kernel_read_file_id 
> id)
>  {
> -   if (id < 0 || id >= READING_MAX_ID)
> +   if ((unsigned)id >= READING_MAX_ID)
> return kernel_read_file_str[READING_UNKNOWN];
>
> return kernel_read_file_str[id];
> --
> 2.9.0
>


Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

2016-12-02 Thread Brendan Gregg
On Fri, Nov 25, 2016 at 7:34 AM, Vincent Guittot
 wrote:
>
> find_idlest_group() only compares the runnable_load_avg when looking for
> the least loaded group. But on fork intensive use case like hackbench
> where tasks blocked quickly after the fork, this can lead to selecting the
> same CPU instead of other CPUs, which have similar runnable load but a
> lower load_avg.
>
> When the runnable_load_avg of 2 CPUs are close, we now take into account
> the amount of blocked load as a 2nd selection factor. There is now 3 zones
> for the runnable_load of the rq:
> -[0 .. (runnable_load - imbalance)] : Select the new rq which has
> significantly less runnable_load
> -](runnable_load - imbalance) .. (runnable_load + imbalance)[ : The
> runnable load are close so we use load_avg to chose between the 2 rq
> -[(runnable_load + imbalance) .. ULONG_MAX] : Keep the current rq which
> has significantly less runnable_load
>

For background, is this from the "A decade of wasted cores" paper's
patches? What's the expected typical gain? Thanks,

Brendan


Re: [PATCH v2 net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-31 Thread Brendan Gregg
On Wed, Aug 31, 2016 at 2:50 PM, Alexei Starovoitov  wrote:
> Hi Peter, Dave,
>
> this patch set is a follow up to the discussion:
> https://lkml.kernel.org/r/20160804142853.GO6862%20()%20twins%20!%20programming%20!%20kicks-ass%20!%20net
> It turned out to be simpler than what we discussed.
>
> Patches 1-3 is bpf-side prep for the main patch 4
> that adds bpf program as an overflow_handler to sw and hw perf_events.
> Peter, please review.
>
> Patches 5 and 6 are examples from myself and Brendan.
>
> v1-v2: fixed issues spotted by Peter and Daniel.

Thanks Alexei!

Tested-by: Brendan Gregg 

Brendan


Re: [PATCH net-next 0/6] perf, bpf: add support for bpf in sw/hw perf_events

2016-08-29 Thread Brendan Gregg
On Mon, Aug 29, 2016 at 5:19 AM, Peter Zijlstra  wrote:
>
> On Fri, Aug 26, 2016 at 07:31:18PM -0700, Alexei Starovoitov wrote:
> > Hi Peter, Dave,
> >
> > this patch set is a follow up to the discussion:
> > https://lkml.org/lkml/2016/8/4/304
> > It turned out to be simpler than what we discussed.
> >
> > Patches 1-3 is a bpf-side prep for the main patch 4
> > that adds bpf program as an overflow_handler to sw and hw perf_events.
> > Peter, please review.
> >
> > Patches 5 and 6 are tests/examples from myself and Brendan.
>
> Brendan, so this works for you without extra hacks required?

Yes, thanks for checking, I've done both IP and stack sampling so far with it.

Brendan


[tip:perf/urgent] perf script: Add 'bpf-output' field to usage message

2016-08-09 Thread tip-bot for Brendan Gregg
Commit-ID:  bcdc09af3ef30ef071677544ce23a1c8873a2dda
Gitweb: http://git.kernel.org/tip/bcdc09af3ef30ef071677544ce23a1c8873a2dda
Author: Brendan Gregg 
AuthorDate: Wed, 3 Aug 2016 02:47:49 +
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Tue, 9 Aug 2016 10:46:43 -0300

perf script: Add 'bpf-output' field to usage message

This adds the 'bpf-output' field to the perf script usage message, and docs.

Signed-off-by: Brendan Gregg 
Cc: Alexander Shishkin 
Cc: Alexei Starovoitov 
Cc: Peter Zijlstra 
Cc: Wang Nan 
Link: 
http://lkml.kernel.org/r/1470192469-11910-4-git-send-email-bgr...@netflix.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/Documentation/perf-script.txt | 4 ++--
 tools/perf/builtin-script.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt 
b/tools/perf/Documentation/perf-script.txt
index 1f6c705..053bbbd 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -116,8 +116,8 @@ OPTIONS
 --fields::
 Comma separated list of fields to print. Options are:
 comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
-   srcline, period, iregs, brstack, brstacksym, flags.
-Field list can be prepended with the type, trace, sw or hw,
+srcline, period, iregs, brstack, brstacksym, flags, bpf-output,
+callindent. Field list can be prepended with the type, trace, sw or hw,
 to indicate to which event type the field list applies.
 e.g., -F sw:comm,tid,time,ip,sym  and -F trace:time,cpu,trace
 
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 971ff91..9c640a8 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2116,7 +2116,7 @@ int cmd_script(int argc, const char **argv, const char 
*prefix __maybe_unused)
 "Valid types: hw,sw,trace,raw. "
 "Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
 "addr,symoff,period,iregs,brstack,brstacksym,flags,"
-"callindent", parse_output_fields),
+"bpf-output,callindent", parse_output_fields),
OPT_BOOLEAN('a', "all-cpus", &system_wide,
"system-wide collection from all CPUs"),
OPT_STRING('S', "symbols", &symbol_conf.sym_list_str, 
"symbol[,symbol...]",


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-05 Thread Brendan Gregg
On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra  wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via static-key mechanism.
>> I don't think Peter's objection for the tracepoint was due to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.

Coincidentally I$ footprint was my most recent use case for needing
this: I have an I$ busting workload, and wanting to profile
instructions at a very high rate to get a breakdown of I$ population.
(Normally I'd use I$ miss overflow, but none of our Linux systems have
PMCs: cloud.)

> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.

Sure, which is why I thought we'd have two in a perf category. I'm all
for PMCs events, even though we can't currently use them!

>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state to you need?

I was pulling in regs after get_irq_regs(), struct perf_event *event
after it's populated. Not that hard to duplicate. Just noting it
didn't map directly to the function entry.

I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between it's samples and other concurrent
sessions.

(I  was thinking of changing my patch to expose pid_t instead of
perf_event, since I was noticing it didn't add many instructions.)

[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper.
>> (probably new bpf program type). Then not only periodic events
>> will be triggering bpf prog, but pmu events as well.
>
> Exactly.

Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.

[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is done deal.
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?

For timer it can be: struct pt_regs, pid_t.

So that would restrict your BPF program to one timer, since if you had
two (from one pid) you couldn't tell them apart. But I'm not sure of a
use case for two in-kernel timers. If there were, we could also add
struct perf_event_attr, which has enough info to tell things apart,
and is already exposed to user space.

I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.

Thanks,

Brendan


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-04 Thread Brendan Gregg
On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
 wrote:
> On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
>> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
>>
>> > As for pmu tracepoints: if I were to instrument it (although I wasn't
>> > planning to), I'd put a tracepoint in perf_event_overflow() called
>> > "perf:perf_overflow", with the same arguments. That could then be used
>> > for all PMU overflow events, without needing to add specific
>> > tracepoints.
>>
>> Could we not teach BPF to replace event->overflow_handler and inject
>> itself there?
>>
>> We don't currently have nice interfaces for doing that, but it should be
>> possible to do I think. We already have the indirect function call, so
>> injecting ourself there has 0 overhead.

Sounds like a good idea, especially for things like struct
file_operations so that we can statically instrument file system
read/writes with zero non-enabled overhead, and not worry about high
frequency workloads (>10M events/sec).

These perf probes aren't high frequency, though, and the code is not
normally in use, so overhead should be much less of a concern.
Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
tracepoint code is still adding a mem read, conditional, and branch,
then that's not many instructions, especially considering the normal
use case of these perf functions: creating records and writing to a
perf ring buffer, then picking that up in user space by perf, then
either processing it live or writing to perf.data, back to the file
system, etc. It would be hard to benchmark the effect of adding a few
instructions to that path (and any results may be more sensitive to
cache line placement than the instructions).

The perf:perf_hrtimer probe point is also reading state mid-way
through a function, so it's not quite as simple as wrapping the
function pointer. I do like that idea, though, but for things like
struct file_operations.

>
> you're right. All makes sense. I guess I was too lazy to look into
> how to do it properly. Adding a tracepoint looked like quick and
> easy way to achieve the same.
> As far as api goes probably existing IOC_SET_BPF ioctl will do too.
> Currently overflow_handler is set at event alloc time. If we start
> changing it on the fly with atomic xchg(), afaik things shouldn't
> break, since each overflow_handler is run to completion and doesn't
> change global state, right?
>

How would it be implemented? I was thinking of adding explicit wrappers, eg:

static ssize_t
__ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from)
{
trace_ext4_write_iter(iocb, from);
ext4_file_write_iter(iocb, from);
}

Then switching in __ext4_file_write_iter_tracepoint() as needed.

I was wondering about doing this dynamically, but wouldn't that then
create another problem of maintenance -- we'd need to decorate such
code with a comment, at least, to say "this function is exposed as a
tracepoint".

If a dynamic approach is still desired, and the goal is zero
non-enabled overhead, then I'd also wonder if we could leverage
kprobes to do this. Imagine walking a file_operations to find the
addresses, and then kprobe-ing the addresses. Not the same (or
probably as efficient) as setting the function pointer, but, using
existing kprobes.

Brendan


Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-03 Thread Brendan Gregg
On Wed, Aug 3, 2016 at 2:48 AM, Peter Zijlstra  wrote:
>
> On Wed, Aug 03, 2016 at 02:47:47AM +, Brendan Gregg wrote:
> > When perf is performing hrtimer-based sampling, this tracepoint can be used
> > by BPF to run additional logic on each sample. For example, BPF can fetch
> > stack traces and frequency count them in kernel context, for an efficient
> > profiler.
>
> Urgh, no like.
>
> And then next week we'll add a tracepoint to some other random pmu or
> whatnot.

I wouldn't want random pmu tracepoints either, but I can see how it
looks with a new tracepoint category of "perf:", and maybe I shouldn't
have called it that. Would it be better if the tracepoint was called
"timer:perf_hrtimer"?

As for pmu tracepoints: if I were to instrument it (although I wasn't
planning to), I'd put a tracepoint in perf_event_overflow() called
"perf:perf_overflow", with the same arguments. That could then be used
for all PMU overflow events, without needing to add specific
tracepoints. The "perf:" category would then have two tracepoints, and
would not need to grow. Does that option sound better than
"timer:perf_hrtimer"? (As in, keeping the category "perf" so that we
have a later option of adding a perf:perf_overflow tracepoint if need
be?)

It's really perf_hrtimer that we'll use daily, for CPU profiling with
in-kernel aggregations of instruction pointers and stack traces, and
other sampling needs. It's one tracepoint that will add much value,
and will mean that BPF programs can then be attached to kprobes,
uprobes, tracepoints, and timed samples.

Thanks for your consideration,

Brendan


Re: [PATCH] perf/core: Add a tracepoint for perf sampling

2016-08-02 Thread Brendan Gregg
On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F)  wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg  wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be
>>> used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes, the following example samples at 999 Hertz, and emits the
instruction pointer only when it is within a custom address range, as
checked by BPF. Eg:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
-e ./sampleip_range.c/map:channel.event=evt/ \
-a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
  dd  6501  3058.117379:
  BPF output: : 3c 4c 21 81 ff ff ff ff  
#include 

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START  0x81214b90
#define RANGE_END0x81214cd0

struct bpf_map_def {
unsigned int type;
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
(void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
(void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(u32),
.max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
unsigned long long pad;
struct pt_regs *regs;
struct perf_event *event;
};
SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
struct pt_regs regs = {};
probe_read(®s, sizeof(regs), ctx->regs);
if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
perf_event_output(ctx, &channel, get_smp_processor_id(),
   ®s.ip, sizeof(regs.ip));
}
return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/* END ***/

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
>   # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for much more than just the perf
timer. The perf:perf_hrtimer tracepoint also has registers and perf
context as arguments, which can be used for profiling programs.

Thanks for the comments,

Brendan


[PATCH v2 2/3] samples/bpf: Add a sampling BPF example

2016-08-02 Thread Brendan Gregg
This example samples the instruction pointer at a timed interval, and
frequency counts it in a BPF map. It is an example of summarizing sampled
data in-kernel for passing to user space. It uses the perf:perf_hrtimer
tracepoint with perf_events sampling.

Example output:

Sampling at 99 Hertz for 5 seconds. Ctrl-C also ends.
ADDRKSYM COUNT
0x81257088  __fsnotify_parent1
0x7f20b792d9b0  (user)   1
0x8121469e  __vfs_read   1
0x81214afd  rw_verify_area   1
0x8123327e  __fget_light 1
0x7fc0965a6e2c  (user)   1
0x81233c04  __fdget_pos  1
0x81378528  common_file_perm 1
0x404d90(user)   1
0x81214c13  vfs_read 1
[...]
0x813d9e97  copy_user_enhanced_fast_string   3
0x817e310c  _raw_spin_lock_irqsave   4
0x817e31a0  entry_SYSCALL_64_fastpath4
0x814fb96c  extract_crng 6
0x813d9e95  copy_user_enhanced_fast_string   7
0x814fb8a3  _extract_crng7
0x817e2d55  _raw_spin_unlock_irqrestore  1399
0x8105fb46  native_safe_halt 2190

It also has basic options:

USAGE: sampleip [-F freq] [duration]
   -F freq# sample frequency (Hertz), default 99
   duration   # sampling duration (seconds), default 5

Signed-off-by: Brendan Gregg 
Cc: Alexei Starovoitov 
Cc: Wang Nan 
---
 samples/bpf/Makefile|   4 +
 samples/bpf/sampleip_kern.c |  48 +++
 samples/bpf/sampleip_user.c | 189 
 3 files changed, 241 insertions(+)
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 90ebf7d..dc88a1e 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -24,6 +24,7 @@ hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
 hostprogs-y += xdp1
 hostprogs-y += xdp2
+hostprogs-y += sampleip
 
 test_verifier-objs := test_verifier.o libbpf.o
 test_maps-objs := test_maps.o libbpf.o
@@ -49,6 +50,7 @@ test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
+sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -74,6 +76,7 @@ always += parse_varlen.o parse_simple.o parse_ldabs.o
 always += test_cgrp2_tc_kern.o
 always += xdp1_kern.o
 always += xdp2_kern.o
+always += sampleip_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 
@@ -97,6 +100,7 @@ HOSTLOADLIBES_map_perf_test += -lelf -lrt
 HOSTLOADLIBES_test_overhead += -lelf -lrt
 HOSTLOADLIBES_xdp1 += -lelf
 HOSTLOADLIBES_xdp2 += -lelf
+HOSTLOADLIBES_sampleip += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
new file mode 100644
index 000..afec3fe
--- /dev/null
+++ b/samples/bpf/sampleip_kern.c
@@ -0,0 +1,48 @@
+/* Copyright 2016 Netflix, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define MAX_IPS8192
+
+#define _(P) ({typeof(P) val; bpf_probe_read(&val, sizeof(val), &P); val;})
+
+struct bpf_map_def SEC("maps") ip_map = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(u64),
+   .value_size = sizeof(u32),
+   .max_entries = MAX_IPS,
+};
+
+/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
+struct perf_hrtimer_args {
+   unsigned long long pad;
+   struct pt_regs *regs;
+   struct perf_event *event;
+};
+SEC("tracepoint/perf/perf_hrtimer")
+int do_sample(struct perf_hrtimer_args *args)
+{
+   struct pt_regs *regs;
+   u64 ip;
+   u32 *value, init_val = 1;
+
+   regs = _(args->regs);
+   ip = _(regs->ip);
+   value = bpf_map_lookup_elem(&ip_map, &ip);
+   if (value)
+   *value += 1;
+   else
+   /* E2BIG not tested for this example only */
+   bpf_map_update_elem(&ip_map, &ip, &init_val, BPF_ANY);
+
+   return 0;
+}
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/sampleip_user.c b/samples/bpf/sampleip_user.c
new file mod

[PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling

2016-08-02 Thread Brendan Gregg
When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

Signed-off-by: Brendan Gregg 
Cc: Alexei Starovoitov 
Cc: Wang Nan 
---
 include/trace/events/perf.h | 29 +
 kernel/events/core.c|  5 +
 2 files changed, 34 insertions(+)
 create mode 100644 include/trace/events/perf.h

diff --git a/include/trace/events/perf.h b/include/trace/events/perf.h
new file mode 100644
index 000..461770d
--- /dev/null
+++ b/include/trace/events/perf.h
@@ -0,0 +1,29 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM perf
+
+#if !defined(_TRACE_PERF_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PERF_H
+
+#include 
+
+TRACE_EVENT(perf_hrtimer,
+   TP_PROTO(struct pt_regs *regs, struct perf_event *event),
+
+   TP_ARGS(regs, event),
+
+   TP_STRUCT__entry(
+   __field(struct pt_regs *, regs)
+   __field(struct perf_event *, event)
+   ),
+
+   TP_fast_assign(
+   __entry->regs = regs;
+   __entry->event = event;
+   ),
+
+   TP_printk("regs=%p evt=%p", __entry->regs, __entry->event)
+);
+#endif /* _TRACE_PERF_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 356a6c7..78bce19 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,9 @@
 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include 
+
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -8062,6 +8065,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct 
hrtimer *hrtimer)
perf_sample_data_init(&data, 0, event->hw.last_period);
regs = get_irq_regs();
 
+   trace_perf_hrtimer(regs, event);
+
if (regs && !perf_exclude_event(event, regs)) {
if (!(event->attr.exclude_idle && is_idle_task(current)))
if (__perf_event_overflow(event, 1, &data, regs))
-- 
2.7.4



[PATCH v2 3/3] perf script: add bpf-output field to usage message

2016-08-02 Thread Brendan Gregg
This adds the bpf-output field to the perf script usage message, and docs.

Signed-off-by: Brendan Gregg 
Cc: Wang Nan 
---
 tools/perf/Documentation/perf-script.txt | 4 ++--
 tools/perf/builtin-script.c  | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/perf/Documentation/perf-script.txt 
b/tools/perf/Documentation/perf-script.txt
index 1f6c705..053bbbd 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -116,8 +116,8 @@ OPTIONS
 --fields::
 Comma separated list of fields to print. Options are:
 comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
-   srcline, period, iregs, brstack, brstacksym, flags.
-Field list can be prepended with the type, trace, sw or hw,
+srcline, period, iregs, brstack, brstacksym, flags, bpf-output,
+callindent. Field list can be prepended with the type, trace, sw or hw,
 to indicate to which event type the field list applies.
 e.g., -F sw:comm,tid,time,ip,sym  and -F trace:time,cpu,trace
 
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 971ff91..9c640a8 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2116,7 +2116,7 @@ int cmd_script(int argc, const char **argv, const char 
*prefix __maybe_unused)
 "Valid types: hw,sw,trace,raw. "
 "Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
 "addr,symoff,period,iregs,brstack,brstacksym,flags,"
-"callindent", parse_output_fields),
+"bpf-output,callindent", parse_output_fields),
OPT_BOOLEAN('a', "all-cpus", &system_wide,
"system-wide collection from all CPUs"),
OPT_STRING('S', "symbols", &symbol_conf.sym_list_str, 
"symbol[,symbol...]",
-- 
2.7.4



[PATCH v2 0/3] Add a tracepoint for BPF perf sampling

2016-08-02 Thread Brendan Gregg
This patchset adds a tracepoint for perf sampling, perf:perf_hrtimer, and
includes a complete example in samples/bpf for using it to frequency
count sampled instruction pointers in a BPF map.

Signed-off-by: Brendan Gregg 
---
Changes in v2:
- added samples/bpf/sampleip* example for perf:perf_hrtimer
- added bpf-output to perf script usage message

Brendan Gregg (3):
  perf/core: Add a tracepoint for perf sampling
  samples/bpf: Add a sampling BPF example
  perf script: add bpf-output field to usage message

 include/trace/events/perf.h  |  29 +
 kernel/events/core.c |   5 +
 samples/bpf/Makefile |   4 +
 samples/bpf/sampleip_kern.c  |  48 
 samples/bpf/sampleip_user.c  | 189 +++
 tools/perf/Documentation/perf-script.txt |   4 +-
 tools/perf/builtin-script.c  |   2 +-
 7 files changed, 278 insertions(+), 3 deletions(-)
 create mode 100644 include/trace/events/perf.h
 create mode 100644 samples/bpf/sampleip_kern.c
 create mode 100644 samples/bpf/sampleip_user.c

-- 
2.7.4



Re: [PATCH] perf/core: Add a tracepoint for perf sampling

2016-07-29 Thread Brendan Gregg
On Fri, Jul 29, 2016 at 12:21 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Tue, Jul 19, 2016 at 11:20:48PM +0000, Brendan Gregg escreveu:
>> When perf is performing hrtimer-based sampling, this tracepoint can be used
>> by BPF to run additional logic on each sample. For example, BPF can fetch
>> stack traces and frequency count them in kernel context, for an efficient
>> profiler.
>
> Could you provide a complete experience? I.e. together with this patch a
> bpf script that could then run, with the full set of steps needed to
> show it in use.

There's currently profile.py, in bcc, which will either use this
tracepoint or use a kprobe if it doesn't exist (although the kprobe is
unreliable). profile samples stack traces and shows stack traces with
their occurrence counts. Eg:

# ./profile
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C
81189249 filemap_map_pages
811bd3f5 handle_mm_fault
81065990 __do_page_fault
81065caf do_page_fault
817ce228 page_fault
7fed989afcc0 [unknown]
-cp (9036)
1
[...]

8105eb66 native_safe_halt
8103659e default_idle
81036d1f arch_cpu_idle
810bba5a default_idle_call
810bbd07 cpu_startup_entry
817bf4a7 rest_init
81d65f58 start_kernel
81d652db x86_64_start_reservations
81d65418 x86_64_start_kernel
-swapper/0 (0)
72

8105eb66 native_safe_halt
8103659e default_idle
81036d1f arch_cpu_idle
810bba5a default_idle_call
810bbd07 cpu_startup_entry
8104df55 start_secondary
-swapper/1 (0)
75

Tool and examples are on github [1][2]. Is this sufficient for this
patch? If not, I could rewrite something for samples/bpf (eg, an IP
sampler, or a task priority sampler), which I may do anyway as a
follow-on if they turned out to be nice examples.

>
> Also, what would be the value when BPF is not used?
>

No big reason comes to mind. I could imagine it might be useful when
debugging perf's sampling behavior, and there might be uses with
ftrace as well. But the big reason is extending perf's existing
sampling capabilities for in-kernel frequency counts of stack traces
(which could include custom BPF-based stack walkers), IP, task
priority, etc. Thanks,

Brendan

[1] https://github.com/iovisor/bcc/blob/master/tools/profile.py
[2] https://github.com/iovisor/bcc/blob/master/tools/profile_example.txt


Re: [PATCH] perf/core: Add a tracepoint for perf sampling

2016-07-29 Thread Brendan Gregg
On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg  wrote:
> When perf is performing hrtimer-based sampling, this tracepoint can be used
> by BPF to run additional logic on each sample. For example, BPF can fetch
> stack traces and frequency count them in kernel context, for an efficient
> profiler.

Any comments on this patch? Thanks,

Brendan

>
> Signed-off-by: Brendan Gregg 
> Cc: Alexei Starovoitov 
> Cc: Wang Nan 
> ---
>  include/trace/events/perf.h | 29 +
>  kernel/events/core.c|  5 +
>  2 files changed, 34 insertions(+)
>  create mode 100644 include/trace/events/perf.h
>
> diff --git a/include/trace/events/perf.h b/include/trace/events/perf.h
> new file mode 100644
> index 000..461770d
> --- /dev/null
> +++ b/include/trace/events/perf.h
> @@ -0,0 +1,29 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM perf
> +
> +#if !defined(_TRACE_PERF_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_PERF_H
> +
> +#include 
> +
> +TRACE_EVENT(perf_hrtimer,
> +   TP_PROTO(struct pt_regs *regs, struct perf_event *event),
> +
> +   TP_ARGS(regs, event),
> +
> +   TP_STRUCT__entry(
> +   __field(struct pt_regs *, regs)
> +   __field(struct perf_event *, event)
> +   ),
> +
> +   TP_fast_assign(
> +   __entry->regs = regs;
> +   __entry->event = event;
> +   ),
> +
> +   TP_printk("regs=%p evt=%p", __entry->regs, __entry->event)
> +);
> +#endif /* _TRACE_PERF_H */
> +
> +/* This part must be outside protection */
> +#include 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 79dae18..0d843a7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -51,6 +51,9 @@
>
>  #include 
>
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  typedef int (*remote_function_f)(void *);
>
>  struct remote_function_call {
> @@ -8036,6 +8039,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct 
> hrtimer *hrtimer)
> perf_sample_data_init(&data, 0, event->hw.last_period);
> regs = get_irq_regs();
>
> +   trace_perf_hrtimer(regs, event);
> +
> if (regs && !perf_exclude_event(event, regs)) {
> if (!(event->attr.exclude_idle && is_idle_task(current)))
> if (__perf_event_overflow(event, 1, &data, regs))
> --
> 2.7.4
>


[PATCH] perf/core: Add a tracepoint for perf sampling

2016-07-19 Thread Brendan Gregg
When perf is performing hrtimer-based sampling, this tracepoint can be used
by BPF to run additional logic on each sample. For example, BPF can fetch
stack traces and frequency count them in kernel context, for an efficient
profiler.

Signed-off-by: Brendan Gregg 
Cc: Alexei Starovoitov 
Cc: Wang Nan 
---
 include/trace/events/perf.h | 29 +
 kernel/events/core.c|  5 +
 2 files changed, 34 insertions(+)
 create mode 100644 include/trace/events/perf.h

diff --git a/include/trace/events/perf.h b/include/trace/events/perf.h
new file mode 100644
index 000..461770d
--- /dev/null
+++ b/include/trace/events/perf.h
@@ -0,0 +1,29 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM perf
+
+#if !defined(_TRACE_PERF_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PERF_H
+
+#include 
+
+TRACE_EVENT(perf_hrtimer,
+   TP_PROTO(struct pt_regs *regs, struct perf_event *event),
+
+   TP_ARGS(regs, event),
+
+   TP_STRUCT__entry(
+   __field(struct pt_regs *, regs)
+   __field(struct perf_event *, event)
+   ),
+
+   TP_fast_assign(
+   __entry->regs = regs;
+   __entry->event = event;
+   ),
+
+   TP_printk("regs=%p evt=%p", __entry->regs, __entry->event)
+);
+#endif /* _TRACE_PERF_H */
+
+/* This part must be outside protection */
+#include 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 79dae18..0d843a7 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -51,6 +51,9 @@
 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include 
+
 typedef int (*remote_function_f)(void *);
 
 struct remote_function_call {
@@ -8036,6 +8039,8 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct 
hrtimer *hrtimer)
perf_sample_data_init(&data, 0, event->hw.last_period);
regs = get_irq_regs();
 
+   trace_perf_hrtimer(regs, event);
+
if (regs && !perf_exclude_event(event, regs)) {
if (!(event->attr.exclude_idle && is_idle_task(current)))
if (__perf_event_overflow(event, 1, &data, regs))
-- 
2.7.4



Re: perf bpf examples

2016-07-08 Thread Brendan Gregg
On Fri, Jul 8, 2016 at 3:46 AM, Wangnan (F)  wrote:
>
>
> On 2016/7/8 15:57, Brendan Gregg wrote:
>>
[...]
>> I mean just an -F99 that executes a BPF program on each sample. My
>> most common use for perf is:
>>
>> perf record -F 99 -a -g -- sleep 30
>> perf report (or perf script, for making flame graphs)
>>
>> But this uses perf.data as an intermediate file. With the recent
>> BPF_MAP_TYPE_STACK_TRACE, we could frequency count stack traces in
>> kernel context, and just dump a report. Much more efficient. And
>> improving a very common perf one-liner.
>
>
> You can't attach BPF script to samples other than kprobe and tracepoints.
> When you use 'perf record -F99 -a -g -- sleep 30', you are sampling on
> 'cycles:ppp' event. This is a hardware PMU event.

Sure, either cycles:ppp or cpu-clock (my Xen guests have no PMU,
sadly). But These are ultimately calling perf_swevent_hrtimer()/etc,
so I was wondering if someone was already looking at enhancing this
code to support BPF? Ie, BPF should be able to attach to kprobes,
uprobes, tracepoints, and timer-based samples.

> If we find a kprobe or tracepoint event which would be triggered 99 times
> in each second, we can utilize BPF_MAP_TYPE_STACK_TRACE and
> bpf_get_stackid().

Yes, that should be a workaround. It's annoying as some like
perf_swevent_hrtimer() can't be kprobed (inlined?), but I found
perf_misc_flags(struct pt_regs *regs) was called, but passing in that
regs to bpf_get_stackid() was returning "type=inv expected=ctx"
errors, despite casting. I'm guessing the BPF ctx type is special and
can't be casted, but need to dig more.

Brendan


Re: perf bpf examples

2016-07-08 Thread Brendan Gregg
On Thu, Jul 7, 2016 at 9:18 PM, Wangnan (F)  wrote:
>
>
> On 2016/7/8 1:58, Brendan Gregg wrote:
>>
>> On Thu, Jul 7, 2016 at 10:54 AM, Brendan Gregg
>>  wrote:
>>>
>>> On Wed, Jul 6, 2016 at 6:49 PM, Wangnan (F)  wrote:
[...]
>> ... Also, has anyone looked into perf sampling (-F 99) with bpf yet?
>> Thanks,
>
>
> Theoretically, BPF program is an additional filter to
> decide whetier an event should be filtered out or pass to perf. -F 99
> is another filter, which drops samples to ensure the frequence.
> Filters works together. The full graph should be:
>
>  BPF --> traditional filter --> proc (system wide of proc specific) -->
> period
>
> See the example at the end of this mail. The BPF program returns 0 for half
> of
> the events, and the result should be symmetrical. We can get similar result
> without
> -F:
>
> # ~/perf record -a --clang-opt '-DCATCH_ODD' -e ./sampling.c dd if=/dev/zero
> of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 11.9908 s, 358 MB/s
> [ perf record: Woken up 28 times to write data ]
> [ perf record: Captured and wrote 303.915 MB perf.data (4194449 samples) ]
> #
> root@wn-Lenovo-Product:~# ~/perf record -a --clang-opt '-DCATCH_EVEN' -e
> ./sampling.c dd if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 12.1154 s, 355 MB/s
> [ perf record: Woken up 54 times to write data ]
> [ perf record: Captured and wrote 303.933 MB perf.data (4194347 samples) ]
>
>
> With -F99 added:
>
> # ~/perf record -F99 -a --clang-opt '-DCATCH_ODD' -e ./sampling.c dd
> if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 9.60126 s, 447 MB/s
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.402 MB perf.data (35 samples) ]
> # ~/perf record -F99 -a --clang-opt '-DCATCH_EVEN' -e ./sampling.c dd
> if=/dev/zero of=/dev/null count=8388480
> 8388480+0 records in
> 8388480+0 records out
> 4294901760 bytes (4.3 GB) copied, 9.76719 s, 440 MB/s
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.399 MB perf.data (37 samples) ]

That looks like it's doing two different things: -F99, and a
sampling.c script (SEC("func=sys_read")).

I mean just an -F99 that executes a BPF program on each sample. My
most common use for perf is:

perf record -F 99 -a -g -- sleep 30
perf report (or perf script, for making flame graphs)

But this uses perf.data as an intermediate file. With the recent
BPF_MAP_TYPE_STACK_TRACE, we could frequency count stack traces in
kernel context, and just dump a report. Much more efficient. And
improving a very common perf one-liner.

Brendan


Re: [PATCH perf/core v12 00/16] perf-probe --cache and SDT support

2016-06-24 Thread Brendan Gregg
On Fri, Jun 24, 2016 at 2:05 AM, Masami Hiramatsu  wrote:
> Hi,
>
> Here is the 12th version of the patchset for probe-cache and
> initial SDT support.
>
> Here is the previous v11: https://lkml.org/lkml/2016/6/14/1041
>
> In this version I just rename strlist__for_each to
> strlist__for_each_entry, and change some patch description.
>

I tested it on Node.js (and applied to acme's perf/core), and it works. Thanks!

# perf script
node 36750 [006]  3080.761533:
sdt_node:http__server__request: (da8b0c)
node 36750 [006]  3081.368771:
sdt_node:http__server__request: (da8b0c)
node 36750 [006]  3081.592925:
sdt_node:http__server__request: (da8b0c)
node 36750 [006]  3081.976802:
sdt_node:http__server__request: (da8b0c)
node 36750 [006]  3082.296709:
sdt_node:http__server__request: (da8b0c)

Since it doesn't set the is-enabled semaphore yet, I had to do that
manually for the probes that use it. I know that will be a latter
addition.

# perf list | grep sdt
  sdt_node:http__client__request [Tracepoint event]
  sdt_node:http__server__request [Tracepoint event]
  sdt_node:gc__done  [SDT event]
  sdt_node:gc__start [SDT event]
  sdt_node:http__client__request [SDT event]
  sdt_node:http__client__response[SDT event]
  sdt_node:http__server__request [SDT event]
  sdt_node:http__server__response[SDT event]
  sdt_node:net__server__connection   [SDT event]
  sdt_node:net__stream__end  [SDT event]

It's also a bit weird to see these listed twice, but I understand
what's happening. If this continues to prove confusing, I guess later
on we could change it to exclude listing SDT events that have been
promoted to Tracepoint events, or not list SDT events in "perf list"
at all, leaving them for "perf probe --cache --list".

Anyway, this is great as is, and thanks again.

Brendan


> Thank you,
>
> ---
>
> Hemant Kumar (1):
>   perf/sdt: ELF support for SDT
>
> Masami Hiramatsu (15):
>   perf probe: Use cache entry if possible
>   perf probe: Show all cached probes
>   perf probe: Remove caches when --cache is given
>   perf probe: Add group name support
>   perf buildid-cache: Scan and import user SDT events to probe cache
>   perf probe: Accept %sdt and %cached event name
>   perf-list: Show SDT and pre-cached events
>   perf-list: Skip SDTs placed in invalid binaries
>   perf: probe-cache: Add for_each_probe_cache_entry() wrapper
>   perf probe: Allow wildcard for cached events
>   perf probe: Search SDT/cached event from all probe caches
>   perf probe: Support @BUILDID or @FILE suffix for SDT events
>   perf probe: Support a special SDT probe format
>   perf build: Add sdt feature detection
>   perf-test: Add a test case for SDT event
>
>
>  tools/perf/Documentation/perf-buildid-cache.txt |3
>  tools/perf/Documentation/perf-probe.txt |   30 +-
>  tools/perf/Makefile.perf|3
>  tools/perf/builtin-list.c   |6
>  tools/perf/builtin-probe.c  |   31 ++
>  tools/perf/config/Makefile  |   10 +
>  tools/perf/tests/Build  |1
>  tools/perf/tests/builtin-test.c |4
>  tools/perf/tests/make   |3
>  tools/perf/tests/sdt.c  |  115 +++
>  tools/perf/tests/tests.h|1
>  tools/perf/util/build-id.c  |  212 +
>  tools/perf/util/build-id.h  |4
>  tools/perf/util/parse-events.c  |   82 +
>  tools/perf/util/parse-events.h  |2
>  tools/perf/util/probe-event.c   |  366 
> +--
>  tools/perf/util/probe-event.h   |1
>  tools/perf/util/probe-file.c|  226 +-
>  tools/perf/util/probe-file.h|   24 +-
>  tools/perf/util/symbol-elf.c|  252 
>  tools/perf/util/symbol.h|   22 +
>  21 files changed, 1333 insertions(+), 65 deletions(-)
>  create mode 100644 tools/perf/tests/sdt.c
>
> --
> Masami Hiramatsu


Re: [PATCH 10/10] perf script: Add stackcollapse.py script

2016-06-21 Thread Brendan Gregg
On Mon, Jun 20, 2016 at 3:23 PM, Arnaldo Carvalho de Melo
 wrote:
> From: Paolo Bonzini 
>
> Add stackcollapse.py script as an example of parsing call chains, and
> also of using optparse to access command line options.
>
> The flame graph tools include a set of scripts that parse output from
> various tools (including "perf script"), remove the offsets in the
> function and collapse each stack to a single line.  The website also
> says "perf report could have a report style [...] that output folded
> stacks directly, obviating the need for stackcollapse-perf.pl", so here
> it is.
>
> This script is a Python rewrite of stackcollapse-perf.pl, using the perf
> scripting interface to access the perf data directly from Python.
>
> Signed-off-by: Paolo Bonzini 
> Acked-by: Jiri Olsa 
> Cc: Brendan Gregg 
> Link: 
> http://lkml.kernel.org/r/1460467573-22989-1-git-send-email-pbonz...@redhat.com
> Signed-off-by: Arnaldo Carvalho de Melo 
> ---
>  tools/perf/scripts/python/bin/stackcollapse-record |   8 ++
>  tools/perf/scripts/python/bin/stackcollapse-report |   3 +
>  tools/perf/scripts/python/stackcollapse.py | 127 
> +
>  3 files changed, 138 insertions(+)
>  create mode 100755 tools/perf/scripts/python/bin/stackcollapse-record
>  create mode 100755 tools/perf/scripts/python/bin/stackcollapse-report
>  create mode 100755 tools/perf/scripts/python/stackcollapse.py
>
> diff --git a/tools/perf/scripts/python/bin/stackcollapse-record 
> b/tools/perf/scripts/python/bin/stackcollapse-record
> new file mode 100755
> index ..9d8f9f0f3a17
> --- /dev/null
> +++ b/tools/perf/scripts/python/bin/stackcollapse-record
> @@ -0,0 +1,8 @@
> +#!/bin/sh
> +
> +#
> +# stackcollapse.py can cover all type of perf samples including
> +# the tracepoints, so no special record requirements, just record what
> +# you want to analyze.
> +#
> +perf record "$@"
> diff --git a/tools/perf/scripts/python/bin/stackcollapse-report 
> b/tools/perf/scripts/python/bin/stackcollapse-report
> new file mode 100755
> index ..356b9656393d
> --- /dev/null
> +++ b/tools/perf/scripts/python/bin/stackcollapse-report
> @@ -0,0 +1,3 @@
> +#!/bin/sh
> +# description: produce callgraphs in short form for scripting use
> +perf script -s "$PERF_EXEC_PATH"/scripts/python/stackcollapse.py -- "$@"
> diff --git a/tools/perf/scripts/python/stackcollapse.py 
> b/tools/perf/scripts/python/stackcollapse.py
> new file mode 100755
> index ..a2dfcda41ae6
> --- /dev/null
> +++ b/tools/perf/scripts/python/stackcollapse.py
> @@ -0,0 +1,127 @@
> +#!/usr/bin/perl -w

Perl? I guess this line is ignored when invoked.

Rest looks good. Glad it supports tidy_java code, and kernel annotations.

Brendan

> +#
> +# stackcollapse.py - format perf samples with one line per distinct call 
> stack
> +#
> +# This script's output has two space-separated fields.  The first is a 
> semicolon
> +# separated stack including the program name (from the "comm" field) and the
> +# function names from the call stack.  The second is a count:
> +#
> +#  swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 2
> +#
> +# The file is sorted according to the first field.
> +#
> +# Input may be created and processed using:
> +#
> +#  perf record -a -g -F 99 sleep 60
> +#  perf script report stackcollapse > out.stacks-folded
> +#
> +# (perf script record stackcollapse works too).
> +#
> +# Written by Paolo Bonzini 
> +# Based on Brendan Gregg's stackcollapse-perf.pl script.
> +
> +import os
> +import sys
> +from collections import defaultdict
> +from optparse import OptionParser, make_option
> +
> +sys.path.append(os.environ['PERF_EXEC_PATH'] + \
> +'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
> +
> +from perf_trace_context import *
> +from Core import *
> +from EventClass import *
> +
> +# command line parsing
> +
> +option_list = [
> +# formatting options for the bottom entry of the stack
> +make_option("--include-tid", dest="include_tid",
> + action="store_true", default=False,
> + help="include thread id in stack"),
> +make_option("--include-pid", dest="include_pid",
> + action="store_true", default=False,
> + help="include process id in stack"),
> +make_option("--no-comm", dest="include_comm",
> + action="store_false", default=True,
> + help="do not separa

Re: [GIT PULL 00/10] perf/core improvements and fixes

2016-06-20 Thread Brendan Gregg
On Mon, Jun 20, 2016 at 7:18 PM, Namhyung Kim  wrote:
>
> Hi Arnaldo,
>
> On Mon, Jun 20, 2016 at 07:23:20PM -0300, Arnaldo Carvalho de Melo wrote:
> > Hi Ingo,
> >
> >   Please consider pulling,
> >
> > - Arnaldo
> >
> > The following changes since commit 02469a95096a549508c5adf61d84a1d72851c85b:
> >
> >   Merge tag 'perf-core-for-mingo-20160615' of 
> > git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
> > (2016-06-16 10:27:35 +0200)
> >
> > are available in the git repository at:
> >
> >   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> > tags/perf-core-for-mingo-20160620
> >
> > for you to fetch changes up to 2a0a7c72702bac1b87cd4d49196a334483386fab:
> >
> >   perf script: Add stackcollapse.py script (2016-06-20 17:50:39 -0300)
> >
> > 
> > perf/core improvements and fixes:
> >
> > New features:
> >
> > . Add --dry-run option to 'perf record' to check if command line options 
> > can be
> >   parsed, but not doing any recording (Wang Nan)
> >
> > . Allow dumping the object files generated by llvm when processing eBPF
> >   scriptlet events (Wang Nan)
> >
> > - Add stackcollapse.py script to help generating flame graphs (Paolo 
> > Bonzini)
>
> I think this is already done by '-g folded'.  Please see:
>
>   http://www.brendangregg.com/blog/2016-04-30/linux-perf-folded.html
>

Pretty much. Two similar solutions were developed around the same
time. Although I have to use some awk to get "perf -g folded" in the
exact right format, and stackcollapse-perf.py does that directly.

Brendan

Brendan Gregg, Senior Performance Architect, Netflix


Re: BPF runtime for systemtap

2016-06-14 Thread Brendan Gregg
On Tue, Jun 14, 2016 at 1:06 PM, Richard Henderson  wrote:
> I'm pleased to be able to announce an initial implementation of an (e)bpf
> backend for systemtap.  For the subset of systemtap probes that can use
> kprobes, we can use a bpf filter instead of loading a kernel module.
>
> As this implementation is young, there are a number of limitations.  Neither
> string nor stats types are supported.  Both require enhancements to the set
> of builtin functions supported in kernel.  The stap bpf loader still needs
> improvement with respect to its use of the event subsystem.
>
> We're using the same intermediate file format that is supported by the llvm
> bpf backend.  I have some improvements to submit for the llvm bpf backend as
> well.
>
> The code can be reviewed at
>
>   git://sourceware.org/git/systemtap.git rth/bpf

Great! Is there a hello world example in there somewhere? I found this:

# ./stapbpf/stapbpf -h
Usage: ./stapbpf/stapbpf [-v][-w][-V][-h] [-o FILE] 
  -h, --help   Show this help text
  -v, --verboseIncrease verbosity
  -V, --versionShow version
  -w   Suppress warnings
  -o FILE  Send output to FILE

But I didn't see an explicit BPF example or bpf-file. Is it implicit?
Should I be able to run a stap one-liner with some -v's and see it
switches to using BPF, if I restrain myself to what's supported so
far? Eg, since you mentioned kprobes, how about?:

stap -ve 'probe kprobe.function("vfs_fsync") { println(pointer_arg(2)) }'

Brendan


Re: [PATCH] perf/sdt: Directly record cached SDT events

2016-05-02 Thread Brendan Gregg
On Fri, Apr 29, 2016 at 6:40 AM, Hemant Kumar  wrote:
> This patch adds support for directly recording SDT events which are
> present in the probe cache. This patch is based on current SDT
> enablement patchset (v5) by Masami :
> https://lkml.org/lkml/2016/4/27/828
> and it implements two points in the TODO list mentioned in the
> cover note :
> "- (perf record) Support SDT event recording directly"
> "- (perf record) Try to unregister SDT events after record."
>
> Without this patch, we could probe into SDT events using
> "perf probe" and "perf record". With this patch, we can probe
> the SDT events directly using "perf record".
>
> For example :
>
>  # perf list sdt   // List the SDT events
> ...
>   sdt_mysql:update__row__done[SDT event]
>   sdt_mysql:update__row__start   [SDT event]
>   sdt_mysql:update__start[SDT event]
>   sdt_python:function__entry [SDT event]
>   sdt_python:function__return[SDT event]
>   sdt_test:marker1   [SDT event]
>   sdt_test:marker2   [SDT event]
> ...
>
>  # perf record -e %sdt_test:marker1 -e %sdt_test:marker2 -a

Why do we need the '%'? Can't the "sdt_" prefix be sufficient? ie:

# perf record -e sdt_test:marker1 -e sdt_test:marker2 -a

I find it a bit weird to define it using %sdt_, but then use it using
sdt_. I'd also be inclined to use it for probe creation, ie:

# perf probe -x /lib/libc-2.17.so  sdt_libc:lll_lock_wait_private

That way, the user only learns one way to specify the probe, with the
sdt_ prefix. It's fine if % existed too, but optional.

> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 2.087 MB perf.data (22 samples) ]
>
>  # perf script
> test_sdt 29230 [002] 405550.548017: sdt_test:marker1: (400534)
> test_sdt 29230 [002] 405550.548064: sdt_test:marker2: (40053f)
> test_sdt 29231 [002] 405550.962806: sdt_test:marker1: (400534)
> test_sdt 29231 [002] 405550.962841: sdt_test:marker2: (40053f)
> test_sdt 29232 [001] 405551.379327: sdt_test:marker1: (400534)
>  ...
>
> After invoking "perf record", behind the scenes, it checks whether the
> event specified is an SDT event using the flag '%'. After that, it
> does a lookup of the probe cache to find out the SDT event. If its not
> present, it throws an error. Otherwise, it goes on and writes the event
> into the uprobe_events file and sets up the probe event, trace events,
> etc and starts recording. It also maintains a list of the event names
> that were written to uprobe_events file. After finishing the record
> session, it removes the events from the uprobe_events file using the
> maintained name list.

Does this support semaphore SDT probes (is-enabled)? Those need the
semaphore incremented when enabled, then decremented when disabled.

Brendan


Re: [PATCH/RFC v3] perf core: Allow setting up max frame stack depth via sysctl

2016-04-26 Thread Brendan Gregg
On Tue, Apr 26, 2016 at 2:05 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Tue, Apr 26, 2016 at 01:02:34PM -0700, Brendan Gregg escreveu:
>> On Mon, Apr 25, 2016 at 5:49 PM, Brendan Gregg  
>> wrote:
>> > On Mon, Apr 25, 2016 at 5:47 PM, Arnaldo Carvalho de Melo 
>> >  wrote:
>> >> Em Mon, Apr 25, 2016 at 05:44:00PM -0700, Alexei Starovoitov escreveu:
>> >>> yep :)
>> >>> hopefully Brendan can give it another spin.
>> >>
>> >> Agreed, and I'm calling it a day anyway, Brendan, please consider
>> >> retesting, thanks,
>> >
>> > Will do, thanks!
>>
>> Looks good.
>>
>> I started with max depth = 512, and even that was still truncated, and
>> had to profile again at 1024 to capture the full stacks. Seems to
>> generally match the flame graph I generated with V1, which made me
>> want to check that I'm running the new patch, and am:
>>
>> # grep six_hundred_forty_kb /proc/kallsyms
>> 81c431e0 d six_hundred_forty_kb
>>
>> I was mucking around and was able to get "corrupted callchain.
>> skipping..." errors, but these look to be expected -- that was
>
> Yeah, thanks for testing!
>
> And since you talked about userspace without frame pointers, have you
> played with '--call-graph lbr'?

Not really. Isn't it only 16 levels deep max?

Most of our Linux is Xen guests (EC2), and I'd have to check if the
MSRs are available for LBR (perf record --call-graph lbr ... returns
"The sys_perf_event_open() syscall returned with 95 (Operation not
supported) for event (cpu-clock).", so I'd guess not, although many
other MSRs are exposed).

BTS seemed more promising (deeper stacks), and there's already Xen
support for it (need to boot the Xen host with vpmu=bts, preferably
vpmu=bts,arch for some PMCs as well :).

Brendan


Re: [PATCH/RFC v3] perf core: Allow setting up max frame stack depth via sysctl

2016-04-26 Thread Brendan Gregg
On Mon, Apr 25, 2016 at 5:49 PM, Brendan Gregg
 wrote:
> On Mon, Apr 25, 2016 at 5:47 PM, Arnaldo Carvalho de Melo
>  wrote:
>> Em Mon, Apr 25, 2016 at 05:44:00PM -0700, Alexei Starovoitov escreveu:
>>> On Mon, Apr 25, 2016 at 09:29:28PM -0300, Arnaldo Carvalho de Melo wrote:
>>> > Em Mon, Apr 25, 2016 at 05:07:26PM -0700, Alexei Starovoitov escreveu:
>>> > > > +   {
>>> > > > +   .procname   = "perf_event_max_stack",
>>> > > > +   .data   = NULL, /* filled in by handler */
>>> > > > +   .maxlen = sizeof(sysctl_perf_event_max_stack),
>>> > > > +   .mode   = 0644,
>>> > > > +   .proc_handler   = perf_event_max_stack_handler,
>>> > > > +   .extra1 = &zero,
>>> > > > +   },
>>> >
>>> > > you need to define a max value otherwise perf_callchain_entry__sizeof
>>> > > will overflow. Sure it's root only facility, but still not nice.
>>> > > 1M? Anything above 1M stack frames would be insane anyway.
>>> > > The rest looks good. Thanks!
>>> >
>>> > Something else? ;-)
>>>
>>> all looks good to me. Thanks a bunch!
>>
>> Thanks for checking!
>>
>>> > Because we only allocate the callchain percpu data structures when 
>>> > there
>>> > is a user, which allows for changing the max easily, its just a matter
>>> > of having no callchain users at that point.
>>> >
>>> > Reported-and-Tested-by: Brendan Gregg 
>>> > Acked-by: Alexei Starovoitov 
>>>
>>> yep :)
>>> hopefully Brendan can give it another spin.
>>
>> Agreed, and I'm calling it a day anyway, Brendan, please consider
>> retesting, thanks,
>>
>
> Will do, thanks!
>

Looks good.

I started with max depth = 512, and even that was still truncated, and
had to profile again at 1024 to capture the full stacks. Seems to
generally match the flame graph I generated with V1, which made me
want to check that I'm running the new patch, and am:

# grep six_hundred_forty_kb /proc/kallsyms
81c431e0 d six_hundred_forty_kb

I was mucking around and was able to get "corrupted callchain.
skipping..." errors, but these look to be expected -- that was
profiling a binary (bash) that doesn't have frame pointers. Some perf
script -D output:

16 3204735442777 0x18f0d8 [0x2030]: PERF_RECORD_SAMPLE(IP, 0x1):
18134/18134: 0x8118b6a4 period: 1001001 addr: 0
... FP chain: nr:1023
.  0: ff80
.  1: 8118b6a4
.  2: 8118bc47
.  3: 811d8c85
.  4: 811b18f8
.  5: 811b2a55
.  6: 811b5ea0
.  7: 810663c0
.  8: 810666e0
.  9: 817b9d28
. 10: fe00
. 11: 004b45e2
. 12: 610f
. 13: 6110
. 14: 6111
. 15: 6112
. 16: 6113
. 17: 6114
. 18: 6115
. 19: 6116
. 20: 6117
[...]
. 1021: 650b
. 1022: 650c
 ... thread: bash:18134
 .. dso: /lib/modules/4.6.0-rc5-virtual/build/vmlinux
bash 18134 [016]  3204.735442:1001001 cpu-clock:

Brendan


Re: [PATCH/RFC v3] perf core: Allow setting up max frame stack depth via sysctl

2016-04-25 Thread Brendan Gregg
On Mon, Apr 25, 2016 at 5:47 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Mon, Apr 25, 2016 at 05:44:00PM -0700, Alexei Starovoitov escreveu:
>> On Mon, Apr 25, 2016 at 09:29:28PM -0300, Arnaldo Carvalho de Melo wrote:
>> > Em Mon, Apr 25, 2016 at 05:07:26PM -0700, Alexei Starovoitov escreveu:
>> > > > +   {
>> > > > +   .procname   = "perf_event_max_stack",
>> > > > +   .data   = NULL, /* filled in by handler */
>> > > > +   .maxlen = sizeof(sysctl_perf_event_max_stack),
>> > > > +   .mode   = 0644,
>> > > > +   .proc_handler   = perf_event_max_stack_handler,
>> > > > +   .extra1 = &zero,
>> > > > +   },
>> >
>> > > you need to define a max value otherwise perf_callchain_entry__sizeof
>> > > will overflow. Sure it's root only facility, but still not nice.
>> > > 1M? Anything above 1M stack frames would be insane anyway.
>> > > The rest looks good. Thanks!
>> >
>> > Something else? ;-)
>>
>> all looks good to me. Thanks a bunch!
>
> Thanks for checking!
>
>> > Because we only allocate the callchain percpu data structures when 
>> > there
>> > is a user, which allows for changing the max easily, its just a matter
>> > of having no callchain users at that point.
>> >
>> > Reported-and-Tested-by: Brendan Gregg 
>> > Acked-by: Alexei Starovoitov 
>>
>> yep :)
>> hopefully Brendan can give it another spin.
>
> Agreed, and I'm calling it a day anyway, Brendan, please consider
> retesting, thanks,
>

Will do, thanks!

Brendan


> - Arnaldo


Re: [RFC] The Linux Scheduler: a Decade of Wasted Cores Report

2016-04-23 Thread Brendan Gregg
On Sat, Apr 23, 2016 at 11:20 AM, Jeff Merkey  wrote:
>
> Interesting read.
>
> http://www.ece.ubc.ca/~sasha/papers/eurosys16-final29.pdf
>
> "... The Linux kernel scheduler has deficiencies that prevent a
> multicore system from making proper use of all cores for heavily
> multithreaded loads, according to a lecture and paper delivered
> earlier this month at the EuroSys '16 conference in London, ..."
>
> Any plans to incorporate these fixes?

While this paper analyzes and proposes fixes for four bugs, it has
been getting a lot of attention for broader claims about Linux being
fundamentally broken:

"As a central part of resource management, the OS thread scheduler
must maintain the following, simple, invariant: make sure that ready
threads are scheduled on available cores. As simple as it may seem, we
found that this invariant is often broken in Linux. Cores may stay
idle for seconds while ready threads are waiting in runqueues."

Then states that the problems in the Linux scheduler that they found
cause degradations of "13-24% for typical Linux workloads".

Their proof of concept patches are online[1]. I tested them and saw 0%
improvements on the systems I tested, for some simple workloads[2]. I
tested 1 and 2 node NUMA, as that is typical for my employer (Netflix,
and our tens of thousands of Linux instances in the AWS/EC2 cloud),
even though I wasn't expecting any difference on 1 node. I've used
synthetic workloads so far.

I should note I do check run queue latency having hit scheduler bugs
in the past (especially on other kernels) and haven't noticed the
issues they describe, on our systems, for various workloads. I've also
written a new tool for this (runqlat using bcc/BPF[3]) to print run
queue latency as a histogram.

The bugs they found seem real, and their analysis is great (although
using visualizations to find and fix scheduler bugs isn't new), and it
would be good to see these fixed. However, it would also be useful to
double check how widespread these issues really are. I suspect many on
this list can test these patches in different environments.

Have we really had a decade of wasted cores, losing 13-24% for typical
Linux workloads? I don't think it's that widespread, but I'm only
testing one environment.

Brendan

[1] https://github.com/jplozi/wastedcores
[2] https://gist.github.com/brendangregg/588b1d29bcb952141d50ccc0e005fcf8
[3] https://github.com/iovisor/bcc/blob/master/tools/runqlat_example.txt


Re: [PATCHSET 0/9] perf report: Support folded callchain output (v5)

2015-11-12 Thread Brendan Gregg
On Sun, Nov 8, 2015 at 9:45 PM, Namhyung Kim  wrote:
> Hello,
>
> This is what Brendan requested on the perf-users mailing list [1] to
> support FlameGraphs [2] more efficiently.  This patchset adds a few
> more callchain options to adjust the output for it.
>
>  * changes in v5)
>- honor field separator from -t option
>- add support for TUI and GTK
>
>  * changes in v4)
>- add missing doc update
>  - cleanup/fix callchain value print code
>- add Acked-by from Brendan and Jiri
>
>  * changes in v3)
>- put the value before callchains
>  - fix compile error
>
>
> At first, 'folded' output mode was added.  The folded output puts the
> value, a space and all calchain nodes separated by semicolons.  Now it
> only supports --stdio as other UI provides some way of folding and/or
> expanding callchains dynamically.
>
> The value is now can be one of 'percent', 'period', or 'count'.  The
> percent is current default output and the period is the raw number of
> sample periods.  The count is the number of samples for each callchain.
>
> The proposed features of hiding hist lines with '-F none' and showing
> hist info with callchains can be added as later work.
>
> Here's an example:
>
>   $ perf report --no-children --show-nr-samples --stdio -g folded,count
>   ...
> 39.93% 80  swapper  [kernel.vmlinux]  [k] intel_idel
>   57 
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
>   23 
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;...
>

Thanks, I tested it, it works!

It lets me do this:

# ./perf report --no-children -n --stdio -g folded,count -F pid
[...]
  0:swapper
1032 
xen_hypercall_sched_op;default_idle;arch_cpu_idle;default_idle_call;cpu_startup_entry;cpu_bringup_and_idle
134 
xen_hypercall_sched_op;default_idle;arch_cpu_idle;default_idle_call;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;xen_start_kernel
1 
xen_hypercall_xen_version;check_events;__schedule;schedule;schedule_preempt_disabled;cpu_startup_entry;cpu_bringup_and_idle
1 
xen_hypercall_xen_version;check_events;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;xen_start_kernel
   1248:bash
43 
copy_page_range;copy_process;_do_fork;sys_clone;entry_SYSCALL_64_fastpath;__libc_fork;make_child
6 
xen_hypercall_xen_version;check_events;xen_dup_mmap;copy_process;_do_fork;sys_clone;entry_SYSCALL_64_fastpath;__libc_fork;make_child
4 
xen_hypercall_xen_version;check_events;copy_page_range;copy_process;_do_fork;sys_clone;entry_SYSCALL_64_fastpath;__libc_fork;make_child
[...]

This is a parsable call chain summary, and which can be flamegraph'd
after a touch of awk. Later on we can add a "-F none" and "-g pid"
etc, but this patch solves the number one issue of avoiding the
expense of needing to re-aggregate the call chain output (output of
perf script), so I'd be pretty happy to use this instead.

Brendan

>
>   $ perf report --no-children --stdio -g percent
>   ...
> 39.93%  swapper  [kernel.vmlinux]  [k] intel_idel
> |
> ---intel_idle
>cpuidle_enter_state
>cpuidle_enter
>call_cpuidle
>cpu_startup_entry
>|
>|--28.63%-- start_secondary
>|
> --11.30%-- rest_init
>
>
>   $ perf report --no-children --stdio --show-total-period -g period
>   ...
> 39.93%   13018705  swapper  [kernel.vmlinux]  [k] intel_idel
> |
> ---intel_idle
>cpuidle_enter_state
>cpuidle_enter
>call_cpuidle
>cpu_startup_entry
>|
>|--9334403-- start_secondary
>|
> --3684302-- rest_init
>
>
>   $ perf report --no-children --stdio --show-nr-samples -g count
>   ...
> 39.93% 80  swapper  [kernel.vmlinux]  [k] intel_idel
> |
> ---intel_idle
>cpuidle_enter_state
>cpuidle_enter
>call_cpuidle
>cpu_startup_entry
>|
>|--57-- start_secondary
>|
> --23-- rest_init
>
>
> You can get it from 'perf/callchain-fold-v5' branch on my tree:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git
>
> Any comments are welcome, thanks
> Namhyung
>
>
> [1] http://www.spinics.net/lists/linux-perf-users/msg02498.html
> [2] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
>
>
> Namhyung Kim (9):
>   perf report: Support folded callchain mode on --stdio
>   perf callchain: Abstract callchain print function
>   perf callchain: Add count fields to struct callchain_node
>   perf report: Add callchain value option
>   perf hists browser: Factor out hist_browser__show_callchain_list()
>   perf hists browser: Support flat callchains
>   perf hists browser: Support folded c

Re: [PATCHSET 0/4] perf report: Support folded callchain output (v4)

2015-11-03 Thread Brendan Gregg
On Tue, Nov 3, 2015 at 5:54 PM, Namhyung Kim  wrote:
> Hi Brendan,
>
> On Tue, Nov 03, 2015 at 01:33:43PM -0800, Brendan Gregg wrote:
>> On Tue, Nov 3, 2015 at 6:40 AM, Arnaldo Carvalho de Melo
>>  wrote:
>> > Em Tue, Nov 03, 2015 at 09:52:07PM +0900, Namhyung Kim escreveu:
>> >> Hello,
>> >>
>> >> This is what Brendan requested on the perf-users mailing list [1] to
>> >> support FlameGraphs [2] more efficiently.  This patchset adds a few
>> >> more callchain options to adjust the output for it.
>> >>
>> >>  * changes in v4)
>> >>   - add missing doc update
>> >>   - cleanup/fix callchain value print code
>> >>   - add Acked-by from Brendan and Jiri
>> >
>> > Do those Acked-by stand? Things changed, the values moved from the end
>> > of the line to the start, etc.
>> >
>> [...]
>>
>> I'd Ack this change as it's a useful addition. It doesn't quite
>> address the folded-only output, but it's a step in that direction. I
>> think having the value at the start of a line only makes sense for the
>> perf report output containing the hist summary lines, for consistency.
>
> Right, thanks!
>
>
>>
>> Here's how I'd shuffle the output of this patch (ignore word wrap
>> issues with this email):
>>
>> # ./perf report --stdio -g folded,count,caller -F pid | \
>> awk '/^ / { n = $1 }
>> /^[0-9]/ { split(n,a,":"); print a[2] "-" a[1] ";" $2,$1 }'
>> swapper-0;cpu_bringup_and_idle;cpu_startup_entry;default_idle_call;arch_cpu_idle;default_idle;xen_hypercall_sched_op
>> 809
>> swapper-0;xen_start_kernel;x86_64_start_reservations;start_kernel;rest_init;cpu_startup_entry;default_idle_call;arch_cpu_idle;default_idle;xen_hypercall_sched_op
>> 135
>> dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf;check_events;xen_hypercall_xen_version
>> 63
>> dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf
>> 54
>> dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf;memset_erms
>> 3
>> dd-30551;xen_irq_enable_direct_end;check_events;xen_hypercall_xen_version 3
>>
>> So the output is folded stacks, prefixed by comm-PID. Shuffling the
>> summarized output is a lot better than doing a "perf script" dump and
>> re-processing call chains. (Note that since I'm using -F, I didn't
>> need --no-children;
>
> Nope.  The '-F pid' doesn't affect --children.  It doesn't show the
> children overhead column but we still have hist entries for
> (synthesized) children..
>
>   $ perf report --no-children | wc -l
>   998
>
>   $ perf report --no-children -F pid,dso,sym | wc -l
>   998
>
>   $ perf report --children | wc -l
>   3229
>
>   $ perf report --children -F pid,dso,sym | wc -l
>   3202
>
> So I think you still need to use --no-children (or set report.children
> config variable to false) for your script.

Ok, good to know, thanks.

>
>
>> and with "-g count", I didn't need --show-nr-samples.)
>
> Yes, I used -n/--show-nr-samples just to check the number is correct.
>
>
>>
>> I notice the fields (-F) option already has this precedent:
>>
>> - "comm": prints PID:comm
>> - "pid": prints PID
>
> It's opposite:  "comm" prints comm, "pid" prints PID:comm. :)

Ah, right, sorry, I'd typed those the wrong way around. :)

>
>
>>
>> If these were added to -g, along with a no-hists, then the two types
>> of folded-only output could be generated using:
>>
>> perf report --stdio -g folded,count,comm,no-hists,caller
>> perf report --stdio -g folded,count,pid,no-hists,caller
>
> As I said, using fields like comm, pid requires to have same keys in
> --sort option.  So it's basically unreliable to use those specific
> field names in the -g option IMHO.  I suggested to use 'info' (yes, it
> needs better name) to print all sort keys.
>
>
>>
>> ... although "no-hists" doesn't hit me as intuitive. How about "-F
>> none" to specify zero columns? ie:
>>
>> perf report --stdio -g folded,count,comm,caller -F none
>> perf report --stdio -g folded,count,pid,caller -F none
>
> Ah, makes sense.  So it'd look like
>
>   $ perf report --stdi

Re: [PATCHSET 0/4] perf report: Support folded callchain output (v4)

2015-11-03 Thread Brendan Gregg
On Tue, Nov 3, 2015 at 6:40 AM, Arnaldo Carvalho de Melo
 wrote:
> Em Tue, Nov 03, 2015 at 09:52:07PM +0900, Namhyung Kim escreveu:
>> Hello,
>>
>> This is what Brendan requested on the perf-users mailing list [1] to
>> support FlameGraphs [2] more efficiently.  This patchset adds a few
>> more callchain options to adjust the output for it.
>>
>>  * changes in v4)
>>   - add missing doc update
>>   - cleanup/fix callchain value print code
>>   - add Acked-by from Brendan and Jiri
>
> Do those Acked-by stand? Things changed, the values moved from the end
> of the line to the start, etc.
>
[...]

I'd Ack this change as it's a useful addition. It doesn't quite
address the folded-only output, but it's a step in that direction. I
think having the value at the start of a line only makes sense for the
perf report output containing the hist summary lines, for consistency.

Here's how I'd shuffle the output of this patch (ignore word wrap
issues with this email):

# ./perf report --stdio -g folded,count,caller -F pid | \
awk '/^ / { n = $1 }
/^[0-9]/ { split(n,a,":"); print a[2] "-" a[1] ";" $2,$1 }'
swapper-0;cpu_bringup_and_idle;cpu_startup_entry;default_idle_call;arch_cpu_idle;default_idle;xen_hypercall_sched_op
809
swapper-0;xen_start_kernel;x86_64_start_reservations;start_kernel;rest_init;cpu_startup_entry;default_idle_call;arch_cpu_idle;default_idle;xen_hypercall_sched_op
135
dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf;check_events;xen_hypercall_xen_version
63
dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf
54
dd-30551;__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf;memset_erms
3
dd-30551;xen_irq_enable_direct_end;check_events;xen_hypercall_xen_version 3

So the output is folded stacks, prefixed by comm-PID. Shuffling the
summarized output is a lot better than doing a "perf script" dump and
re-processing call chains. (Note that since I'm using -F, I didn't
need --no-children; and with "-g count", I didn't need
--show-nr-samples.)

I notice the fields (-F) option already has this precedent:

- "comm": prints PID:comm
- "pid": prints PID

If these were added to -g, along with a no-hists, then the two types
of folded-only output could be generated using:

perf report --stdio -g folded,count,comm,no-hists,caller
perf report --stdio -g folded,count,pid,no-hists,caller

... although "no-hists" doesn't hit me as intuitive. How about "-F
none" to specify zero columns? ie:

perf report --stdio -g folded,count,comm,caller -F none
perf report --stdio -g folded,count,pid,caller -F none

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 3/3] perf tools: Defaults to 'caller' callchain order only if --children is enabled

2015-11-02 Thread Brendan Gregg
On Thu, Oct 22, 2015 at 7:03 AM, Arnaldo Carvalho de Melo
 wrote:
> Em Thu, Oct 22, 2015 at 02:49:11AM -0700, Brendan Gregg escreveu:
>> On Thu, Oct 22, 2015 at 12:38 AM, Namhyung Kim  wrote:
>> > Hi Ingo,
>> >
>> > On Thu, Oct 22, 2015 at 4:32 PM, Ingo Molnar  wrote:
>> >>
>> >> * Namhyung Kim  wrote:
>> >>
>> >>> The caller callchain order is useful with --children option since it can
>> >>> show 'overview' style output, but other commands which don't use
>> >>>  --children feature like 'perf script' or even 'perf report/top' without
>> >>>  --children are better to keep caller order.
>> >
>> > Oops, there's a typo:  s/caller order/callee order/ :)
>>
>> Thanks, I was wondering about that (and I've made that typo myself).
>> Thanks for the patch!
>
> Brendan, can I take that as an Acked-by?

Belated, yes!

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/4] perf report: Support folded callchain output (v2)

2015-11-02 Thread Brendan Gregg
On Mon, Nov 2, 2015 at 2:12 PM, Namhyung Kim  wrote:
> Hi Arnaldo,
>
> On Mon, Nov 02, 2015 at 06:30:21PM -0300, Arnaldo Carvalho de Melo wrote:
>> Em Mon, Nov 02, 2015 at 12:37:28PM -0800, Brendan Gregg escreveu:
>> > G'Day Namhyung,
>> >
>> > On Mon, Nov 2, 2015 at 4:57 AM, Namhyung Kim  wrote:
>> > > Hello,
>> > >
>> > > This is what Brendan requested on the perf-users mailing list [1] to
>> > > support FlameGraphs [2] more efficiently.  This patchset adds a few
>> > > more callchain options to adjust the output for it.
>> > >
>> > > At first, 'folded' output mode was added.  The folded output puts all
>> > > calchain nodes in a line separated by semicolons, a space and the
>> > > value.  Now it only supports --stdio as other UI provides some way of
>> > > folding/expanding callchains dynamically.
>> > >
>> > > The value is now can be one of 'percent', 'period', or 'count'.  The
>> > > percent is current default output and the period is the raw number of
>> > > sample periods.  The count is the number of samples for each callchain.
>> > >
>> > > Here's an example:
>> > >
>> > >   $ perf report --no-children --show-nr-samples --stdio -g folded,count
>> > >   ...
>> > > 39.93% 80  swapper  [kernel.vmlinux]  [k] intel_idel
>> > >   
>> > > intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
>> > >  57
>> > >   
>> > > intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;...
>> > >  23
>> >
>> > Thanks!
>> >
>> > So for the folded output I don't need the summary line (the row of
>> > columns printed by hist_entry__snprintf()), and don't need anything
>> > except folded stacks and the counts. If working with the existing
>> > stdio interface is making it harder than it needs to be, might it be
>>
>> I don't think it so, just add some flag asking for that
>> hist_entry__snprintf() to be supressed, ideas for a long option name?
>>
>> Having it as Namhyung did may have value for some people as a more
>> compact way to show the callchains together with the hist_entry line.
>
> Yeah, I'd keep the hist entry line unless it's too hard to
> parse/filter.  IMHO it's just a way to show callchains, so no need to
> have separate output mode..

Ok, good point, it can be thought of as a different stack representation format.

>
> Brendan, I guess you still need to know other info like cpu or pid, no?
>

Yes, I just realized that I either include the process name (Command
column) or name-PID, as the first folded element. Eg, output can be:

mkdir;getopt_long;page_fault;do_page_fault;__do_page_fault;filemap_map_pages 3

Or:

mkdir-21918;getopt_long;page_fault;do_page_fault;__do_page_fault;filemap_map_pages
2

Usually the first, but sometimes it's helpful to split on PID as well.

As for what to call such options (which may be a follow on patch
anyway) ... maybe something like:

"folded": fold stacks as single lines
"nameonly,folded": suppress summary line and include process name in
the folded stack
"pidonly,folded": suppress summary line and include process_name-PID
in the folded stack

> And I feel like it'd be better to put the count before the callchains
> for consistency like below.  Is it OK to you?
>
>   $ perf report --no-children --show-nr-samples --stdio -g folded,count
>   ...
> 39.93% 80  swapper  [kernel.vmlinux]  [k] intel_idel
>   57 
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
>   23 
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;...
>

If it was printing with the perf report summary, sure, but if we have
a way to only emit folded output, then counts last would be perfect
and maybe a bit more intuitive (key then value).

>
>>
>> With this in mind, do you have any other issues with Namhyung's
>> patchkit? An acked-by/tested-by you would be nice to have, and then we
>> could work out the new option to suppress that hist_entry__snprintf()
>> in a follow up patch.

Acked and tested, yes.

Looks like I'd be using caller ordering, eg, to get lines like this:

__GI___libc_read;entry_SYSCALL_64_fastpath;sys_read;vfs_read;__vfs_read;urandom_read;extract_entropy_user;extract_buf;check_events;xen_hypercall_xen_version
91

Which I can do just by using "-g folded,count,caller".

>>
>> > easier to make it a separate interface (ui/folded), that just emitted
>> > the folded output? Just an idea. This existing patchset is working for
>> > me, I'd just be filtering the output.
>> >
>> > Having the option for percentages and periods is nice. I can envisage
>> > using periods (for latency flame graphs).
>
> Glad to see you like it. :)
>
> Thanks,
> Namhyung
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 0/4] perf report: Support folded callchain output (v2)

2015-11-02 Thread Brendan Gregg
G'Day Namhyung,

On Mon, Nov 2, 2015 at 4:57 AM, Namhyung Kim  wrote:
> Hello,
>
> This is what Brendan requested on the perf-users mailing list [1] to
> support FlameGraphs [2] more efficiently.  This patchset adds a few
> more callchain options to adjust the output for it.
>
> At first, 'folded' output mode was added.  The folded output puts all
> calchain nodes in a line separated by semicolons, a space and the
> value.  Now it only supports --stdio as other UI provides some way of
> folding/expanding callchains dynamically.
>
> The value is now can be one of 'percent', 'period', or 'count'.  The
> percent is current default output and the period is the raw number of
> sample periods.  The count is the number of samples for each callchain.
>
> Here's an example:
>
>   $ perf report --no-children --show-nr-samples --stdio -g folded,count
>   ...
> 39.93% 80  swapper  [kernel.vmlinux]  [k] intel_idel
>   
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary
>  57
>   
> intel_idle;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;...
>  23

Thanks!

So for the folded output I don't need the summary line (the row of
columns printed by hist_entry__snprintf()), and don't need anything
except folded stacks and the counts. If working with the existing
stdio interface is making it harder than it needs to be, might it be
easier to make it a separate interface (ui/folded), that just emitted
the folded output? Just an idea. This existing patchset is working for
me, I'd just be filtering the output.

Having the option for percentages and periods is nice. I can envisage
using periods (for latency flame graphs).

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC/PATCH 3/3] perf tools: Defaults to 'caller' callchain order only if --children is enabled

2015-10-22 Thread Brendan Gregg
On Thu, Oct 22, 2015 at 12:38 AM, Namhyung Kim  wrote:
> Hi Ingo,
>
> On Thu, Oct 22, 2015 at 4:32 PM, Ingo Molnar  wrote:
>>
>> * Namhyung Kim  wrote:
>>
>>> The caller callchain order is useful with --children option since it can
>>> show 'overview' style output, but other commands which don't use
>>>  --children feature like 'perf script' or even 'perf report/top' without
>>>  --children are better to keep caller order.
>
> Oops, there's a typo:  s/caller order/callee order/ :)

Thanks, I was wondering about that (and I've made that typo myself).
Thanks for the patch!

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-21 Thread Brendan Gregg
On Wed, Oct 21, 2015 at 12:23 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Wed, Oct 21, 2015 at 11:28:54AM -0700, Brendan Gregg escreveu:
>> On Tue, Oct 20, 2015 at 7:21 PM, Chandler Carruth  
>> wrote:
>> > Since Arnaldo asked, I thought I should actually try to respond 
>> > specifically
>> > to the question of why I favor the 'caller' view as the default.
>> >
>> > On Tue, Oct 20, 2015 at 3:06 AM Arnaldo Carvalho de Melo
>> >  wrote:
>> >>
>> >> > IMHO changing that order is not a good idea. Unless many users
>> >> > complained
>> >> > about it.
>> >>
>> >> Perhaps there are not that many users of callchains because the default
>> >> is not what they're used to see?
>> >>
>> >> Motivation for the change came from a video from Chandler, that
>> >> resurfaced the callchain default issue, Chandler?
>> >
>> >
>> > So, first and foremost, thanks for fixing some of my gripes about the
>> > usability of the perf tool, I'm super excited about the changes you're
>> > making, even if this one isn't among them.
>> >
>> > I think the default of caller vs. callee is probably the hardest judgement
>> > call to make about the right defaults. I can see it going both ways.
>> >
>> > When profiling my *system*, or a diverse group of programs or tasks, I 
>> > often
>> > find callee useful. Were I a kernel developer, I suspect callee would be
>> > *dramatically* more common than caller.
>> >
>> > For me, what makes the caller view much more frequently desired is that I'm
>> > usually profiling a fairly isolated application, or benchmark for an
>> > isolated library. While I always start off with some more system-level
>> > performance problem, I rarely need a detailed profile to get a reasonable
>> > idea of what subsystem to stare at, and then I spend days looking at a
>> > relatively isolated reproduction.
>> >
>> > Anyways, for profiling user-land applications, I suspect from my
>> > conversations with users that "caller" is the more common expectation.
>> [...]
>>
>> I would have said callee is the more common expectation, certainly for
>> system profilers. I quickly checked various tools to see what their
>> defaults are:
>>
>> callee:
>>
>> gdb
>> lldb
>> pstack
>> jstack
>> perf
>> stap
>> ktap
>> dtrace
>> kernel oops message
>> jvm crash
>> node.js/v8 crash
>>
>> caller:
>>
>> python traceback
>
> sysprof -> http://sysprof.com/screen-shot-4.png

Right, GUIs with tree views do this.

I've thought about it more, and how does the following 3 changes sound?:

1. perf top: change to caller default

- Since it's dumping a tree view, this ordering is intuitive, and this
is the convention from other interactive GUIs that do tree views.
Optionally add a hotkey to change it live. Could also add -G from perf
report (--inverted) to easily call the old behavior.

2. perf report: change to caller default

- I'd like to leave it callee, but as Ingo said, "top" and "report"
should be unified. I can always run -G, or -g, or edit ~/.perfconfig.

3. perf script: leave callee default.

- Since it's dumping a full stack, that's the convention with other
tools that dump full stacks (as I listed), and I think that's useful
ordering anyway when reading a full stack: show me event context
first, followed by history next. "perf script" should have -G and -g
added, to adjust behavior.

I'd also like different UIs from perf report (folded and json), but
that's a different thread. :)

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-21 Thread Brendan Gregg
On Wed, Oct 21, 2015 at 1:06 AM, Ingo Molnar  wrote:
>
> * Arnaldo Carvalho de Melo  wrote:
>
>> Indeed, finding a default that is deemed adequate for most people is, ho-hum,
>> difficult 8-)
>>
>> Ingo, what do you think?
>
> So I think the new default is a lot more intuitive, because it starts with a 
> high
> level top-down 'overview' kind of output, allowing one to drill down to the
> details.
>

Yes, it may be more intuitive.

> That's what most other tools do as well by default, and people migrating to 
> perf
> found it pretty confusing that we do a bottom-up kind of callgraph.
> [...]

But I don't think it's commonplace. Which other tools do this? Java
Flight Recorder? GUIs? Certainly not text-based system profilers or
debuggers (I made a list in another email).

Although, these are usually printing entire stacks. So the most alien
change would be to flip the "perf script" order (please don't do
this). You could argue that the perf report/top tree view is not
commonplace from other text-based profilers or debuggers, so changing
it won't be as confusing. In fact, it's like a sideways flame graph
(except it doesn't fit on the screen!).

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-21 Thread Brendan Gregg
On Tue, Oct 20, 2015 at 7:21 PM, Chandler Carruth  wrote:
> Since Arnaldo asked, I thought I should actually try to respond specifically
> to the question of why I favor the 'caller' view as the default.
>
> On Tue, Oct 20, 2015 at 3:06 AM Arnaldo Carvalho de Melo
>  wrote:
>>
>> > IMHO changing that order is not a good idea. Unless many users
>> > complained
>> > about it.
>>
>> Perhaps there are not that many users of callchains because the default
>> is not what they're used to see?
>>
>> Motivation for the change came from a video from Chandler, that
>> resurfaced the callchain default issue, Chandler?
>
>
> So, first and foremost, thanks for fixing some of my gripes about the
> usability of the perf tool, I'm super excited about the changes you're
> making, even if this one isn't among them.
>
> I think the default of caller vs. callee is probably the hardest judgement
> call to make about the right defaults. I can see it going both ways.
>
> When profiling my *system*, or a diverse group of programs or tasks, I often
> find callee useful. Were I a kernel developer, I suspect callee would be
> *dramatically* more common than caller.
>
> For me, what makes the caller view much more frequently desired is that I'm
> usually profiling a fairly isolated application, or benchmark for an
> isolated library. While I always start off with some more system-level
> performance problem, I rarely need a detailed profile to get a reasonable
> idea of what subsystem to stare at, and then I spend days looking at a
> relatively isolated reproduction.
>
> Anyways, for profiling user-land applications, I suspect from my
> conversations with users that "caller" is the more common expectation.
[...]

I would have said callee is the more common expectation, certainly for
system profilers. I quickly checked various tools to see what their
defaults are:

callee:

gdb
lldb
pstack
jstack
perf
stap
ktap
dtrace
kernel oops message
jvm crash
node.js/v8 crash

caller:

python traceback


Python was the only one I knew off-hand that is caller by default (and
it includes the text "most recent call last", suggesting the
developers thought it was necessary to point out that it was
different). Do you know what tools/profilers/debuggers these users are
using?

I really think people will be surprised if by default perf prints
stacks differently to gdb, lldb, oops messages, etc, etc. That may be
true for a specific developer community (eg, Python), but not for
system profilers.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-19 Thread Brendan Gregg
On Fri, Oct 9, 2015 at 3:25 PM, Arnaldo Carvalho de Melo
 wrote:
> Em Fri, Oct 09, 2015 at 03:10:29PM -0700, Brendan Gregg escreveu:
>> On Fri, Oct 9, 2015 at 2:56 PM, Arnaldo Carvalho de Melo
>>  wrote:
>> >
>> > Em Fri, Oct 09, 2015 at 01:34:33PM -0700, Brendan Gregg escreveu:
>> > > On Mon, Oct 5, 2015 at 2:03 PM, Arnaldo Carvalho de Melo
>> > >  wrote:
>> > > >
>> > > > From: Arnaldo Carvalho de Melo 
>> > > >
>> > > > Which is the most common default found in other similar tools.
>> > >
>> > > Interactive tools, sure, like the perf report TUI.
>> >
>> > > But this also changes the ordering of the non-interactive tools which
>> > > dump stacks: "perf report -n --stdio" and "perf script". The most
>> > > common default for dumping stacks is caller. Eg:
>> >
>> > And you use that for scripting?
>>
>> Yes; how I typically CPU profile:
>>
>> git clone https://github.com/brendangregg/FlameGraph
>> cd FlameGraph
>> perf record -F 99 -a -g -- sleep 60
>> perf script | ./stackcollapse-perf.pl | /flamegraph.pl > flame.svg
>>
>> Then open flame.svg in a browser and click around. Try it. :)
>
> Should've already, guess there are things to merge and save you some
> time :)
>
>> But it's not just scripting; We often email around "perf report -n
>> --stdio" output, or attach it to tickets, when working on an issue.
>> Easier than trying to grab the right TUI screenshot.
>
> Well, that is something we need to change... But then, are you aware of
> 'P' in the TUI? It allows you to go, expand whatever callchains you deem
> necessary and then press P and the current state of the UI will be
> dumped to a perf.hist. file. I.e. no need to get a png screenshot nor
> to copy things with gpm like stuff.
> [...]

I didn't know about 'P', thanks. I still find it quicker to "perf
report -n --stdio", then browse with my eyes and the scroll wheel.
(Then use a flame graph if the output isn't trivial to understand.)

... Just to express a callee order preference differently (at least
for the text dumping outputs), it's showing what's running (first few
frames) followed by how we got there. And that's the order I'm usually
wanting to know them... I still think it's the order printed by most
other text-based tools, although GUIs (like those with tree views) are
probably more usually caller order.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-19 Thread Brendan Gregg
On Mon, Oct 12, 2015 at 9:26 PM, Namhyung Kim  wrote:
> On Mon, Oct 05, 2015 at 06:03:35PM -0300, Arnaldo Carvalho de Melo wrote:
>> From: Arnaldo Carvalho de Melo 
>>
>> Which is the most common default found in other similar tools.
>
> I think it's more useful to change the default only when --children is
> used.  And there's a related issue too - please see below link:
>
>   https://lkml.org/lkml/2014/8/14/49

--children is default for things like "perf report --stdio" and
(implied) "perf script", so any change to --children is a change to
default.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-10 Thread Brendan Gregg
On Sat, Oct 10, 2015 at 12:09 AM, Ingo Molnar  wrote:
>
> * Brendan Gregg  wrote:
>
>> On Fri, Oct 9, 2015 at 2:56 PM, Arnaldo Carvalho de Melo
>>  wrote:
>> >
>> > Em Fri, Oct 09, 2015 at 01:34:33PM -0700, Brendan Gregg escreveu:
>> > > On Mon, Oct 5, 2015 at 2:03 PM, Arnaldo Carvalho de Melo
>> > >  wrote:
>> > > >
>> > > > From: Arnaldo Carvalho de Melo 
>> > > >
>> > > > Which is the most common default found in other similar tools.
>> > >
>> > > Interactive tools, sure, like the perf report TUI.
>> >
>> > > But this also changes the ordering of the non-interactive tools which
>> > > dump stacks: "perf report -n --stdio" and "perf script". The most
>> > > common default for dumping stacks is caller. Eg:
>> >
>> > And you use that for scripting?
>>
>> Yes; how I typically CPU profile:
>>
>> git clone https://github.com/brendangregg/FlameGraph
>> cd FlameGraph
>> perf record -F 99 -a -g -- sleep 60
>> perf script | ./stackcollapse-perf.pl | /flamegraph.pl > flame.svg
>>
>> Then open flame.svg in a browser and click around. Try it. :)
>
> So I tried it:
>
>  triton:~/s/FlameGraph> ls -l flame.svg
>  -rw-rw-r-- 1 mingo mingo 1022870 Oct 10 09:06 flame.svg
>
> but when I tried to view it via ImageMagick, it first showed an empty screen:
>
>  triton:~/s/FlameGraph> display flame.svg
>
> then when exiting it said:
>
>  display: non-conforming drawing primitive definition `text-anchor' @ 
> error/draw.c/DrawImage/3182.
>
> although I guess that latter is an unrealted ImageMagick problem, not caused 
> by
> the SVG.
>

Sorry, I should have said to open it in a browser (Chrome/Firefox);
the SVG contains some (entirely embedded) JavaScript to provide
interactivity (click to zoom, Ctrl-F to search).

I think it used to work in ImageMagick (without JavaScript
interactivity), but I haven't tried in a long time. I'll have to check it again.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-09 Thread Brendan Gregg
On Fri, Oct 9, 2015 at 2:56 PM, Arnaldo Carvalho de Melo
 wrote:
>
> Em Fri, Oct 09, 2015 at 01:34:33PM -0700, Brendan Gregg escreveu:
> > On Mon, Oct 5, 2015 at 2:03 PM, Arnaldo Carvalho de Melo
> >  wrote:
> > >
> > > From: Arnaldo Carvalho de Melo 
> > >
> > > Which is the most common default found in other similar tools.
> >
> > Interactive tools, sure, like the perf report TUI.
>
> > But this also changes the ordering of the non-interactive tools which
> > dump stacks: "perf report -n --stdio" and "perf script". The most
> > common default for dumping stacks is caller. Eg:
>
> And you use that for scripting?

Yes; how I typically CPU profile:

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 60
perf script | ./stackcollapse-perf.pl | /flamegraph.pl > flame.svg

Then open flame.svg in a browser and click around. Try it. :)

But it's not just scripting; We often email around "perf report -n
--stdio" output, or attach it to tickets, when working on an issue.
Easier than trying to grab the right TUI screenshot.

>
> > # perf report -n --stdio
> > [...]
> > 16.87%334   iperf  [kernel.kallsyms] [k] 
> > copy_user_enhanced_fast_string
> >  |
> >  --- 0x7f0683ba1ccd
> >  system_call_fastpath
> >  sys_write
> >  vfs_write
> >  do_sync_write
> >  sock_aio_write
> >  do_sock_write.isra.10
> >  inet_sendmsg
> >  copy_user_enhanced_fast_string
> > [...]
> >
> > That's upside down. The current default preserves ordering from the
> > informational line onwards:
> >
> > # perf report -n --stdio -g fractal,0.5,callee
> > [...]
> > 16.87%334   iperf  [kernel.kallsyms] [k] 
> > copy_user_enhanced_fast_string
> >  |
> >  --- copy_user_enhanced_fast_string
> > |
> > |--64.37%-- inet_sendmsg
> > |  do_sock_write.isra.10
> > |  sock_aio_write
> > |  do_sync_write
> > |  vfs_write
> > |  sys_write
> > |  system_call_fastpath
> > |  0x7f0683ba1ccd
> >
> > ... Those are just short examples. Another profile I'm working on now
> > gets really messy on "perf report -n --stdio"; eg:
> >
> > perf report -n --stdio -g graph,0.5,caller
> > 94.80% 0.10% 2  iperf [kernel.vmlinux][k]
> > entry_SYSCALL_64_fastpath
> >|
> >|--94.70%-- entry_SYSCALL_64_fastpath
> >|  |
>
> >
> > The current default never gets beyond 5 levels deep. The new default
> > goes to 25 levels. At least with perf report I can override the
> > default using "-g". perf script doesn't support that.
>
> Ok, so changing defaults is not nice, but in this case looked sensible,
> ends up not being for you...

I'm pretty sure this would surprise anyone looking at dumped stacks,
where the convention is caller. pstack, jstack, gdb, systemtap,
dtrace, oops message, etc. I get that we want this for the TUI, but
not dumped stacks.

>
> > Can this patch please preserve the callee ordering for non-interactive
> > output? (perf script, perf report -n --stdio). Thanks,
>
> If this is because you do scripting on it? Wouldn't it be better to not
> depend on defaults, always specify what you want and then the bug would
> be constrained to 'perf script' where we need to provide a way to change
> the default?

Actually, for my flame graphs we should really have perf report have a
--folded output to emit folded output (I emailed perf-users), callee.

For scripting we can always specify -g.

I'm thinking of others who use perf report/script at the CLI, and
expect callee output.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 0/4] perf: add support for profiling jitted code

2015-10-09 Thread Brendan Gregg
On Sun, Oct 4, 2015 at 1:05 PM, Stephane Eranian  wrote:
>
> Brendan,
>
> On Thu, Oct 1, 2015 at 3:45 PM, Brendan Gregg  
> wrote:
> > G'Day,
> >
> > On Wed, Sep 30, 2015 at 11:45 PM, Stephane Eranian  
> > wrote:
> >>
> >> This patch series extends perf record/report/annotate to enable
> >> profiling of jitted (just-in-time compiled) code. The current
> >> perf tool provides very limited support for profiling jitted
> >> code for some runtime environments. But the support is experimental
> >> and cannot be used in complex environments. It relies on files
> >> in /tmp, for instance. It does not support annotate mode or
> >> rejitted code.
> >>
> >> This patch series adds a better way of profiling jitted code
> >> with the following advantages:
> >>- support any jitted code environment (some with modifications)
> >>- support Java runtime with JVMTI interface with no modifications
> >>- provides a portable JVMTI agent library
> >>- known to support V8 runtime
> >>- known to support DART runtime
> >>- supports code rejitting and code movements
> >>- no files in /tmp
> >>- meta-data file is unique to each run
> >>- no changes to perf report/annotate
> >>- support per-thread and system-wide profiling
> >>- support monitoring of multiple simultaneous Jit runtimes
> >>- source level view in perf annotate
> >>
> >> The support is based on cooperation with the runtime. For Java runtimes,
> >> supporting the JVMTI interface, there is no change necessary. For other
> >> runtimes, modifications are necessary to emit the meta-data to support
> >> symbolization, annotation, source lines correlation of the samples.
> >> Those modifications are relatively straighforward, some have been
> >> implemented in V8 and DART.
> >>
> >> The jit environment emits a binary dump file which contains the jitted
> >> code (in raw format) and meta-data describing the mapping of functions.
> >> The binary format is documented in the jitdump.h header file. It is
> >> adapted from the OProfile jitdump format.
> >
> > While this is impressive work, I don't think I'd use it very much, and
> > I wouldn't encourage others too either. Is it right that this approach
> > needs to be turned on from runtime start, and will constantly emit
> > timestamped JIT records? I'd use that as a backup for existing
> > techniques for perf_events and jitted runtimes.
> >
> This boils down that when does the JIT runtime emit the jitdump data?
> In the case of openJDK and given how I wrote the agent, it needs to
> run from start to finish.  In order for perf to have full visibility, it needs
> to get info about all the code that has been jitted so far. This could be
> done after start if there was somehow a protocol to handle this in the
> runtime, like a signal. It would just need to dump the current status, 
> including
> all the code. The rest of the  support would work. In other words, if there 
> was
> a way to signal to the runtime that it is being monitored and that it needs to
> dump its state, then everything would work. To avoid generating more dumps,
> the runtime would also have to be informed that monitoring has stopped.


Ok, so it's possible that an on-demand approach could work, if the
agent can be modified. And since perf-map-agent made this transition
(https://github.com/jrudolph/perf-map-agent), examples of how to do
this are probably in its git history. :) The outcome could well be
something like:

perf record -F 99 -a -g -- perf_java_agent --sleep 10

Where a perf_java_agent tool attached to all Javas, dumped initial
symbols, and then logged timestamped symbols during the run (--sleep
10), then detached at the end. That would be ideal. Which just means
changing the Java agent, not the perf implementation.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/16] perf callchain: Switch default to 'graph,0.5,caller'

2015-10-09 Thread Brendan Gregg
On Mon, Oct 5, 2015 at 2:03 PM, Arnaldo Carvalho de Melo
 wrote:
>
> From: Arnaldo Carvalho de Melo 
>
> Which is the most common default found in other similar tools.


Interactive tools, sure, like the perf report TUI.

But this also changes the ordering of the non-interactive tools which
dump stacks: "perf report -n --stdio" and "perf script". The most
common default for dumping stacks is caller. Eg:

# perf report -n --stdio
[...]
16.87%334   iperf  [kernel.kallsyms] [k]
copy_user_enhanced_fast_string
 |
 --- 0x7f0683ba1ccd
 system_call_fastpath
 sys_write
 vfs_write
 do_sync_write
 sock_aio_write
 do_sock_write.isra.10
 inet_sendmsg
 copy_user_enhanced_fast_string
[...]

That's upside down. The current default preserves ordering from the
informational line onwards:

# perf report -n --stdio -g fractal,0.5,callee
[...]
16.87%334   iperf  [kernel.kallsyms] [k]
copy_user_enhanced_fast_string
 |
 --- copy_user_enhanced_fast_string
|
|--64.37%-- inet_sendmsg
|  do_sock_write.isra.10
|  sock_aio_write
|  do_sync_write
|  vfs_write
|  sys_write
|  system_call_fastpath
|  0x7f0683ba1ccd

... Those are just short examples. Another profile I'm working on now
gets really messy on "perf report -n --stdio"; eg:

perf report -n --stdio -g graph,0.5,caller
94.80% 0.10% 2  iperf [kernel.vmlinux][k]
entry_SYSCALL_64_fastpath
   |
   |--94.70%-- entry_SYSCALL_64_fastpath
   |  |
   |  |--55.76%-- sys_write
   |  |  |
   |  |  |--55.10%-- vfs_write
   |  |  |  |
   |  |  |  |--54.39%-- __vfs_write
   |  |  |  |  |
   |  |  |  |   --54.34%--
sock_write_iter
   |  |  |  | |
   |  |  |  |
|--54.19%-- sock_sendmsg
   |  |  |  |
|  |
   |  |  |  |
|  |--53.99%-- inet_sendmsg
   |  |  |  |
|  |  |
   |  |  |  |
|  |  |--53.08%-- tcp_sendmsg
   |  |  |  |
|  |  |  |
   |  |  |  |
|  |  |  |--17.27%--
copy_user_enhanced_fast_string
   |  |  |  |
|  |  |  |
   |  |  |  |
|  |  |  |--11.41%-- tcp_push_one
   |  |  |  |
|  |  |  |  |
   |  |  |  |
|  |  |  |  |--11.31%-- tcp_write_xmit
   |  |  |  |
|  |  |  |  |  |
   |  |  |  |
|  |  |  |  |  |--10.00%--
tcp_transmit_skb
   |  |  |  |
|  |  |  |  |  |  |
   |  |  |  |
|  |  |  |  |  |
|--9.29%-- ip_queue_xmit
   |  |  |  |
|  |  |  |  |  |  |
  |
   |  |  |  |
|  |  |  |  |  |  |
  |--8.69%-- ip_local_out_sk
   |  |  |  |
|  |  |  |  |  |  |
  |  |
   |  |  |  |
|  |  |  |  |  |  |
  |  |--8.64%-- ip_output
   |  |  |  |
|  |  |  |  |  |  |
  |  |  ip_finish_output
   |  |  |  |
|  |  |  |  |  |  |
  |  |  |
   |  |  |  |
|  | 

Re: [PATCH v7 0/4] perf: add support for profiling jitted code

2015-10-01 Thread Brendan Gregg
G'Day,

On Wed, Sep 30, 2015 at 11:45 PM, Stephane Eranian  wrote:
>
> This patch series extends perf record/report/annotate to enable
> profiling of jitted (just-in-time compiled) code. The current
> perf tool provides very limited support for profiling jitted
> code for some runtime environments. But the support is experimental
> and cannot be used in complex environments. It relies on files
> in /tmp, for instance. It does not support annotate mode or
> rejitted code.
>
> This patch series adds a better way of profiling jitted code
> with the following advantages:
>- support any jitted code environment (some with modifications)
>- support Java runtime with JVMTI interface with no modifications
>- provides a portable JVMTI agent library
>- known to support V8 runtime
>- known to support DART runtime
>- supports code rejitting and code movements
>- no files in /tmp
>- meta-data file is unique to each run
>- no changes to perf report/annotate
>- support per-thread and system-wide profiling
>- support monitoring of multiple simultaneous Jit runtimes
>- source level view in perf annotate
>
> The support is based on cooperation with the runtime. For Java runtimes,
> supporting the JVMTI interface, there is no change necessary. For other
> runtimes, modifications are necessary to emit the meta-data to support
> symbolization, annotation, source lines correlation of the samples.
> Those modifications are relatively straighforward, some have been
> implemented in V8 and DART.
>
> The jit environment emits a binary dump file which contains the jitted
> code (in raw format) and meta-data describing the mapping of functions.
> The binary format is documented in the jitdump.h header file. It is
> adapted from the OProfile jitdump format.

While this is impressive work, I don't think I'd use it very much, and
I wouldn't encourage others too either. Is it right that this approach
needs to be turned on from runtime start, and will constantly emit
timestamped JIT records? I'd use that as a backup for existing
techniques for perf_events and jitted runtimes.

Right now we (Netflix) can profile Java in production using
perf_events, using an on-demand symbol dump: the perf-map-agent
JVMTI[1]. This agent used to be loaded on startup, like this patchset,
which cost overhead: CPU, filesystem I/O, storage. The problem was
that we didn't know which of the tens of thousands of instances we'd
want to profile, so loading an always-on JIT symbol dumper on all
instances would cost resources. With on-demand, we only dump symbols
on the few instances needed. On-demand has caveats, of course, and
symbols can change during the profile. But so far that's not been a
problem.

So this seems like a lot of changes for what won't be our primary way
of using perf_events on Java or Node.js, which we currently can
profile with perf_events effectively. So long as we're aware of that.

Brendan

[1] http://techblog.netflix.com/2015/07/java-in-flames.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v9 00/22] tracing: 'hist' triggers

2015-07-22 Thread Brendan Gregg
G'Day Tom,

On Thu, Jul 16, 2015 at 10:22 AM, Tom Zanussi
 wrote:
>
> This is v9 of the 'hist triggers' patchset.
>
[...]

I've browsed the functionality (sorry, catching up), and it looks like
this will solve a number of common problems. But it seems
tantalizingly close to solving a few more. These may already be on
your future todo list.

A) CPU stack profiling

Kernel stacktrace as a key will be hugely useful; is there a way to
enable this for a sampling profile? (eg, what perf record -F 99 does).
I take CPU profiles daily, and would prefer to aggregate stacks
in-kernel. Also, would like user stacktrace as a key (even if it's
just the hex).

B) Key buckets

Eg, imagine:

echo 'hist:keys=common_pid.execname,count.log2:val=count' >
/sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger

to get a log2 bucketized histogram of syscall read request size. Same
for any value where using the value as a key gets too verbose, and you
just want a rough look at the distribution. (Would make it more
readable if it could also be sorted by the log2 value.)

C) Latency as a bucket key

With kprobes, we could then have a log2 histogram of any function call
latency, collected efficiently. (There's already the function timers
in ftrace, which I'm using from function_graph with filters sets to
only match the target function.)

... Those are the other common use cases, that the hist functionality
seemed suited for. Beyond that gets more custom, and we can use eBPF.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: [RFC PATCH perf/core v2 00/16] perf-probe --cache and SDT support

2015-07-19 Thread Brendan Gregg
G'Day Masami-san, Namhyung,

I'm really looking forward to this feature -- very useful, thanks!...

On Sat, Jul 18, 2015 at 9:24 PM, Namhyung Kim  wrote:
> Hi Masami,
>
> On Fri, Jul 17, 2015 at 12:21:42PM +0900, Masami Hiramatsu wrote:
>> Now I'm thinking that we should avoid using %event syntax for perf-list
>> and perf-record to avoid confusion. For example, suppose that we have
>> "libfoo:bar" SDT event, when we just scanned the libfoo binary and
>> use it via perf-record, we'll run perf record -e "%libfoo:bar".
>> However, after we set the probe via perf-probe, we have to run
>> perf record -e "libfoo:bar". That difference looks no good.
>> So, I think in both case it should accept -e "libfoo:bar" syntax.
>
> I don't remember how the SDT events should be shown to users.  Sorry
> if I'm missing something here.
>
> AFAIK an SDT event consists of a provider and an event name.  So it
> can be simply 'provider:event' like tracepoints or
> 'binary:provider_event' like uprobes.
>
> I like the former because it's simpler but it needs to guarantee that
> it doesn't clash with existing tracepoints/[ku]probes.  So IIUC we
> chose the '%' sign to distinguish them.  But after setting a probe at
> it, the group name should be the binary name.  So the whole event name
> might be changed, and this is not good.

I don't think we should worry about the clash, as the provider name
should differentiate. So I think "libfoo:bar" with perf record is
better. After adding them to the cache (via % if needed), I'd think
they would be best looking like tracepoints. Eg, listing them together
they can be differentiated, something like:

# perf list
[...]
  block:block_rq_abort   [Tracepoint event]
  block:block_rq_requeue [Tracepoint event]
  block:block_rq_complete[Tracepoint event]
[...]
  libc:memory_heap_new[User tracepoint event]
  libc:memory_heap_free   [User tracepoint event]
  libc:memory_heap_more   [User tracepoint event]
[...]

Then used the same.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 00/22] perf tools: introduce 'perf bpf' command to load eBPF programs.

2015-05-05 Thread Brendan Gregg
On Thu, Apr 30, 2015 at 3:52 AM, Wang Nan  wrote:
[...]
> An example is pasted at the bottom of this cover letter. In that
> example, mybpfprog is configured by string in config section, and will
> be probed at __alloc_pages_nodemask. sample_bpf.o is generated using:
>
>  $ $CLANG -I/usr/src/kernel/include -I/usr/src/kernel/usr/include 
> -D__KERNEL__ \
>  -Wno-unused-value -Wno-pointer-sign \
>  -O2 -emit-llvm -c sample_bpf.c -o -| $LLC -march=bpf -filetype=obj 
> -o \
>  sample_bpf.o
>
> And can be loaded using:
>
>  $ perf bpf sample_bpf.o
[...]
>   EXAMPL 
>  - sample_bpf.c -
>  #include 
>  #include 
>  #include 
>
>  #define SEC(NAME) __attribute__((section(NAME), used))
>
>  static int (*bpf_map_delete_elem)(void *map, void *key) =
> (void *) BPF_FUNC_map_delete_elem;
>  static int (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
> (void *) BPF_FUNC_trace_printk;
>
>  struct bpf_map_def {
> unsigned int type;
> unsigned int key_size;
> unsigned int value_size;
> unsigned int max_entries;
>  };
>
>  struct pair {
> u64 val;
> u64 ip;
>  };
>
>  struct bpf_map_def SEC("maps") my_map = {
> .type = BPF_MAP_TYPE_HASH,
> .key_size = sizeof(long),
> .value_size = sizeof(struct pair),
> .max_entries = 100,
>  };
>
>  SEC("kprobe/kmem_cache_free")
>  int bpf_prog1(struct pt_regs *ctx)
>  {
> long ptr = ctx->r14;
> bpf_map_delete_elem(&my_map, &ptr);
> return 0;
>  }
>
>  SEC("mybpfprog")
>  int bpf_prog_my(void *ctx)
>  {
> char fmt[] = "Haha\n";
> bpf_trace_printk(fmt, sizeof(fmt));
> return 0;
>  }
>
>  char _license[] SEC("license") = "GPL";
>  u32 _version SEC("version") = LINUX_VERSION_CODE;
>  char _config[] SEC("config") = ""
>  "mybpfprog=__alloc_pages_nodemask\n";

Was this just some random eBPF code to test the perf framework? Or was
it to do something useful with
kmem_cache_free()/__alloc_pages_nodemask() tracing as well? It looks a
bit incomplete.

If it's just random code, I'd include a comment to state that,
otherwise it's a bit confusing. A complete example might be better;
eg, something like Alexei's tracex1, for a simple example of
bpf_trace_printk(), or sockex1, for a simple map example.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 0/4] perf: add support for profiling jitted code

2015-03-31 Thread Brendan Gregg
On Tue, Mar 31, 2015 at 12:33 AM, Brendan Gregg
 wrote:
> G'Day Stephane,
>
> On Mon, Mar 30, 2015 at 3:19 PM, Stephane Eranian  wrote:
> [...]
>> The current support only works when the runtime is monitored from
>> start to finish: perf record java --agentpath:libpfmjvmti.so my_class.
>>
>> Once the run is completed, the jitdump file needs to be injected into
>> the perf.data file. This is accomplished by using the perf inject command.
>> This will also generate an ELF image for each jitted function. The
>> inject MMAP records will point to those ELF images. The reasoning
>> behind using ELF images is that it makes processing for perf report
>> and annotate automatic and transparent. It also makes it easier to
>> package and analyze on a remote machine.
> [...]
>
> This is really impressive work. Do we have an idea of the overhead for
> running the java agent?
>
> Today, I'm using perf-map-agent, loaded dynamically, to dump a
> /tmp/perf*.map file as needed. My company has tens of thousands of
> Linux instances running Java, but very few need profiling, and we
> don't know which beforehand. So a snapshot-on-demand approach is
> ideal. An always-on approach, well, we'd have to know the overhead (I
> can build the agent and test...).

I built the agent and tested with an application framework
micro-benchmark, and saw the performance overhead drop after start
from about 13% initially (measured as a reduction in maximum req/sec
given fixed CPU capacity), to 1.1% after a minute, and then 0.13%
(which is really just noise) after several minutes of high load.

So the overhead is basically zero after (minutes of) warmup, at least
for my test. My jit.dump file reached 8 Mbytes, and was growing by a
tiny amount every 30 seconds or so (hence the near-zero overhead). I'm
much less concerned about overheads now.

I'll test with a production workload if I can... But I'm still curious
about why we're even doing this, instead of the previous method of
taking symbol snapshots. Is there a backstory? If it involves a case
of high symbol churn, then this should also mean non-zero overhead to
constantly log.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v6 0/4] perf: add support for profiling jitted code

2015-03-31 Thread Brendan Gregg
G'Day Stephane,

On Mon, Mar 30, 2015 at 3:19 PM, Stephane Eranian  wrote:
[...]
> The current support only works when the runtime is monitored from
> start to finish: perf record java --agentpath:libpfmjvmti.so my_class.
>
> Once the run is completed, the jitdump file needs to be injected into
> the perf.data file. This is accomplished by using the perf inject command.
> This will also generate an ELF image for each jitted function. The
> inject MMAP records will point to those ELF images. The reasoning
> behind using ELF images is that it makes processing for perf report
> and annotate automatic and transparent. It also makes it easier to
> package and analyze on a remote machine.
[...]

This is really impressive work. Do we have an idea of the overhead for
running the java agent?

Today, I'm using perf-map-agent, loaded dynamically, to dump a
/tmp/perf*.map file as needed. My company has tens of thousands of
Linux instances running Java, but very few need profiling, and we
don't know which beforehand. So a snapshot-on-demand approach is
ideal. An always-on approach, well, we'd have to know the overhead (I
can build the agent and test...).

During a profile, our symbols can change, but it's not been a serious
problem. I can take a before and after snapshot of symbols
(/tmp/perf.map files), then highlight which have moved and are
therefore suspicious (not perfect, but ok). Yes, having a timestamped
symbol log would be most accurate, but I wouldn't pay a lot of
overhead for it. Is there a (horror) story behind why this was so
necessary?

Minor comment: I didn't notice an fsync() in the jvmti code. Can
symbol updates be buffered and not written out by the time perf is
reading them?

Again, this is impressive work. Just wanted to share comments from a
heavy user of perf with Java. You may already have other, better,
ideas for doing on-demand symbol dumps anyway.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf, tools: Add PERF_PID

2014-09-09 Thread Brendan Gregg
On Tue, Sep 9, 2014 at 5:03 PM, Andi Kleen  wrote:
> From: Andi Kleen 
>
> It's currently difficult to filter out perf itself using a filter.
> This can give cascading effects during IO tracing when the IO
> perf does itself causes more trace output.
>
> The best way to filter is to use the pid. But it's difficult to get the pid
> of perf without using hacks.
>
> Add a PERF_PID meta variable to the perf filter that contains the current pid.
>
> With this patch the following works
>
> % perf record -e syscalls:sys_enter_write -a --filter 'common_pid != 
> PERF_PID' ...
>
> This won't work for more complex perf pipe lines with multiple processes,
> but at least solves the problem nicely for a single perf.

Thank you, this will be handy. I tested it and it works.

It will mean I can avoid hacks like this:
https://github.com/brendangregg/perf-tools/blob/d30e8a1a9f136d532ba78c6d48aedb99dc316be9/syscount#L144

I think it's unrelated to this patch, but the very next test I tried
didn't work as expected:

# perf record -e 'syscalls:sys_enter_*' --filter 'common_pid !=
PERF_PID' -a sleep 5

It only filters the last event from the wildcard expansion, leaving
write() unfiltered. Looks like it's because parse_filter() is only
setting the filter for perf_evlist__last(), instead of doing an
evlist__for_each(). How best to deal with this can be addressed in a
separate patch later.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf probe: Warn user to rebuild target with debuginfo

2014-08-14 Thread Brendan Gregg
On Thu, Aug 14, 2014 at 6:51 PM, Masami Hiramatsu
 wrote:
> Here is v2 patch, which I've added "or install an appropriate debuginfo 
> pacakge." :)
[...]

Looks good, thanks.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v4 net-next 03/26] bpf: introduce syscall(BPF, ...) and BPF maps

2014-08-14 Thread Brendan Gregg
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov  wrote:
[...]
> maps can have different types: hash, bloom filter, radix-tree, etc.
>
> The map is defined by:
>   . type
>   . max number of elements
>   . key size in bytes
>   . value size in bytes

Can values be strings or byte arrays? How would user-level bpf read
them? The two types of uses I'm thinking are:

A. Constructing a custom string in kernel-context, and using that as
the value. Eg, a truncated filename, or a dotted quad IP address, or
the raw contents of a packet.
B. I have a pointer to an existing buffer or string, eg a filename,
that will likely be around for some time (>1s). Instead of the value
storing the string, it could just be a ptr, so long as user-level bpf
has a way to read it.

Also, can keys be strings? I'd ask about multiple keys, but if they
can be a string, I can delimit in the key (eg, "PID:filename").
Thanks,

Brendan

-- 
http://www.brendangregg.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v4 net-next 25/26] samples: bpf: counting eBPF example in C

2014-08-14 Thread Brendan Gregg
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov  wrote:
> this example has two probes in C that use two different maps.
>
> 1st probe is the similar to dropmon.c. It attaches to kfree_skb tracepoint and
> count number of packet drops at different locations
>
> 2nd probe attaches to kprobe/sys_write and computes a histogram of different
> write sizes
>
> Usage:
> $ sudo ex2
>
> Should see:
> writing bpf-5 -> /sys/kernel/debug/tracing/events/skb/kfree_skb/filter
> writing bpf-8 -> /sys/kernel/debug/tracing/events/kprobes/sys_write/filter
> location 0x816efc67 count 1
>
> location 0x815d8030 count 1
> location 0x816efc67 count 3
>
> location 0x815d8030 count 4
> location 0x816efc67 count 9
>
>syscall write() stats
>  byte_size   : count distribution
>1 -> 1: 3141 |  |
>2 -> 3: 2|  |
>4 -> 7: 14   |  |
>8 -> 15   : 3268 |* |
>   16 -> 31   : 732  |  |
>   32 -> 63   : 20042|* |
>   64 -> 127  : 12154|**|
>  128 -> 255  : 2215 |***   |
>  256 -> 511  : 9|  |
>  512 -> 1023 : 0|  |
> 1024 -> 2047 : 1|  |

This is pretty awesome.

Given that this is tracing two tracepoints at once, I'd like to see a
similar example where time is stored on the first tracepoint,
retrieved on the second for a delta calculation, then presented with a
similar histogram as seen above.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v4 net-next 17/26] tracing: allow eBPF programs to be attached to events

2014-08-14 Thread Brendan Gregg
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov  wrote:
[...]
> +/* For tracing filters save first six arguments of tracepoint events.
> + * On 64-bit architectures argN fields will match one to one to arguments 
> passed
> + * to tracepoint events.
> + * On 32-bit architectures u64 arguments to events will be seen into two
> + * consecutive argN, argN+1 fields. Pointers, u32, u16, u8, bool types will
> + * match one to one
> + */
> +struct bpf_context {
> +   unsigned long arg1;
> +   unsigned long arg2;
> +   unsigned long arg3;
> +   unsigned long arg4;
> +   unsigned long arg5;
> +   unsigned long arg6;
> +   unsigned long ret;
> +};

While this works, the argN+1 shift for 32-bit is a gotcha to learn.
Lets say arg1 was 64-bit, and my program only examined arg2. I'd need
two programs, one for 64-bit (using arg2) and 32-bit (arg3). If there
was a way not to shift arguments, I could have one program for both.
Eg, additional arg1hi, arg2hi, ... for the higher order u32s.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] perf probe: Warn user to rebuild target with debuginfo

2014-08-14 Thread Brendan Gregg
On Thu, Aug 14, 2014 at 11:29 AM, Masami Hiramatsu
 wrote:
[...]
> The "rebuild with ..." part changes to "rebuild with CONFIG_DEBUG_INFO"
> if the target is the kernel or a kernel module.

Thanks, definitely an improvement! Should the kernel message also
mention kernel debuginfo packages? Depends on the distribution and
environment, but I think for some users the solution is to add the
package.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v4 net-next 23/26] samples: bpf: elf file loader

2014-08-14 Thread Brendan Gregg
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov  wrote:
[...]
> +static int load_and_attach(const char *event, struct bpf_insn *prog, int 
> size)
> +{
> +   int fd, event_fd, err;
> +   char fmt[32];
> +   char path[256] = DEBUGFS;
> +
> +   fd = bpf_prog_load(BPF_PROG_TYPE_TRACING_FILTER, prog, size, license);
> +
> +   if (fd < 0) {
> +   printf("err %d errno %d\n", fd, errno);
> +   return fd;
> +   }

Minor suggestion: since this is sample code, I'd always print the bpf
log after this this printf() error message:

printf("%s", bpf_log_buf);

Which has helped me debug my eBPF programs, as will be the case for
anyone hacking on the examples. Or have a function for logdie(), if
the log buffer may be populated with useful messages from other error
paths as well.

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC v4 net-next 00/26] BPF syscall, maps, verifier, samples, llvm

2014-08-14 Thread Brendan Gregg
On Wed, Aug 13, 2014 at 12:57 AM, Alexei Starovoitov  wrote:
[...]
> Tracing use case got some improvements as well. Now eBPF programs can be
> attached to tracepoint, syscall, kprobe and C examples are more usable:
> ex1_kern.c - demonstrate how programs can walk in-kernel data structures
> ex2_kern.c - in-kernel event accounting and user space histograms
> See patch #25

This is great, thanks! I've been using this new support, and
successfully ported an an older tool of mine (bitesize) to eBPF. I was
using the block:block_rq_issue tracepoint, and performing a custom
in-kernel histogram, like in the ex2_kern.c example, for I/O size.

I also did some quick overhead testing and found eBPF with JIT to be
relatively fast. (I'd share numbers but it's platform specific.) The
syscall tracepoints were a bit slower than hoped, for what I think is
a well known issue.

Are there thoughts in general for how this might be used for embedded
devices, where installing clang/llvm might be prohibitive? Compile on
another system and move the binaries over? thanks,

Brendan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/