Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread Alexei Starovoitov

On 8/28/15 7:14 PM, xiakaixu wrote:

> Right, this is just a little example. Actually, I have tested this
> ability on kernel side and user space side, that is kprobe and uprobe.


great to hear.


> At this time I would like your comments on the currently chosen implementation.
> The struct perf_event_map_def is now introduced so that the user can directly
> define the struct perf_event_attr; we can then skip the parse_events process
> and call sys_perf_event_open on the events directly. This is the simplest
> implementation, but I am not sure it is the most appropriate.


I think it's a bit kludgy. You are trying to squeeze more and more
information into sections and pass it via ELF.
It worked for samples early on, but now it's time to do better.
In bcc we just write normal C and extract all the necessary information
by looking at the C via the clang rewriter API. I think it's a cleaner approach.
In our use case we can compile on the host, so there are no intermediate files
and no ELF files. If you have to cross-compile you can still use the same
approach: let llvm generate the .o and emit all the extra stuff as another
configuration file (say in .json), then let the host load the .o and use the
.json to set up PMU events and everything else. That will work for a larger
number of use cases, but in the end I don't see how you can avoid moving to a
c+python or c+whatever approach, since static configuration (whether in
.json or in an ELF section) is not going to be enough. You'd need a
program in user space to deal with all the data that the bpf program
in the kernel is collecting.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread xiakaixu
On 2015/8/29 9:28, Alexei Starovoitov wrote:
> On 8/27/15 3:42 AM, Kaixu Xia wrote:
>> An example is pasted at the bottom of this cover letter. In that example,
>> we can get the cpu_cycles and exception taken in sys_write.
>>
>>   $ cat /sys/kernel/debug/tracing/trace_pipe
>>   $ ./perf record --event perf-bpf.o ls
>> ...
>>   cat-1653  [003] d..1 88174.613854: : ente:  CPU-3 cyc:48746333 exc:84
>>   cat-1653  [003] d..2 88174.613861: : exit:  CPU-3 cyc:48756041 exc:84
> 
> nice. probably more complex example that computes the delta of the pmu
> counters on the kernel side would be even more interesting.

Right, this is just a small example. I have actually tested this
ability on both the kernel side and the user-space side, that is, with
kprobes and uprobes. The collected deltas of the PMU counters from the
kernel and glibc are correct and meet the expected goals. I will include
them in the next version.

At this time I would like your comments on the currently chosen implementation.
The struct perf_event_map_def is now introduced so that the user can directly
define the struct perf_event_attr; we can then skip the parse_events process
and call sys_perf_event_open on the events directly. This is the simplest
implementation, but I am not sure it is the most appropriate.

> Do you think you can extend 'perf stat' with a flag that does
> stats collection for a given kernel or user function instead of the
> whole process ?
> Then we can use perf record/report to figure out hot functions and
> follow with 'perf stat -f my_hot_func my_process' to drill into
> particular function stats.

Good idea! I will consider it when this patchset is basically completed.
> 
> 
> .
> 




Re: [RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-28 Thread Alexei Starovoitov

On 8/27/15 3:42 AM, Kaixu Xia wrote:

> An example is pasted at the bottom of this cover letter. In that example,
> we can get the cpu_cycles and exception taken in sys_write.
>
>   $ cat /sys/kernel/debug/tracing/trace_pipe
>   $ ./perf record --event perf-bpf.o ls
> ...
>   cat-1653  [003] d..1 88174.613854: : ente:  CPU-3 cyc:48746333 exc:84
>   cat-1653  [003] d..2 88174.613861: : exit:  CPU-3 cyc:48756041 exc:84


Nice. A more complex example that computes the delta of the PMU
counters on the kernel side would probably be even more interesting.
Do you think you can extend 'perf stat' with a flag that does
stats collection for a given kernel or user function instead of the
whole process?
Then we can use perf record/report to figure out hot functions and
follow up with 'perf stat -f my_hot_func my_process' to drill into
particular function stats.
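
To make the delta idea concrete, here is a minimal sketch in plain C of the
bookkeeping such a program would need. It is not eBPF code and not part of
the patch set: MAX_CPUS, struct pmu_snapshot, probe_entry() and
probe_return() are hypothetical names, and a real program would keep the
snapshot in a BPF map and obtain the counter value from
bpf_perf_event_read(). It also assumes the task does not migrate to another
CPU between entry and return and that calls do not nest.

```c
#include <stdint.h>

#define MAX_CPUS 32 /* assumption: mirrors max_entries in the example maps */

/* Hypothetical per-CPU state: counter value captured at function entry. */
struct pmu_snapshot {
	uint64_t start;
	int valid;
};

static struct pmu_snapshot snapshots[MAX_CPUS];

/* Called where the entry probe would fire: remember the counter value. */
static void probe_entry(uint32_t cpu, uint64_t counter_now)
{
	if (cpu >= MAX_CPUS)
		return;
	snapshots[cpu].start = counter_now;
	snapshots[cpu].valid = 1;
}

/*
 * Called where the return probe would fire.
 * Stores the delta and returns 0, or returns -1 if there is no
 * matching entry snapshot for this CPU.
 */
static int probe_return(uint32_t cpu, uint64_t counter_now, uint64_t *delta)
{
	if (cpu >= MAX_CPUS || !snapshots[cpu].valid)
		return -1;
	/* Unsigned subtraction stays correct across a counter wrap. */
	*delta = counter_now - snapshots[cpu].start;
	snapshots[cpu].valid = 0;
	return 0;
}
```

Fed with the first enter/exit pair from the trace above (cyc:48746333 and
cyc:48756041 on CPU 3), this yields a delta of 9708 cycles for that
sys_write call.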



[RFC PATCH 0/4] perf tools: Use the new ability of eBPF programs to access hardware PMU counter

2015-08-27 Thread Kaixu Xia
Following the discussion at
https://lkml.org/lkml/2015/5/27/1027,
we want to give eBPF programs the ability to access hardware PMU counters
and to use this ability with perf.

Now the kernel side patch set 'bpf: Introduce the new ability of eBPF
programs to access hardware PMU counter' has been applied and can be
found in the net-next tree.

  ffe8690c85b8 ("perf: add the necessary core perf APIs when accessing events counters in eBPF programs")
  2a36f0b92eb6 ("bpf: Make the bpf_prog_array_map more generic")
  ea317b267e9d ("bpf: Add new bpf map type to store the pointer to struct perf_event")
  35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
  47efb30274cb ("samples/bpf: example of get selected PMU counter value")

According to the design plan, we still need the perf side code.
This patch set is based on Wang Nan's patches (perf tools: filtering events
using eBPF programs).
(git://git.kernel.org/pub/scm/linux/kernel/git/pi3orama/linux tags/perf-ebpf-for-acme-20150821)
The kernel side patch set above also needs to be merged if you want to
test this patch set.

An example is pasted at the bottom of this cover letter. In that example,
we can get the cpu_cycles and exception taken in sys_write.

 $ cat /sys/kernel/debug/tracing/trace_pipe
 $ ./perf record --event perf-bpf.o ls
...
 cat-1653  [003] d..1 88174.613854: : ente:  CPU-3  cyc:48746333  exc:84
 cat-1653  [003] d..2 88174.613861: : exit:  CPU-3  cyc:48756041  exc:84
 cat-1653  [003] d..1 88174.613872: : ente:  CPU-3  cyc:48771199  exc:86
 cat-1653  [003] d..2 88174.613879: : exit:  CPU-3  cyc:48780448  exc:86
 cat-1678  [003] d..1 88174.615001: : ente:  CPU-3  cyc:50293479  exc:93
sshd-1669  [000] d..1 88174.615199: : ente:  CPU-0  cyc:44402694  exc:51
sshd-1669  [000] d..2 88174.615283: : exit:  CPU-0  cyc:44517335  exc:51
  ls-1680  [003] d..1 88174.620260: : ente:  CPU-3  cyc:57281750  exc:241
sshd-1669  [000] d..1 88174.620474: : ente:  CPU-0  cyc:44998837  exc:69
sshd-1669  [000] d..2 88174.620549: : exit:  CPU-0  cyc:45101855  exc:69
sshd-1669  [000] d..1 88174.620608: : ente:  CPU-0  cyc:45181848  exc:77
sshd-1669  [000] d..2 88174.620709: : exit:  CPU-0  cyc:45317439  exc:78
sshd-1669  [000] d..1 88174.620801: : ente:  CPU-0  cyc:45441321  exc:87
sshd-1669  [000] d..2 88174.620856: : exit:  CPU-0  cyc:45515882  exc:87
...

Limitation of this patch set: eBPF programs can only create and access
perf events bound to CPUs; they cannot do so for events bound to a PID.

The details of the patches are as follows:

Patch 1/4 introduces bpf_update_elem() and perf_event_open(), so that we can
store pointers to struct perf_event in maps;

Patch 2/4 collects BPF_MAP_TYPE_PERF_EVENT_ARRAY map definitions
from the 'maps' section and matches events to maps;

Patch 3/4 saves the perf event fds from the 'maps' section to
'struct bpf_object', so that we can enable/disable these perf events
at the appropriate time;

Patch 4/4 enables/disables the perf events stored in 'struct bpf_object';
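
For readers unfamiliar with the user-space side, the following is a rough
sketch of what opening one counter per CPU and toggling it later could look
like with the raw perf_event_open(2) syscall and the perf ioctls. It is an
illustration only and not the code from these patches: init_cycles_attr(),
open_per_cpu() and set_enabled() are hypothetical helpers, and starting the
events disabled is an assumption made here. Passing pid == -1 together with
cpu >= 0 is what makes the events per-CPU rather than per-task, which
corresponds to the limitation mentioned above.

```c
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill an attr roughly equivalent to the "cycles" map in the example. */
static void init_cycles_attr(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES; /* value 0, as in the example */
	attr->sample_period = 0x7fffULL;
	attr->disabled = 1; /* assumption: open disabled, enable later */
}

/*
 * Open one event per CPU. pid == -1 with cpu >= 0 counts every task
 * running on that CPU. Failed slots are left as -1; the return value
 * is the number of events that were opened successfully.
 */
static int open_per_cpu(struct perf_event_attr *attr, int *fds, int ncpus)
{
	int opened = 0;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		fds[cpu] = (int)syscall(__NR_perf_event_open, attr,
					-1, cpu, -1, 0UL);
		if (fds[cpu] >= 0)
			opened++;
	}
	return opened;
}

/* Enable or disable every event that was opened successfully. */
static void set_enabled(const int *fds, int ncpus, int enable)
{
	for (int i = 0; i < ncpus; i++) {
		if (fds[i] < 0)
			continue;
		ioctl(fds[i], enable ? PERF_EVENT_IOC_ENABLE
				     : PERF_EVENT_IOC_DISABLE, 0);
	}
}
```

A real tool would also close the fds when done and would need sufficient
privileges (see perf_event_paranoid) for system-wide per-CPU events.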

   EXAMPLE
  - perf-bpf.c -

 struct perf_event_map_def SEC("maps") my_cycles_map = {
	.map_def = {
		.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
		.key_size = sizeof(int),
		.value_size = sizeof(u32),
		.max_entries = 32,
	},
	.attr = {
		.freq = 0,
		.inherit = 0,
		.sample_period = 0x7fffULL,
		.type = PERF_TYPE_HARDWARE,
		.read_format = 0,
		.sample_type = 0,
		.config = 0,	/* PMU: cycles */
	},
 };

 struct perf_event_map_def SEC("maps") my_exception_map = {
	.map_def = {
		.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
		.key_size = sizeof(int),
		.value_size = sizeof(u32),
		.max_entries = 32,
	},
	.attr = {
		.freq = 0,
		.inherit = 0,
		.sample_period = 0x7fffULL,
		.type = PERF_TYPE_RAW,
		.read_format = 0,
		.sample_type = 0,
		.config = 0x09,	/* PMU: exception */
	},
 };

 SEC("ente=sys_write")
 int bpf_prog_1(struct pt_regs *ctx)
 {
	u64 count_cycles, count_exception;
	u32 key = bpf_get_smp_processor_id();	/* index the maps by current CPU */
	char fmt[] = "ente:  CPU-%d cyc:%llu exc:%llu\n";

	count_cycles = bpf_perf_event_read(&my_cycles_map, key);
	count_exception = bpf_perf_event_read(&my_exception_map, key);
	bpf_trace_printk(fmt, sizeof(fmt), key, count_cycles, count_exception);

	return 0;
 }

 SEC("exit=sys_write%return")
 
