On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) <wangn...@huawei.com> wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgr...@netflix.com> wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be
>>> used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late reply.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes, the following example samples at 999 Hertz, and emits the
instruction pointer only when it is within a custom address range, as
checked by BPF. Eg:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]
# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
              dd  6501  3058.117379:
      BPF output: 0000: 3c 4c 21 81 ff ff ff ff  <L!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.130392:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.131393:
      BPF output: 0000: 55 4c 21 81 ff ff ff ff  UL!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.149411:
      BPF output: 0000: e1 4b 21 81 ff ff ff ff  .K!.....
                  0008: 00 00 00 00              ....

              dd  6501  3058.155417:
      BPF output: 0000: 76 4c 21 81 ff ff ff ff  vL!.....
                  0008: 00 00 00 00              ....

For that example, perf is running a BPF program to emit filtered
details, and running a second perf to configure sampling. We can
certainly improve how this works. And this will be much more
interesting once perf can emit maps, and a perf BPF program can
populate a map.

Here's sampleip_range.c:

/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START	0xffffffff81214b90
#define RANGE_END	0xffffffff81214cd0

struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
	(void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
	(void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
	unsigned long) =
	(void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
	unsigned long long pad;
	struct pt_regs *regs;
	struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
	struct pt_regs regs = {};

	probe_read(&regs, sizeof(regs), ctx->regs);
	if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
		perf_event_output(ctx, &channel, get_smp_processor_id(),
		    &regs.ip, sizeof(regs.ip));
	}

	return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
>   # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for much more than just the perf
timer. The perf:perf_hrtimer tracepoint also has registers and perf
context as arguments, which can be used for profiling programs.

Thanks for the comments,

Brendan
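
As a reading aid for the "BPF output" hex dumps above, here is a minimal
userspace sketch (not part of the patch or of perf) that turns one
payload back into an instruction pointer and repeats the range test from
sampleip_range.c. The helper names decode_ip() and ip_in_range() are
hypothetical, and the sketch assumes a little-endian machine (x86_64)
where the first 8 bytes of the payload are regs.ip; the trailing four
zero bytes look like padding and are ignored.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same values as RANGE_START/RANGE_END in sampleip_range.c. */
#define RANGE_START	0xffffffff81214b90ULL
#define RANGE_END	0xffffffff81214cd0ULL

/*
 * Hypothetical helper: assemble the first 8 payload bytes into a 64-bit
 * address, assuming little-endian byte order. Returns 0 if fewer than
 * 8 bytes are available.
 */
static uint64_t decode_ip(const unsigned char *buf, size_t len)
{
	uint64_t ip = 0;
	size_t i;

	if (len < 8)
		return 0;
	for (i = 0; i < 8; i++)
		ip |= (uint64_t)buf[i] << (8 * i);
	return ip;
}

/* Hypothetical helper: the same half-open range test the BPF program uses. */
static bool ip_in_range(uint64_t ip)
{
	return ip >= RANGE_START && ip < RANGE_END;
}
```

For the first sample shown, the bytes 3c 4c 21 81 ff ff ff ff decode to
0xffffffff81214c3c, which falls inside the configured range, as expected
for anything the BPF program emitted.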