Re: [RFC][PATCH 00/10] Add trace event support to eBPF
On Tue, 2016-02-16 at 20:51 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 16, 2016 at 04:35:27PM -0600, Tom Zanussi wrote:
> > On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> > > On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > > this hist triggers belong in the kernel. BPF already can do
> > > way more complex aggregation and histograms.
> >
> > Way more? I still can't accomplish with eBPF some of the most basic and
> > helpful use cases that I can with hist triggers, such as using
> > stacktraces as hash keys. And the locking in the eBPF hashmap
> > implementation prevents anything like the function_hist [1] tracer from
> > being implemented on top of it:
>
> Both statements are not true.

Erm, not exactly...

> In the link from previous email take a look at funccount_example.txt:
>
> # ./funccount 'vfs_*'
> Tracing... Ctrl-C to end.
> ^C
> ADDR             FUNC                          COUNT
> 811efe81         vfs_create                        1
> 811f24a1         vfs_rename                        1
> 81215191         vfs_fsync_range                   2
> 81231df1         vfs_lock_file                    30
> 811e8dd1         vfs_fstatat                     152
> 811e8d71         vfs_fstat                       154
> 811e4381         vfs_write                       166
> 811e8c71         vfs_getattr_nosec               262
> 811e8d41         vfs_getattr                     262
> 811e3221         vfs_open                        264
> 811e4251         vfs_read                        470
> Detaching...

When I tried to use it to trace all functions:

  # ./funccount.py '*'

I got about 5 minutes worth of these kinds of error messages, which I
couldn't break out of:

  write of "p:kprobes/p_rootfs_mount rootfs_mount" into kprobe_events failed: Device or resource busy
  open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_int_noop/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_irq_pending_noop/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_legacy_pic_probe/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_unmask_8259A_irq/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_enable_8259A_irq/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_mask_8259A_irq/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_disable_8259A_irq/id): Too many open files
  open(/sys/kernel/debug/tracing/events/kprobes/p_i8259A_irq_pending/id): Too many open files
  ...

Which is probably a good thing, since that's a relatively common thing
for someone to try, whereas this subset probably isn't:

  # ./funccount.py '*spin*'

Which on my machine resulted in a hard lockup on all CPUs.  I'm not set
up to grab serial output on that machine, but here's a screenshot of
most of the stacktrace, all I could get:

  http://picpaste.com/bcc-spinstar-hardlock-fobQbcuG.JPG

There's nothing special about that machine and it's running a stock
4.4.0 kernel, so it should be pretty easy to reproduce on anything with
just a BPF-enabled config.  If not, let me know and I'll send more
specific info.

> And this is done without adding new code to the kernel.
>
> Another example is offwaketime that uses two stack traces as
> part of single key.
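For comparison, using a kernel stacktrace as a hist trigger hash key
needs no walker code at all, only an event's trigger and hist files.
Here is a minimal sketch, assuming the syntax from the hist triggers
v13 series and picking sched_switch purely as an example event:

  # echo 'hist:keys=stacktrace' > /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
  # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist

The hist file then reports a hitcount aggregated per unique kernel
stacktrace leading up to the event.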
offwaketime, meanwhile, has the below code in the script itself
implementing a stack walker:

static u64 get_frame(u64 *bp) {
    if (*bp) {
        // The following stack walker is x86_64 specific
        u64 ret = 0;
        if (bpf_probe_read(&ret, sizeof(ret), (void *)(*bp+8)))
            return 0;
        if (bpf_probe_read(bp, sizeof(*bp), (void *)*bp))
            *bp = 0;
        if (ret < __START_KERNEL_map)
            return 0;
        return ret;
    }
    return 0;
}

int waker(struct pt_regs *ctx, struct task_struct *p) {
    u32 pid = p->pid;

    if (!(FILTER))
        return 0;

    u64 bp = 0;
    struct wokeby_t woke = {};
    int depth = 0;

    bpf_get_current_comm(&woke.name, sizeof(woke.name));
    bp = ctx->bp;

    // unrolled loop (MAXWDEPTH):
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    if (!(woke.ret[depth++] = get_frame(&bp))) goto out;
    woke.ret[depth] = get_frame(&bp);

I think one call to this pretty
Re: [RFC][PATCH 00/10] Add trace event support to eBPF
On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > Hi,
> >
> > As promised in previous threads, this patchset shares some common
> > functionality with the hist triggers code and enables trace events to
> > be accessed from eBPF programs.
>
> great that you've started working on BPF!
>
> > It needs to be applied on top of current tracing/for-next with the
> > hist triggers v13 patches applied:
> >
> >   https://lkml.org/lkml/2015/12/10/640
> >
> > Any input welcome, especially things that could be done better wrt
> > eBPF, since I'm not as familiar with that code...
>
> this dependency looks odd. As I was saying a year ago I don't think

It's not actually a dependency, just a consequence of trying to share
code for this RFC that happened to already be in hist triggers.  I
could just as well have removed the hunk that deletes that code from
hist triggers, but one of the main points was to show how they could
be shared.

> this hist triggers belong in the kernel. BPF already can do
> way more complex aggregation and histograms.

Way more?  I still can't accomplish with eBPF some of the most basic
and helpful use cases that I can with hist triggers, such as using
stacktraces as hash keys.  And the locking in the eBPF hashmap
implementation prevents anything like the function_hist [1] tracer
from being implemented on top of it:

  .
  .
  .
  ip: [a0166350] gen7_render_get_cmd_length_mask    hitcount:  18551173520
  ip: [817589b0] _cond_resched                      hitcount:  19242157574
  ip: [8175bd60] _raw_spin_lock                     hitcount:  19275222963
  ip: [81232c60] __fdget                            hitcount:  22865629762
  ip: [81232c00] __fget_light                       hitcount:  24594932641
  ip: [8175bed0] _raw_spin_lock_irqsave             hitcount:  26510496751
  ip: [8175bb30] _raw_spin_unlock_irqrestore        hitcount:  26525916042
  ip: [a018d6c0] gen6_ring_get_seqno                hitcount:  95888137753

  Totals:
      Hits: 1020596753338
      Entries: 7102
      Dropped: 0

[1] http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/commit/?h=tzanussi/func-hist-v0&id=fa123181a068574bd9e300416cc947aa7424d808

> Take a look at all the tools written on top of it:
> https://github.com/iovisor/bcc/tree/master/tools

That's great, but it's all out-of-tree.  Supporting out-of-tree users
has never been justification for merging in-kernel code (or for
blocking it from being merged).

> I don't buy into reasoning that hist triggers are needed to do
> performance analysis in the busybox. If not in busybox,
> just use perf+bpf or bcc.

I have systems with tiny amounts of memory and storage that have zero
chance of ever having a compiler or Python running on them.  On those
systems it's indispensable to have access to all of, and nothing more
than, the ftrace-based tools and the trace event files directly.

I even have systems that don't have a shell at all, but can
nonetheless access all the ftrace and trace event functionality
through a different channel.

hist triggers is just another one of those tools, a logical
progression towards allowing users in such a situation (among others)
to make more efficient and concise use of the trace event data (see
the short file-only example further below).  I don't see what your
problem with that is.

> that aside we do need to be able to attach bpf to tracepoints.
> The key goal of bpf+tracepoints is to be faster than bpf+kprobe.
> kprobes were good enough for almost all use cases so far except
> when we started processing millions of events per second via
> kprobe+bpf. At that point even optimized kprobe adds too much
> overhead.
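As an aside, for the busybox-class systems described above, the whole
hist trigger workflow goes through the existing trace event files.  A
minimal sketch, again assuming the v13 syntax and using the
kmem/kmalloc event purely as an example:

  # echo 'hist:keys=call_site:vals=bytes_req' > /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
  # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
  # echo '!hist:keys=call_site:vals=bytes_req' > /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger

No compiler, Python, or program loading is involved; the last command
removes the trigger again.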
It sounds from that last point like you're mainly interested in
hooking into the trace events themselves but not the trace event
data, which is fine, but other tools might find some value in that
data, even if accessing it isn't the most optimal path.

I haven't measured the overhead of accessing data from the trace
buffers, but I do know that the hist triggers have no problem logging
millions of events per second.  In the past I have measured the basic
hist triggers mechanism and found it to be somewhat faster than the
ftrace function tracer itself.  Over a kernel compile:

no tracing:

  real    110m47.817s
  user    99m34.144s
  sys     7m19.928s

function tracer enabled:

  real    138m6.854s
  user    100m57.692s
  sys     36m27.724s

function_graph tracer enabled:

  real    194m38.052s
  user    102m32.192s
  sys     97m47.056s

function_hist tracer enabled:

  real    128m44.682s
  user    100m29.080s
  sys     26m52.880s

Given the fact that I haven't heard people complain