On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote: >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote: >> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't >> > planning to), I'd put a tracepoint in perf_event_overflow() called >> > "perf:perf_overflow", with the same arguments. That could then be used >> > for all PMU overflow events, without needing to add specific >> > tracepoints. >> >> Could we not teach BPF to replace event->overflow_handler and inject >> itself there? >> >> We don't currently have nice interfaces for doing that, but it should be >> possible to do I think. We already have the indirect function call, so >> injecting ourself there has 0 overhead.
Sounds like a good idea, especially for things like struct file_operations so that we can statically instrument file system read/writes with zero non-enabled overhead, and not worry about high frequency workloads (>10M events/sec). These perf probes aren't high frequency, though, and the code is not normally in use, so overhead should be much less of a concern. Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the tracepoint code is still adding a mem read, conditional, and branch, then that's not many instructions, especially considering the normal use case of these perf functions: creating records and writing to a perf ring buffer, then picking that up in user space by perf, then either processing it live or writing to perf.data, back to the file system, etc. It would be hard to benchmark the effect of adding a few instructions to that path (and any results may be more sensitive to cache line placement than the instructions). The perf:perf_hrtimer probe point is also reading state mid-way through a function, so it's not quite as simple as wrapping the function pointer. I do like that idea, though, but for things like struct file_operations. > > you're right. All makes sense. I guess I was too lazy to look into > how to do it properly. Adding a tracepoint looked like quick and > easy way to achieve the same. > As far as api goes probably existing IOC_SET_BPF ioctl will do too. > Currently overflow_handler is set at event alloc time. If we start > changing it on the fly with atomic xchg(), afaik things shouldn't > break, since each overflow_handler is run to completion and doesn't > change global state, right? > How would it be implemented? I was thinking of adding explicit wrappers, eg: static ssize_t __ext4_file_write_iter_tracepoint(struct kiocb *iocb, struct iov_iter *from) { trace_ext4_write_iter(iocb, from); ext4_file_write_iter(iocb, from); } Then switching in __ext4_file_write_iter_tracepoint() as needed. I was wondering about doing this dynamically, but wouldn't that then create another problem of maintenance -- we'd need to decorate such code with a comment, at least, to say "this function is exposed as a tracepoint". If a dynamic approach is still desired, and the goal is zero non-enabled overhead, then I'd also wonder if we could leverage kprobes to do this. Imagine walking a file_operations to find the addresses, and then kprobe-ing the addresses. Not the same (or probably as efficient) as setting the function pointer, but, using existing kprobes. Brendan