On 7/17/15 3:43 AM, kaixu xia wrote:
> There are many useful PMUs provided by X86 and other architectures. By
> combining PMU, kprobe and eBPF program together, many interesting things
> can be done. For example, by probing at sched:sched_switch we can
> measure IPC changing between different processes by watching 'cycle' PMU
> counter; by probing at entry and exit points of a kernel function we are
> able to compute cache miss rate for a function by collecting
> 'cache-misses' counter and see the differences. In summary, we can
> define the begin and end points of a procedure, insert kprobes on them,
> attach two BPF programs and let them collect specific PMU counter.
That would definitely be a useful feature.

As far as the overall design goes, I think it should be done slightly
differently. The addition of 'flags' to all maps is a bit hacky and it
seems to have a few holes. It's better to reuse the 'store fds into maps'
code that prog_array is using. You can add a new map type
BPF_MAP_TYPE_PERF_EVENT_ARRAY and reuse most of the arraymap.c code.
The program also wouldn't need to do lookup+read_pmu, so instead of:

  r0 = 0 (the chosen key: CPU-0)
  *(u32 *)(fp - 4) = r0
  value = bpf_map_lookup_elem(map_fd, fp - 4);
  count = bpf_read_pmu(value);

you will be able to do:

  count = bpf_perf_event_read(perf_event_array_map_fd, index)

which will be faster.
Note, I'd prefer the 'bpf_perf_event_read' name for the helper.

Then inside the helper we really cannot do mutex, sleep or smp_call,
but since programs are always executed with preemption disabled and
never from NMI, I think something like the following should work:

u64 bpf_perf_event_read(u64 r1, u64 index,...)
{
  struct bpf_perf_event_array *array = (void *) (long) r1;
  struct perf_event *event;

  if (unlikely(index >= array->map.max_entries))
      return -EINVAL;
  event = array->events[index];
  if (event->state != PERF_EVENT_STATE_ACTIVE)
      return -EINVAL;
  if (event->oncpu != raw_smp_processor_id())
      return -EINVAL;
  __perf_event_read(event);
  return perf_event_count(event);
}

Not sure whether we need to disable irqs around __perf_event_read;
I think it should be ok without.
Also, during the store of an FD into perf_event_array you'd need
to filter out all crazy events. I would limit it to a few
basic types first.

btw, make sure you do your tests with lockdep and other debugs on.
And for the sample code please use C for the bpf program. Not many
people can read bpf asm ;)
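
The order of checks in the helper sketched above (bounds, then event state, then CPU ownership, then the read) can be tried out in user space. What follows is only a rough model under stated assumptions, not kernel code: every model_* name is hypothetical, the stored count stands in for what __perf_event_read() would refresh, and the NULL-slot check and the separate output parameter are additions that are not in the sketch above. The output parameter is there because a negative errno returned through a u64 would otherwise look like a very large counter value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
enum model_event_state { MODEL_EVENT_OFF = 0, MODEL_EVENT_ACTIVE = 1 };

struct model_event {
    enum model_event_state state; /* only ACTIVE events may be read */
    int oncpu;                    /* CPU the counter is scheduled on */
    uint64_t count;               /* value the counter would report */
};

struct model_event_array {
    uint32_t max_entries;
    struct model_event **events;  /* a slot is NULL if nothing was stored */
};

/*
 * Model of the proposed read path. Returns 0 and stores the counter in
 * *out on success, or a negative errno. this_cpu plays the role of
 * raw_smp_processor_id() in the kernel sketch.
 */
int model_perf_event_read(const struct model_event_array *array,
                          uint64_t index, int this_cpu, uint64_t *out)
{
    const struct model_event *event;

    if (index >= array->max_entries)
        return -EINVAL;               /* index out of bounds */

    event = array->events[index];
    if (event == NULL)
        return -ENOENT;               /* assumption: empty slot is an error */

    if (event->state != MODEL_EVENT_ACTIVE)
        return -EINVAL;               /* counter is not running */

    if (event->oncpu != this_cpu)
        return -EINVAL;               /* would need a cross-CPU call */

    *out = event->count;
    return 0;
}
```

In this model, a read of an active event from the CPU it is scheduled on succeeds, while the same event read from any other CPU is rejected, which is the property that lets the helper avoid smp_call.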