On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > to having to cope with munmapping parts of existing mmaps, etc.
> > 
> > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> > now it would be used just in this clean case for undoing a
> > PERF_RECORD_MMAP for a BPF program.
> > 
> > The ABI is already complicated, starting to use something called
> > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> > I think.
> 
> Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> part was getting the perf tool to dtrt for that use-case. But if we need
> unmap events, doing the unmap record now is the right thing.

Thanks for the pointers!
The answer is a bit long. pls bear with me.

I have considered adding MUNMAP to match existing MMAP, but went
without it because I didn't want to introduce new bit in perf_event_attr
and emit these new events in a misbalanced conditional way for prog load/unload.
Like old perf is asking kernel for mmap events via mmap bit, so prog load events
will be in perf.data, but old perf report won't recognize them anyway.
Whereas new perf would certainly want to catch bpf events and will set
both mmap and mumap bits.

Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
conditionally based on single mmap bit they will confuse old perf
and it will print warning about 'unknown events'.

Both situations are ugly, hence I went with reuse of MMAP event
for both load/unload.
In such case old perf silently ignores them. Which is what I wanted.
When we upgrade the kernel we cannot synchronize the kernel upgrade
(or downgrade) with user space perf package upgrade.
Hence not confusing old perf is important.
With new kernel new bpf mmap events get into perf.data and
new perf picks them up.

Few more considerations:

I consider synthetic perf events to be non-ABI. Meaning they're
emitted by perf user space into perf.data and there is a convention
on names, but it's not a kernel abi. Like RECORD_MMAP with
event.filename == "[module_name]" is an indication for perf report
to parse elf/build-id of dso==module_name.
There is no such support in the kernel. Kernel doesn't emit
such events for module load/unload. If in the future
we decide to extend kernel with such events they don't have
to match what user space perf does today.

Why this is important? To get to next step.
As Arnaldo pointed out this patch set is missing support for
JITed prog annotations and displaying asm code. Absolutely correct.
This set only helps perf to reveal the names of bpf progs that _were_
running at the time of perf record, but there is no way yet for
perf report to show asm code of the prog that was running.
In that sense bpf is drastically different from java, other jits
and normal profiling.
bpf JIT happens in the kernel and only kernel knows the mapping
between original source and JITed code.
In addition there are bpf2bpf functions. In the future there will
be bpf libraries, more type info, line number support, etc.
I strongly believe perf RECORD_* events should NOT care about
the development that happens on the bpf side.
The only thing kernel will be telling user space is that bpf prog
with prog_id=X was loaded.
Then based on prog_id the 'perf record' phase can query the kernel
for bpf related information. There is already a way to fetch
JITed image based on prog_id.
Then perf will emit synthetic RECORD_FOOBAR into perf.data
that will contain bpf related info (like complete JITed image)
and perf report can process it later and annotate things in UI.

It may seem that there is a race here.
Like when 'perf record' see 'bpf prog was loaded with prog_id=X' event
it will ask the kernel about prog_id=X, but that prog could be
unloaded already.
In such case prog_id will not exist and perf record can ignore such event.
So no race.
The kernel doesn't need to pass all information about bpf prog to
the user space via RECORD_*. Instead 'perf record' can emit
synthetic events into perf.data.
I was thinking to extend RECORD_MMAP with prog_id already
(instead of passing kallsyms's bpf prog name in event->mmap.filename)
but bpf functions don't have their own prog_id. Multiple bpf funcs
with different JITed blobs are considered to be a part of single prog_id.
So as a step one I'm only extending RECORD_MMAP with addr and kallsym
name of bpf function/prog.
As a step two the plan is to add notification mechanism for prog load/unload
that will include prog_id and design new synthetic RECORD_* events in
perf user space that will contain JITed code, line info, BTF, etc.

TLDR:
step 1 (this patch set)
Single bpf prog_load can call multiple
bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only
Similarly unload calls multiple
bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only

step 2 (future work)
single event for bpf prog_load with prog_id only.
Either via perf ring buffer or ftrace or tracepoints or some
other notification mechanism.

It may seem that step 2 obsoletes step 1. It can, but I think
it will complement it. There is a lot more code there and
a lot more discussions to have.
Step 1 is already big improvement.

Thoughts?

Reply via email to