Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-11-01 Thread Song Liu
Hi Arnaldo,

On Wed, Oct 17, 2018 at 5:11 AM Arnaldo Carvalho de Melo
 wrote:
>
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.

I have read Alexey's work on enabling aio for perf-record
(https://lkml.org/lkml/2018/10/15/169).
But I feel it is not really related to this work. Did I miss anything here?

For VIP events, I think we need more mmaps. Currently, the default setting
uses one mmap per CPU. To capture VIP events, I think we need another mmap
per CPU. The VIP events will be sent to the special mmap, and user space
will process these events before the end of perf-record.

Does this approach make sense?

Thanks!
Song


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Alexei Starovoitov
On 10/17/18 2:31 PM, Arnaldo Carvalho de Melo wrote:
>
> Keep all that info in a file, as I described above. Or keep it for a
> while, to give that thread in userspace time to get it and tell the
> kernel that it can throw it away.

Stashing by the kernel into a file is a huge headache, since the format of
the file becomes kernel ABI.
Plus, why have it in two places (in a file and in normal kernel data
structures)?

> It may well be that most of the time the 'perf record' thread catching
> those events picks that up and saves it in
> /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded,
> no?

Whether perf user space stashes that info into perf.data as
synthetic records or stashes it somewhere in /var/tmp/ sounds about
equivalent to me. Both have their pros and cons. This is certainly
a good topic to discuss further.

But asking the kernel to keep JITed images and all relevant bpf data
after the program has been unloaded sounds really scary to me.
I'm struggling to think through the security implications of that.
How long is the kernel supposed to keep it? Some timeout?

As I explained already, the time it takes for perf to do a
_single_ get_fd_by_id syscall when it sees RECORD_MMAP with a prog_id
is pretty instant.
All other syscalls to grab the JITed image and everything else can
be done later. The prog will not go away, because perf will hold an fd.
If the prog was somehow unloaded before perf could do get_fd_by_id,
no one cares about such programs, since there is close to zero
chance that the program was attached to anything and absolutely
no chance that it ran.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 17, 2018 at 07:08:37PM +, Alexei Starovoitov escreveu:
> On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> > Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu:
> >> On 10/17/18 8:09 AM, David Ahern wrote:
> >>> On 10/16/18 11:43 PM, Song Liu wrote:
>  I agree that processing events while recording has significant overhead.
>  In this case, perf user space needs to know details about the jited BPF
>  program. It is impossible to pass all these details to user space through
>  the relatively stable ring_buffer API. Therefore, some processing of the
>  data is necessary (get bpf prog_id from the ring buffer, and then fetch
>  program details via BPF_OBJ_GET_INFO_BY_FD).
> 
>  I have some ideas on processing important data with relatively low
>  overhead. Let me try to implement it.
> 
> >>>
> >>> As I understand it, you want this series:
> >>>
> >>>  kernel: add event to perf buffer on bpf prog load
> >>>
> >>>  userspace: perf reads the event and grabs information about the program
> >>> from the fd
> >>>
> >>> Is that correct?
> >>>
> >>> Userspace is not awakened immediately when an event is added to the
> >>> ring. It is awakened once the number of events crosses a watermark. That
> >>> means there is an unknown - and potentially long - time window where the
> >>> program can be unloaded before perf reads the event.
> >
> >>> So no matter what you do expecting perf record to be able to process the
> >>> event quickly is an unreasonable expectation.
> >
> >> yes... unless we go with threaded model as Arnaldo suggested and use
> >> single event as a watermark to wakeup our perf thread.
> >> In such case there is still a race window between user space waking up
> >> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> >> and some other process trying to instantly unload the prog it
> >> just loaded.
> >> I think such race window is extremely tiny and if perf misses
> >> those load/unload events it's a good thing, since there is no chance
> >> that normal pmu event samples would be happening during prog execution.

> >> The alternative approach with no race window at all is to burden kernel
> >> RECORD_* events with _all_ information about bpf prog. Which is jited
> >> addresses, jited image itself, info about all subprogs, info about line
> >> info, all BTF data, etc. As I said earlier I'm strongly against such
> >> RECORD_* bloating.
> >> Instead we need to find a way to process new RECORD_BPF events with
> >> single prog_id field in perf user space with minimal race
> >> and threaded approach sounds like a win to me.

> > There is another alternative, I think: put just a content based hash,
> > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> > validator does the jit, etc, it stashes the content that
> > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> > the kernel right after getting stuff from sys_bpf() and preparing it for
> > use, then we know that in (start, end) we have blob foo with content id,
> > that we will use to retrieve information that augments what we know with
> > just (start, end, id) and allows annotation, etc.

> > That stash space for jitted stuff gets garbage collected from time to
> > time or is even completely disabled if the user is not interested in
> > such augmentation, just like one can do disabling perf's ~/.debug/
> > directory hashed by build-id.

> > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> > cookie and we solve the other race we already have with kernel modules,
> > DSOs, etc.

> > I have mentioned this before, there were objections, perhaps this time I
> > formulated in a different way that makes it more interesting?
 
> that 'content based hash' we already have. It's called program tag.

But that was calculated by whom? Userspace? It can't do that; it's the
kernel that ultimately puts together, from what userspace gave it, what
we need to do performance analysis, line numbers, etc.

> and we already taught iovisor/bcc to stash that stuff into
> /var/tmp/bcc/bpf_prog_TAG/ directory.
> Unfortunately that approach didn't work.
> JITed image only exists in the kernel. It's there only when

That is why I said that _the_ kernel stashes that thing, not bcc/iovisor
or perf, the kernel calculates the hash, and it also puts that into the
PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place
the kernel stashed it.

> program is loaded and it's the one that matters the most for performance
> analysis, since sample IPs are pointing into it.

Agreed

> Also the line info mapping that user space knows is not helping much
> either, since verifier optimizes the instructions and then JIT
> does more. The debug_info <-> JIT relationship must be preserved
> by the kernel and returned to user space.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Alexei Starovoitov
On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu:
>> On 10/17/18 8:09 AM, David Ahern wrote:
>>> On 10/16/18 11:43 PM, Song Liu wrote:
 I agree that processing events while recording has significant overhead.
 In this case, perf user space needs to know details about the jited BPF
 program. It is impossible to pass all these details to user space through
 the relatively stable ring_buffer API. Therefore, some processing of the
 data is necessary (get bpf prog_id from the ring buffer, and then fetch
 program details via BPF_OBJ_GET_INFO_BY_FD).

 I have some ideas on processing important data with relatively low overhead.
 Let me try to implement it.

>>>
>>> As I understand it, you want this series:
>>>
>>>  kernel: add event to perf buffer on bpf prog load
>>>
>>>  userspace: perf reads the event and grabs information about the program
>>> from the fd
>>>
>>> Is that correct?
>>>
>>> Userspace is not awakened immediately when an event is added to the
>>> ring. It is awakened once the number of events crosses a watermark. That
>>> means there is an unknown - and potentially long - time window where the
>>> program can be unloaded before perf reads the event.
>
>>> So no matter what you do expecting perf record to be able to process the
>>> event quickly is an unreasonable expectation.
>
>> yes... unless we go with threaded model as Arnaldo suggested and use
>> single event as a watermark to wakeup our perf thread.
>> In such case there is still a race window between user space waking up
>> and doing _single_ bpf_get_fd_from_id() call to hold that prog
>> and some other process trying to instantly unload the prog it
>> just loaded.
>> I think such race window is extremely tiny and if perf misses
>> those load/unload events it's a good thing, since there is no chance
>> that normal pmu event samples would be happening during prog execution.
>
>> The alternative approach with no race window at all is to burden kernel
>> RECORD_* events with _all_ information about bpf prog. Which is jited
>> addresses, jited image itself, info about all subprogs, info about line
>> info, all BTF data, etc. As I said earlier I'm strongly against such
>> RECORD_* bloating.
>> Instead we need to find a way to process new RECORD_BPF events with
>> single prog_id field in perf user space with minimal race
>> and threaded approach sounds like a win to me.
>
> There is another alternative, I think: put just a content based hash,
> like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
> validator does the jit, etc, it stashes the content that
> BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
> the kernel right after getting stuff from sys_bpf() and preparing it for
> use, then we know that in (start, end) we have blob foo with content id,
> that we will use to retrieve information that augments what we know with
> just (start, end, id) and allows annotation, etc.
>
> That stash space for jitted stuff gets garbage collected from time to
> time or is even completely disabled if the user is not interested in
> such augmentation, just like one can do disabling perf's ~/.debug/
> directory hashed by build-id.
>
> I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
> is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
> cookie and we solve the other race we already have with kernel modules,
> DSOs, etc.
>
> I have mentioned this before, there were objections, perhaps this time I
> formulated in a different way that makes it more interesting?

that 'content based hash' we already have. It's called program tag.
and we already taught iovisor/bcc to stash that stuff into
/var/tmp/bcc/bpf_prog_TAG/ directory.
Unfortunately that approach didn't work.
JITed image only exists in the kernel. It's there only when
program is loaded and it's the one that matters the most for performance
analysis, since sample IPs are pointing into it.
Also the line info mapping that user space knows is not helping much
either, since verifier optimizes the instructions and then JIT
does more. The debug_info <-> JIT relationship must be preserved
by the kernel and returned to user space.

The only other non-race way is to keep all that info in the kernel after
the program is unloaded, but that is even worse than bloating RECORD_*s.




Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu:
> On 10/17/18 8:09 AM, David Ahern wrote:
> > On 10/16/18 11:43 PM, Song Liu wrote:
> >> I agree that processing events while recording has significant overhead.
> >> In this case, perf user space needs to know details about the jited BPF
> >> program. It is impossible to pass all these details to user space through
> >> the relatively stable ring_buffer API. Therefore, some processing of the
> >> data is necessary (get bpf prog_id from the ring buffer, and then fetch
> >> program details via BPF_OBJ_GET_INFO_BY_FD).
> >>
> >> I have some ideas on processing important data with relatively low overhead.
> >> Let me try to implement it.
> >>
> >
> > As I understand it, you want this series:
> >
> >  kernel: add event to perf buffer on bpf prog load
> >
> >  userspace: perf reads the event and grabs information about the program
> > from the fd
> >
> > Is that correct?
> >
> > Userspace is not awakened immediately when an event is added to the
> > ring. It is awakened once the number of events crosses a watermark. That
> > means there is an unknown - and potentially long - time window where the
> > program can be unloaded before perf reads the event.

> > So no matter what you do expecting perf record to be able to process the
> > event quickly is an unreasonable expectation.
 
> yes... unless we go with threaded model as Arnaldo suggested and use
> single event as a watermark to wakeup our perf thread.
> In such case there is still a race window between user space waking up
> and doing _single_ bpf_get_fd_from_id() call to hold that prog
> and some other process trying to instantly unload the prog it
> just loaded.
> I think such race window is extremely tiny and if perf misses
> those load/unload events it's a good thing, since there is no chance
> that normal pmu event samples would be happening during prog execution.
 
> The alternative approach with no race window at all is to burden kernel
> RECORD_* events with _all_ information about bpf prog. Which is jited
> addresses, jited image itself, info about all subprogs, info about line
> info, all BTF data, etc. As I said earlier I'm strongly against such
> RECORD_* bloating.
> Instead we need to find a way to process new RECORD_BPF events with
> single prog_id field in perf user space with minimal race
> and threaded approach sounds like a win to me.

There is another alternative, I think: put just a content based hash,
like a git commit id into a PERF_RECORD_MMAP3 new record, and when the
validator does the jit, etc, it stashes the content that
BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by
the kernel right after getting stuff from sys_bpf() and preparing it for
use, then we know that in (start, end) we have blob foo with content id,
that we will use to retrieve information that augments what we know with
just (start, end, id) and allows annotation, etc.

That stash space for jitted stuff gets garbage collected from time to
time or is even completely disabled if the user is not interested in
such augmentation, just like one can do disabling perf's ~/.debug/
directory hashed by build-id.

I think with this we have no races, the PERF_RECORD_MMAP3 gets just what
is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based
cookie and we solve the other race we already have with kernel modules,
DSOs, etc.

I have mentioned this before, there were objections, perhaps this time I
formulated in a different way that makes it more interesting?

- Arnaldo


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Alexei Starovoitov
On 10/17/18 8:09 AM, David Ahern wrote:
> On 10/16/18 11:43 PM, Song Liu wrote:
>> I agree that processing events while recording has significant overhead.
>> In this case, perf user space needs to know details about the jited BPF
>> program. It is impossible to pass all these details to user space through
>> the relatively stable ring_buffer API. Therefore, some processing of the
>> data is necessary (get bpf prog_id from the ring buffer, and then fetch
>> program details via BPF_OBJ_GET_INFO_BY_FD).
>>
>> I have some ideas on processing important data with relatively low overhead.
>> Let me try to implement it.
>>
>
> As I understand it, you want this series:
>
>  kernel: add event to perf buffer on bpf prog load
>
>  userspace: perf reads the event and grabs information about the program
> from the fd
>
> Is that correct?
>
> Userspace is not awakened immediately when an event is added to the
> ring. It is awakened once the number of events crosses a watermark. That
> means there is an unknown - and potentially long - time window where the
> program can be unloaded before perf reads the event.
>
> So no matter what you do expecting perf record to be able to process the
> event quickly is an unreasonable expectation.

yes... unless we go with the threaded model as Arnaldo suggested and use
a single event as a watermark to wake up our perf thread.
In such a case there is still a race window between user space waking up
and doing a _single_ bpf_get_fd_from_id() call to hold that prog,
and some other process trying to instantly unload the prog it
just loaded.
I think such a race window is extremely tiny, and if perf misses
those load/unload events it's a good thing, since there is no chance
that normal pmu event samples would be happening during prog execution.

The alternative approach, with no race window at all, is to burden kernel
RECORD_* events with _all_ information about the bpf prog: jited
addresses, the jited image itself, info about all subprogs, info about
line info, all BTF data, etc. As I said earlier, I'm strongly against
such RECORD_* bloating.
Instead we need to find a way to process new RECORD_BPF events with a
single prog_id field in perf user space with minimal race, and the
threaded approach sounds like a win to me.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Song Liu
Hi David,

On Wed, Oct 17, 2018 at 8:09 AM David Ahern  wrote:
>
> On 10/16/18 11:43 PM, Song Liu wrote:
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space needs to know details about the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get bpf prog_id from the ring buffer, and then fetch
> > program details via BPF_OBJ_GET_INFO_BY_FD).
> >
> > I have some ideas on processing important data with relatively low overhead.
> > Let me try to implement it.
> >
>
> As I understand it, you want this series:
>
>  kernel: add event to perf buffer on bpf prog load
>
>  userspace: perf reads the event and grabs information about the program
> from the fd
>
> Is that correct?

Yes, this is correct.

>
> Userspace is not awakened immediately when an event is added to the
> ring. It is awakened once the number of events crosses a watermark. That
> means there is an unknown - and potentially long - time window where the
> program can be unloaded before perf reads the event.
>
> So no matter what you do expecting perf record to be able to process the
> event quickly is an unreasonable expectation.

In this case, we don't really need to gather the information immediately. We
will lose information about some short-lived BPF programs, but those BPF
programs are less important for performance analysis anyway. I guess
the only missing case is when some BPF program gets loaded/unloaded many
times, so each instance is short-lived, but the overall perf impact is
significant. I think this case is not interesting either, as BPF programs
should not be used like this (considering the kernel needs to translate and
jit the program, unloading and reloading soon doesn't make any sense).

Thanks,
Song


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Song Liu
On Wed, Oct 17, 2018 at 5:50 AM Arnaldo Carvalho de Melo
 wrote:
>
> Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu:
> > Adding Alexey, Jiri and Namhyung as they worked/are working on
> > multithreading 'perf record'.
> >
> > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu:
> > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern  wrote:
> > > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > > I am working with Alexei on the idea of fetching BPF program information
> > > > > via BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > > > > I found that perf will not process event until the end of perf-record:
> >
> > > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > > .. 10 seconds later
> > > > > [ perf record: Woken up 34 times to write data ]
> > > > > machine__process_bpf_event: prog_id 6 loaded
> > > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > > > > In this example, the bpf program was loaded and then unloaded in
> > > > > another terminal. When machine__process_bpf_event() processes
> > > > > the load event, the bpf program is already unloaded. Therefore,
> > > > > machine__process_bpf_event() will not be able to get information
> > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > > as soon as perf get the event from kernel. I looked around the perf
> > > > > code for a while. But I haven't found a good example where some
> > > > > events are processed before the end of perf-record. Could you
> > > > > please help me with this?
> >
> > > > perf record does not process events as they are generated. Its sole job
> > > > is pushing data from the maps to a file as fast as possible meaning in
> > > > bulk based on current read and write locations.
> >
> > > > Adding code to process events will add significant overhead to the
> > > > record command and will not really solve your race problem.
> >
> > > I agree that processing events while recording has significant overhead.
> > > In this case, perf user space needs to know details about the jited BPF
> > > program. It is impossible to pass all these details to user space through
> > > the relatively stable ring_buffer API. Therefore, some processing of the
> > > data is necessary (get bpf prog_id from the ring buffer, and then fetch
> > > program details via BPF_OBJ_GET_INFO_BY_FD).
> >
> > > I have some ideas on processing important data with relatively low
> > > overhead. Let me try to implement it.
> >
> > Well, you could have a separate thread processing just those kinds of
> > events, associate it with a dummy event where you only ask for
> > PERF_RECORD_BPF_EVENTs.
> >
> > Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> > perf_event_attr:
> >
> > [root@seventh ~]# perf record -vv -e dummy sleep 01
> > 
> > perf_event_attr:
> >   type 1
> >   size 112
> >   config   0x9
> >   { sample_period, sample_freq }   4000
> >   sample_type  IP|TID|TIME|PERIOD
> >   disabled 1
> >   inherit  1
>
> These you would have disabled, no need for
> PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT
>
> >   mmap 1
> >   comm 1
> >   task 1
> >   mmap21
> >   comm_exec1
>
>

Thanks Arnaldo! This looks better than my original idea (using POLLPRI
to highlight special events). I will try to implement the BPF_EVENT in
this direction.

Song


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread David Ahern
On 10/16/18 11:43 PM, Song Liu wrote:
> I agree that processing events while recording has significant overhead.
> In this case, perf user space needs to know details about the jited BPF
> program. It is impossible to pass all these details to user space through
> the relatively stable ring_buffer API. Therefore, some processing of the
> data is necessary (get bpf prog_id from the ring buffer, and then fetch
> program details via BPF_OBJ_GET_INFO_BY_FD).
> 
> I have some ideas on processing important data with relatively low overhead.
> Let me try to implement it.
> 

As I understand it, you want this series:

 kernel: add event to perf buffer on bpf prog load

 userspace: perf reads the event and grabs information about the program
from the fd

Is that correct?

Userspace is not awakened immediately when an event is added to the
ring. It is awakened once the number of events crosses a watermark. That
means there is an unknown - and potentially long - time window where the
program can be unloaded before perf reads the event.

So no matter what you do, expecting perf record to be able to process the
event quickly is an unreasonable expectation.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Arnaldo Carvalho de Melo
Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu:
> Adding Alexey, Jiri and Namhyung as they worked/are working on
> multithreading 'perf record'.
> 
> Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu:
> > On Tue, Oct 16, 2018 at 4:43 PM David Ahern  wrote:
> > > On 10/15/18 4:33 PM, Song Liu wrote:
> > > > I am working with Alexei on the idea of fetching BPF program information
> > > > via BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > > to perf_event_type, and dumped these events to perf event ring buffer.
> 
> > > > I found that perf will not process event until the end of perf-record:
> 
> > > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > > .. 10 seconds later
> > > > [ perf record: Woken up 34 times to write data ]
> > > > machine__process_bpf_event: prog_id 6 loaded
> > > > machine__process_bpf_event: prog_id 6 unloaded
> > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> > > > In this example, the bpf program was loaded and then unloaded in
> > > > another terminal. When machine__process_bpf_event() processes
> > > > the load event, the bpf program is already unloaded. Therefore,
> > > > machine__process_bpf_event() will not be able to get information
> > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > > as soon as perf get the event from kernel. I looked around the perf
> > > > code for a while. But I haven't found a good example where some
> > > > events are processed before the end of perf-record. Could you
> > > > please help me with this?
> 
> > > perf record does not process events as they are generated. Its sole job
> > > is pushing data from the maps to a file as fast as possible meaning in
> > > bulk based on current read and write locations.
> 
> > > Adding code to process events will add significant overhead to the
> > > record command and will not really solve your race problem.
> 
> > I agree that processing events while recording has significant overhead.
> > In this case, perf user space needs to know details about the jited BPF
> > program. It is impossible to pass all these details to user space through
> > the relatively stable ring_buffer API. Therefore, some processing of the
> > data is necessary (get bpf prog_id from the ring buffer, and then fetch
> > program details via BPF_OBJ_GET_INFO_BY_FD).
> 
> > I have some ideas on processing important data with relatively low overhead.
> > Let me try to implement it.
> 
> Well, you could have a separate thread processing just those kinds of
> events, associate it with a dummy event where you only ask for
> PERF_RECORD_BPF_EVENTs.
> 
> Here is how to setup the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
> perf_event_attr:
> 
> [root@seventh ~]# perf record -vv -e dummy sleep 01
> 
> perf_event_attr:
>   type 1
>   size 112
>   config   0x9
>   { sample_period, sample_freq }   4000
>   sample_type  IP|TID|TIME|PERIOD
>   disabled 1
>   inherit  1

These you would have disabled, no need for
PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT

>   mmap 1
>   comm 1
>   task 1
>   mmap21
>   comm_exec1


>   freq 1
>   enable_on_exec   1
>   sample_id_all1
>   exclude_guest1
> 
> sys_perf_event_open: pid 12046  cpu 0  group_fd -1  flags 0x8 = 4
> sys_perf_event_open: pid 12046  cpu 1  group_fd -1  flags 0x8 = 5
> sys_perf_event_open: pid 12046  cpu 2  group_fd -1  flags 0x8 = 6
> sys_perf_event_open: pid 12046  cpu 3  group_fd -1  flags 0x8 = 8
> mmap size 528384B
> perf event ring buffer mmapped per cpu
> Synthesizing TSC conversion information
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.014 MB perf.data ]
> [root@seventh ~]#
> 
> [root@seventh ~]# perf evlist -v
> dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, 
> sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, 
> freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, 
> mmap2: 1, comm_exec: 1
> [root@seventh ~]# 
> 
> There is work ongoing in dumping one file per cpu and then, at post
> processing time merging all those files to get ordering, so one more
> file, for these VIP events, that require per-event processing would be
> ordered at that time with all the other per-cpu files.
> 
> - Arnaldo


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Arnaldo Carvalho de Melo
Adding Alexey, Jiri and Namhyung as they worked/are working on
multithreading 'perf record'.

Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu:
> On Tue, Oct 16, 2018 at 4:43 PM David Ahern  wrote:
> > On 10/15/18 4:33 PM, Song Liu wrote:
> > > I am working with Alexei on the idea of fetching BPF program information
> > > via BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > > to perf_event_type, and dumped these events to perf event ring buffer.

> > > I found that perf will not process event until the end of perf-record:

> > > root@virt-test:~# ~/perf record -ag -- sleep 10
> > > .. 10 seconds later
> > > [ perf record: Woken up 34 times to write data ]
> > > machine__process_bpf_event: prog_id 6 loaded
> > > machine__process_bpf_event: prog_id 6 unloaded
> > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]

> > > In this example, the bpf program was loaded and then unloaded in
> > > another terminal. When machine__process_bpf_event() processes
> > > the load event, the bpf program is already unloaded. Therefore,
> > > machine__process_bpf_event() will not be able to get information
> > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.

> > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > > as soon as perf get the event from kernel. I looked around the perf
> > > code for a while. But I haven't found a good example where some
> > > events are processed before the end of perf-record. Could you
> > > please help me with this?

> > perf record does not process events as they are generated. Its sole job
> > is pushing data from the maps to a file as fast as possible meaning in
> > bulk based on current read and write locations.

> > Adding code to process events will add significant overhead to the
> > record command and will not really solve your race problem.

> I agree that processing events while recording has significant overhead.
> In this case, perf user space need to know details about the the jited BPF
> program. It is impossible to pass all these details to user space through
> the relatively stable ring_buffer API. Therefore, some processing of the
> data is necessary (get bpf prog_id from ring buffer, and then fetch program
> details via BPF_OBJ_GET_INFO_BY_FD.
 
> I have some ideas on processing important data with relatively low overhead.
> Let me try to implement it.

Well, you could have a separate thread processing just those kinds of
events, associated with a dummy event where you only ask for
PERF_RECORD_BPF_EVENTs.

Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY
perf_event_attr:

[root@seventh ~]# perf record -vv -e dummy sleep 01

perf_event_attr:
  type 1
  size 112
  config   0x9
  { sample_period, sample_freq }   4000
  sample_type  IP|TID|TIME|PERIOD
  disabled 1
  inherit  1
  mmap 1
  comm 1
  freq 1
  enable_on_exec   1
  task 1
  sample_id_all1
  exclude_guest1
  mmap21
  comm_exec1

sys_perf_event_open: pid 12046  cpu 0  group_fd -1  flags 0x8 = 4
sys_perf_event_open: pid 12046  cpu 1  group_fd -1  flags 0x8 = 5
sys_perf_event_open: pid 12046  cpu 2  group_fd -1  flags 0x8 = 6
sys_perf_event_open: pid 12046  cpu 3  group_fd -1  flags 0x8 = 8
mmap size 528384B
perf event ring buffer mmapped per cpu
Synthesizing TSC conversion information
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data ]
[root@seventh ~]#

[root@seventh ~]# perf evlist -v
dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, 
sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, 
freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 
1, comm_exec: 1
[root@seventh ~]# 

There is work ongoing in dumping one file per cpu and then, at post
processing time merging all those files to get ordering, so one more
file, for these VIP events, that require per-event processing would be
ordered at that time with all the other per-cpu files.

- Arnaldo


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-17 Thread Song Liu
Hi David,

On Tue, Oct 16, 2018 at 4:43 PM David Ahern  wrote:
>
> On 10/15/18 4:33 PM, Song Liu wrote:
> > I am working with Alexei on the idea of fetching BPF program information via
> > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > I found that perf will not process events until the end of perf-record:
> >
> > root@virt-test:~# ~/perf record -ag -- sleep 10
> > .. 10 seconds later
> > [ perf record: Woken up 34 times to write data ]
> > machine__process_bpf_event: prog_id 6 loaded
> > machine__process_bpf_event: prog_id 6 unloaded
> > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > In this example, the bpf program was loaded and then unloaded in
> > another terminal. When machine__process_bpf_event() processes
> > the load event, the bpf program is already unloaded. Therefore,
> > machine__process_bpf_event() will not be able to get information
> > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > as soon as perf gets the event from the kernel. I looked around the perf
> > code for a while. But I haven't found a good example where some
> > events are processed before the end of perf-record. Could you
> > please help me with this?
>
> perf record does not process events as they are generated. Its sole job
> is pushing data from the maps to a file as fast as possible meaning in
> bulk based on current read and write locations.
>
> Adding code to process events will add significant overhead to the
> record command and will not really solve your race problem.

Thanks for the comment.

I agree that processing events while recording has significant overhead.
In this case, perf user space needs to know details about the JITed BPF
program. It is impossible to pass all these details to user space through
the relatively stable ring_buffer API. Therefore, some processing of the
data is necessary (get the bpf prog_id from the ring buffer, then fetch
program details via BPF_OBJ_GET_INFO_BY_FD).

I have some ideas on processing important data with relatively low overhead.
Let me try to implement it.

Thanks again,
Song


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-16 Thread David Ahern
On 10/15/18 4:33 PM, Song Liu wrote:
> I am working with Alexei on the idea of fetching BPF program information via
> BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> to perf_event_type, and dumped these events to perf event ring buffer.
> 
> I found that perf will not process events until the end of perf-record:
> 
> root@virt-test:~# ~/perf record -ag -- sleep 10
> .. 10 seconds later
> [ perf record: Woken up 34 times to write data ]
> machine__process_bpf_event: prog_id 6 loaded
> machine__process_bpf_event: prog_id 6 unloaded
> [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> 
> In this example, the bpf program was loaded and then unloaded in
> another terminal. When machine__process_bpf_event() processes
> the load event, the bpf program is already unloaded. Therefore,
> machine__process_bpf_event() will not be able to get information
> about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> 
> To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> as soon as perf gets the event from the kernel. I looked around the perf
> code for a while. But I haven't found a good example where some
> events are processed before the end of perf-record. Could you
> please help me with this?

perf record does not process events as they are generated. Its sole job
is pushing data from the maps to a file as fast as possible meaning in
bulk based on current read and write locations.

Adding code to process events will add significant overhead to the
record command and will not really solve your race problem.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-10-15 Thread Song Liu
On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov
 wrote:
>
> On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> >
> > > I have considered adding MUNMAP to match existing MMAP, but went
> > > without it because I didn't want to introduce a new bit in perf_event_attr
> > > and emit these new events in a misbalanced conditional way for prog 
> > > load/unload.
> > > Like old perf is asking kernel for mmap events via mmap bit, so prog load 
> > > events
> >
> > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 
> > ||
> > perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> > PERF_RECORD_MMAP records for the range of addresses that a BPF program
> > is being loaded on, right?
>
> right. it would be weird when prog load events are there, but not unload.
>
> > > will be in perf.data, but old perf report won't recognize them anyway.
> >
> > Why not? It should lookup the symbol and find it in the rb_tree of maps,
> > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> > BPF core, no? It'll be an unresolved symbol, but a resolved map.
> >
> > > Whereas new perf would certainly want to catch bpf events and will set
> > > both mmap and munmap bits.
> >
> > new perf with your code will find a symbol, not a map, because your code
> > catches a special case PERF_RECORD_MMAP and instead of creating a
> > 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> > 'struct map', right?
>
> right.
> bpf progs are more similar to kernel functions than to modules.
> For modules it makes sense to create a new map and insert symbols into it.
> For bpf JITed images there is no DSO to parse.
> Single bpf elf file may contain multiple bpf progs and each prog may contain
> multiple bpf functions. They will be loaded at different times and
> will have different lifetimes.
>
> > In theory the old perf should catch the PERF_RECORD_MMAP with a string
> > in the filename part and insert a new map into the kernel mmap rb_tree,
> > and then samples would be resolved to this map, but since there is no
> > backing DSO with a symtab, it would stop at that, just stating that the
> > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> > there is something in there that makes it ignore this PERF_RECORD_MMAP
> > emitted by the BPF kernel code when loading a new program.
>
> In /proc/kcore there is already a section for module range.
> Hence when perf processes bpf load/unload events the map is already created.
> Therefore the patch 3 only searches for it and inserts new symbol into it.
>
> In that sense the reuse of RECORD_MMAP event for bpf progs is indeed
> not exactly clean, since no new map is created.
> It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ?
>
> Such event potentially can be used for offline ksym resolution.
> perf could process /proc/kallsyms during perf record and emit all of them
> as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run
> on a different server and still find the right symbols.
>
> I guess, we can do bpf specific events too and keep RECORD_MMAP as-is.
> How about single PERF_RECORD_BPF event with internal flag for load/unload ?
>
> > Right, that is another unfortunate state of affairs, kernel module
> > load/unload should already be supported, reported by the kernel via a
> > proper PERF_RECORD_MODULE_LOAD/UNLOAD
>
> I agree with Peter here. It would be nice, but low priority.
> modules are mostly static. Loaded once and stay there.
>
> > There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> > should include a build-id, to avoid either userspace getting confused
> > when there is an update of some mmap DSO, for long running sessions, for
> > instance, or to have to scan the just recorded perf.data file for DSOs
> > with samples to then read it from the file system (more races).
> >
> > Have you ever considered having a build-id for bpf objects that could be
> > used here?
>
> build-id concept is not applicable to bpf.
> bpf elf files on disk don't have good correlation with what is
> running in the kernel. bpf bytestream is converted and optimized
> by the verifier. Then JITed.
> So debug info left in .o file and original bpf bytestream in .o are
> mostly useless.
> For bpf programs we have 'program tag'. It is computed over original
> bpf bytestream, so both kernel and user space can compute it.
> In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original
> source code of the program, so users looking at kernel stack traces
> with bpf_prog_TAG can find the source.
> It's similar to build-id, but not going to help perf to annotate
> actual x86 instructions inside JITed image and show src code.
> Since JIT runs in the kernel this problem cannot be solved by user space only.
> It's a difficult problem and we have a plan to tackle that,
> but it's step 2. A bunch of infra is needed on bpf side to
> preserve the 

Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Alexei Starovoitov
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
>  
> > I have considered adding MUNMAP to match existing MMAP, but went
> > without it because I didn't want to introduce a new bit in perf_event_attr
> > and emit these new events in a misbalanced conditional way for prog 
> > load/unload.
> > Like old perf is asking kernel for mmap events via mmap bit, so prog load 
> > events
> 
> By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> PERF_RECORD_MMAP records for the range of addresses that a BPF program
> is being loaded on, right?

right. it would be weird when prog load events are there, but not unload.

> > will be in perf.data, but old perf report won't recognize them anyway.
> 
> Why not? It should lookup the symbol and find it in the rb_tree of maps,
> with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> BPF core, no? It'll be an unresolved symbol, but a resolved map.
> 
> > Whereas new perf would certainly want to catch bpf events and will set
> > both mmap and munmap bits.
> 
> new perf with your code will find a symbol, not a map, because your code
> catches a special case PERF_RECORD_MMAP and instead of creating a
> 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> 'struct map', right?

right.
bpf progs are more similar to kernel functions than to modules.
For modules it makes sense to create a new map and insert symbols into it.
For bpf JITed images there is no DSO to parse.
Single bpf elf file may contain multiple bpf progs and each prog may contain
multiple bpf functions. They will be loaded at different times and
will have different lifetimes.

> In theory the old perf should catch the PERF_RECORD_MMAP with a string
> in the filename part and insert a new map into the kernel mmap rb_tree,
> and then samples would be resolved to this map, but since there is no
> backing DSO with a symtab, it would stop at that, just stating that the
> map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> there is something in there that makes it ignore this PERF_RECORD_MMAP
> emitted by the BPF kernel code when loading a new program.

In /proc/kcore there is already a section for module range.
Hence when perf processes bpf load/unload events the map is already created.
Therefore the patch 3 only searches for it and inserts new symbol into it.

In that sense the reuse of RECORD_MMAP event for bpf progs is indeed
not exactly clean, since no new map is created.
It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ?

Such event potentially can be used for offline ksym resolution.
perf could process /proc/kallsyms during perf record and emit all of them
as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run
on a different server and still find the right symbols.

I guess, we can do bpf specific events too and keep RECORD_MMAP as-is.
How about single PERF_RECORD_BPF event with internal flag for load/unload ?

> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

I agree with Peter here. It would be nice, but low priority.
modules are mostly static. Loaded once and stay there.

> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id, to avoid either userspace getting confused
> when there is an update of some mmap DSO, for long running sessions, for
> instance, or to have to scan the just recorded perf.data file for DSOs
> with samples to then read it from the file system (more races).
> 
> Have you ever considered having a build-id for bpf objects that could be
> used here?

build-id concept is not applicable to bpf.
bpf elf files on disk don't have good correlation with what is
running in the kernel. bpf bytestream is converted and optimized
by the verifier. Then JITed.
So debug info left in .o file and original bpf bytestream in .o are
mostly useless.
For bpf programs we have 'program tag'. It is computed over original
bpf bytestream, so both kernel and user space can compute it.
In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original
source code of the program, so users looking at kernel stack traces
with bpf_prog_TAG can find the source.
It's similar to build-id, but not going to help perf to annotate
actual x86 instructions inside JITed image and show src code.
Since JIT runs in the kernel this problem cannot be solved by user space only.
It's a difficult problem and we have a plan to tackle that,
but it's step 2. A bunch of infra is needed on bpf side to
preserve the association during src_in_C -> original bpf insns ->
translated bpf insns -> JITed asm.
Then report it back to user space and then teach perf to properly annotate 
progs.

> Will the JITed code from some BPF bytecode be different if the same
> 

Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Peter Zijlstra
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id

I thought the problem was that the kernel doesn't have the build-id in
the first place. So it cannot hand them out.



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Peter Zijlstra
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> > I consider synthetic perf events to be non-ABI. Meaning they're
> > emitted by perf user space into perf.data and there is a convention
> > on names, but it's not a kernel abi. Like RECORD_MMAP with
> > event.filename == "[module_name]" is an indication for perf report
> > to parse elf/build-id of dso==module_name.
> > There is no such support in the kernel. Kernel doesn't emit
> > such events for module load/unload. If in the future
> > we decide to extend kernel with such events they don't have
> > to match what user space perf does today.
> 
> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

Just wondering, is anyone actually doing enough module loading for this
to matter? (asks the CONFIG_MODULES=n guy).

I thought that was all a relatively static affair; you boot, you get
a few modules loaded for present hardware, the end.

Anyway, no real objection, just wonder if it's worth it.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-21 Thread Arnaldo Carvalho de Melo
On Thu, Sep 20, 2018 at 08:14:46PM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > > to having to cope with munmapping parts of existing mmaps, etc.
> > > 
> > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> > > now it would be used just in this clean case for undoing a
> > > PERF_RECORD_MMAP for a BPF program.
> > > 
> > > The ABI is already complicated, starting to use something called
> > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> > > I think.
> > 
> > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> > part was getting the perf tool to dtrt for that use-case. But if we need
> > unmap events, doing the unmap record now is the right thing.
> 
> Thanks for the pointers!
> The answer is a bit long. pls bear with me.

Ditto with me :-)
 
> I have considered adding MUNMAP to match existing MMAP, but went
> without it because I didn't want to introduce a new bit in perf_event_attr
> and emit these new events in a misbalanced conditional way for prog 
> load/unload.
> Like old perf is asking kernel for mmap events via mmap bit, so prog load 
> events

By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
perf_event_attr.mmap2 = 1 will cause the new kernel to emit
PERF_RECORD_MMAP records for the range of addresses that a BPF program
is being loaded on, right?

> will be in perf.data, but old perf report won't recognize them anyway.

Why not? It should lookup the symbol and find it in the rb_tree of maps,
with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
BPF core, no? It'll be an unresolved symbol, but a resolved map.

> Whereas new perf would certainly want to catch bpf events and will set
> both mmap and munmap bits.

new perf with your code will find a symbol, not a map, because your code
catches a special case PERF_RECORD_MMAP and instead of creating a
'struct map' will create a 'struct symbol' and insert it in the kallsyms
'struct map', right?
 
> Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
> conditionally based on single mmap bit they will confuse old perf
> and it will print warning about 'unknown events'.

That is unfortunate and I'll turn that part into a pr_debug()
 
> Both situations are ugly, hence I went with reuse of MMAP event
> for both load/unload.

So, it's doubly odd, i.e. MMAP used for mmap() and for munmap() and the
effects in the tooling is not to create or remove a 'struct map', but to
alter an existing symbol table for the kallsyms map.

> In such case old perf silently ignores them. Which is what I wanted.

In theory the old perf should catch the PERF_RECORD_MMAP with a string
in the filename part and insert a new map into the kernel mmap rb_tree,
and then samples would be resolved to this map, but since there is no
backing DSO with a symtab, it would stop at that, just stating that the
map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
there is something in there that makes it ignore this PERF_RECORD_MMAP
emitted by the BPF kernel code when loading a new program.

> When we upgrade the kernel we cannot synchronize the kernel upgrade
> (or downgrade) with user space perf package upgrade.

sure

> Hence not confusing old perf is important.

Thanks for trying to achieve that, and it's a pity that that "unknown
record" message is a pr_warning or pr_info and not a pr_debug().

> With new kernel new bpf mmap events get into perf.data and
> new perf picks them up.
> 
> Few more considerations:
> 
> I consider synthetic perf events to be non-ABI. Meaning they're
> emitted by perf user space into perf.data and there is a convention
> on names, but it's not a kernel abi. Like RECORD_MMAP with
> event.filename == "[module_name]" is an indication for perf report
> to parse elf/build-id of dso==module_name.
> There is no such support in the kernel. Kernel doesn't emit
> such events for module load/unload. If in the future
> we decide to extend kernel with such events they don't have
> to match what user space perf does today.

Right, that is another unfortunate state of affairs, kernel module
load/unload should already be supported, reported by the kernel via a
proper PERF_RECORD_MODULE_LOAD/UNLOAD
 
> Why this is important? To get to next step.
> As Arnaldo pointed out this patch set is missing support for
> JITed prog annotations and displaying asm code. Absolutely correct.
> This set only helps perf to reveal the names of bpf progs that _were_
> running at the time of perf record, but there is no way yet for
> perf report to show asm code of the prog that was running.
> In that sense bpf is drastically different from java, other jits
> and normal profiling.
> bpf JIT happens in the kernel and only kernel knows the mapping
> 

Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Alexei Starovoitov
On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > to having to cope with munmapping parts of existing mmaps, etc.
> > 
> > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> > now it would be used just in this clean case for undoing a
> > PERF_RECORD_MMAP for a BPF program.
> > 
> > The ABI is already complicated, starting to use something called
> > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> > I think.
> 
> Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> part was getting the perf tool to dtrt for that use-case. But if we need
> unmap events, doing the unmap record now is the right thing.

Thanks for the pointers!
The answer is a bit long. pls bear with me.

I have considered adding MUNMAP to match existing MMAP, but went
without it because I didn't want to introduce a new bit in perf_event_attr
and emit these new events in a misbalanced conditional way for prog load/unload.
Like old perf is asking kernel for mmap events via mmap bit, so prog load events
will be in perf.data, but old perf report won't recognize them anyway.
Whereas new perf would certainly want to catch bpf events and will set
both mmap and munmap bits.

Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
conditionally based on single mmap bit they will confuse old perf
and it will print warning about 'unknown events'.

Both situations are ugly, hence I went with reuse of MMAP event
for both load/unload.
In such case old perf silently ignores them. Which is what I wanted.
When we upgrade the kernel we cannot synchronize the kernel upgrade
(or downgrade) with user space perf package upgrade.
Hence not confusing old perf is important.
With new kernel new bpf mmap events get into perf.data and
new perf picks them up.

Few more considerations:

I consider synthetic perf events to be non-ABI. Meaning they're
emitted by perf user space into perf.data and there is a convention
on names, but it's not a kernel abi. Like RECORD_MMAP with
event.filename == "[module_name]" is an indication for perf report
to parse elf/build-id of dso==module_name.
There is no such support in the kernel. Kernel doesn't emit
such events for module load/unload. If in the future
we decide to extend kernel with such events they don't have
to match what user space perf does today.

Why this is important? To get to next step.
As Arnaldo pointed out this patch set is missing support for
JITed prog annotations and displaying asm code. Absolutely correct.
This set only helps perf to reveal the names of bpf progs that _were_
running at the time of perf record, but there is no way yet for
perf report to show asm code of the prog that was running.
In that sense bpf is drastically different from java, other jits
and normal profiling.
bpf JIT happens in the kernel and only kernel knows the mapping
between original source and JITed code.
In addition there are bpf2bpf functions. In the future there will
be bpf libraries, more type info, line number support, etc.
I strongly believe perf RECORD_* events should NOT care about
the development that happens on the bpf side.
The only thing kernel will be telling user space is that bpf prog
with prog_id=X was loaded.
Then based on prog_id the 'perf record' phase can query the kernel
for bpf related information. There is already a way to fetch
JITed image based on prog_id.
Then perf will emit synthetic RECORD_FOOBAR into perf.data
that will contain bpf related info (like complete JITed image)
and perf report can process it later and annotate things in UI.

It may seem that there is a race here.
Like when 'perf record' sees the 'bpf prog was loaded with prog_id=X' event
it will ask the kernel about prog_id=X, but that prog could be
unloaded already.
In such a case the prog_id will not exist and perf record can ignore the event.
So no race.
The kernel doesn't need to pass all information about bpf prog to
the user space via RECORD_*. Instead 'perf record' can emit
synthetic events into perf.data.
I was thinking to extend RECORD_MMAP with prog_id already
(instead of passing kallsyms's bpf prog name in event->mmap.filename)
but bpf functions don't have their own prog_id. Multiple bpf funcs
with different JITed blobs are considered to be a part of single prog_id.
So as a step one I'm only extending RECORD_MMAP with addr and kallsym
name of bpf function/prog.
As a step two the plan is to add notification mechanism for prog load/unload
that will include prog_id and design new synthetic RECORD_* events in
perf user space that will contain JITed code, line info, BTF, etc.

TLDR:
step 1 (this patch set)
Single bpf prog_load can call multiple
bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only
Similarly unload calls multiple
bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only

step 2 (future work)

Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> to having to cope with munmapping parts of existing mmaps, etc.
> 
> I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
> now it would be used just in this clean case for undoing a
> PERF_RECORD_MMAP for a BPF program.
> 
> The ABI is already complicated, starting to use something called
> PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
> I think.

Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
part was getting the perf tool to dtrt for that use-case. But if we need
unmap events, doing the unmap record now is the right thing.




Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +   unsigned long symbol_start, symbol_end;
> > +   /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +   char buf[sizeof(u64)] = {};
> > +
> > if (!bpf_prog_kallsyms_candidate(fp))
> > return;
> >  
> > spin_lock_bh(&bpf_lock);
> > bpf_prog_ksym_node_del(fp->aux);
> > spin_unlock_bh(&bpf_lock);
> > +   bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +   perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
> 
> It feels a bit weird to start issuing unmap events for this.

FWIW:

  
https://lkml.kernel.org/r/20170127130702.gi6...@twins.programming.kicks-ass.net

has talk of PERF_RECORD_MUNMAP


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Arnaldo Carvalho de Melo
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +   unsigned long symbol_start, symbol_end;
> > +   /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +   char buf[sizeof(u64)] = {};
> > +
> > if (!bpf_prog_kallsyms_candidate(fp))
> > return;
> >  
> > spin_lock_bh(&bpf_lock);
> > bpf_prog_ksym_node_del(fp->aux);
> > spin_unlock_bh(&bpf_lock);
> > +   bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +   perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +buf, sizeof(buf));
> >  }
> 
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
 
> It feels a bit weird to start issuing unmap events for this.

For reference, this surfaced here:

https://lkml.org/lkml/2017/1/27/452

Start of the thread, that involves postgresql, JIT, LLVM, perf is here:

https://lkml.org/lkml/2016/12/10/1

PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
to having to cope with munmapping parts of existing mmaps, etc.

I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for
now it would be used just in this clean case for undoing a
PERF_RECORD_MMAP for a BPF program.

The ABI is already complicated, starting to use something called
PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever,
I think.

- Arnaldo


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-20 Thread Peter Zijlstra
On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
>  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
>  {
> + unsigned long symbol_start, symbol_end;
> + /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> + char buf[sizeof(u64)] = {};
> +
>   if (!bpf_prog_kallsyms_candidate(fp))
>   return;
>  
>   spin_lock_bh(&bpf_lock);
>   bpf_prog_ksym_node_del(fp->aux);
>   spin_unlock_bh(&bpf_lock);
> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +  buf, sizeof(buf));
>  }

So perf doesn't normally issue unmap events. We've talked about doing
that, but so far it's never really been needed, I think.

It feels a bit weird to start issuing unmap events for this.


Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-19 Thread Song Liu



> On Sep 19, 2018, at 5:59 PM, Alexei Starovoitov  wrote:
> 
> On 9/19/18 4:44 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov  wrote:
>>> 
>>> use perf_event_mmap_bpf_prog() helper to notify user space
>>> about JITed bpf programs.
>>> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
>>> Use empty program name as unload indication.
>>> 
>>> Signed-off-by: Alexei Starovoitov 
>>> ---
>>> kernel/bpf/core.c | 22 ++++++++++++++++++--
>>> 1 file changed, 20 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 3f5bf1af0826..ddf11fdafd36 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>>> *symbol_end   = addr + hdr->pages * PAGE_SIZE;
>>> }
>>> 
>>> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> {
>>> const char *end = sym + KSYM_NAME_LEN;
>>> 
>>> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>>> sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>>> if (prog->aux->name[0])
>>> -   snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> +   sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> else
>>> *sym = 0;
>>> +   return sym;
>>> }
>>> 
>>> static __always_inline unsigned long
>>> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>>> 
>>> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>> {
>>> +   unsigned long symbol_start, symbol_end;
>>> +   char buf[KSYM_NAME_LEN], *sym;
>>> +
>>> if (!bpf_prog_kallsyms_candidate(fp) ||
>>> !capable(CAP_SYS_ADMIN))
>>> return;
>>> 
>>> +   bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
>>> +   sym = bpf_get_prog_name(fp, buf);
>>> +   sym++; /* sym - buf is the length of the name including trailing 0 */
>>> +   while (!IS_ALIGNED(sym - buf, sizeof(u64)))
>>> +   *sym++ = 0;
>> 
>> nit: This logic feels a little weird to me. How about we wrap the extra logic
>> in a separate function:
>> 
>> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)
> 
> probably bpf_get_prog_name_u64_padded() ?
> would be cleaner indeed.

bpf_get_prog_name_u64_padded does sound better. 

> Will send v2 once Peter and Arnaldo provide their feedback.
> 
> Thanks for reviewing!



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-19 Thread Alexei Starovoitov

On 9/19/18 4:44 PM, Song Liu wrote:




On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov  wrote:

use perf_event_mmap_bpf_prog() helper to notify user space
about JITed bpf programs.
Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
Use empty program name as unload indication.

Signed-off-by: Alexei Starovoitov 
---
kernel/bpf/core.c | 22 ++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3f5bf1af0826..ddf11fdafd36 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
*symbol_end   = addr + hdr->pages * PAGE_SIZE;
}

-static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
+static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
{
const char *end = sym + KSYM_NAME_LEN;

@@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
if (prog->aux->name[0])
-   snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
+   sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
else
*sym = 0;
+   return sym;
}

static __always_inline unsigned long
@@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)

void bpf_prog_kallsyms_add(struct bpf_prog *fp)
{
+   unsigned long symbol_start, symbol_end;
+   char buf[KSYM_NAME_LEN], *sym;
+
if (!bpf_prog_kallsyms_candidate(fp) ||
!capable(CAP_SYS_ADMIN))
return;

+   bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
+   sym = bpf_get_prog_name(fp, buf);
+   sym++; /* sym - buf is the length of the name including trailing 0 */
+   while (!IS_ALIGNED(sym - buf, sizeof(u64)))
+   *sym++ = 0;


nit: This logic feels a little weird to me. How about we wrap the extra logic
in a separate function:

size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)


probably bpf_get_prog_name_u64_padded() ?
would be cleaner indeed.

Will send v2 once Peter and Arnaldo provide their feedback.

Thanks for reviewing!



Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload

2018-09-19 Thread Song Liu



> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov  wrote:
> 
> use perf_event_mmap_bpf_prog() helper to notify user space
> about JITed bpf programs.
> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
> Use empty program name as unload indication.
> 
> Signed-off-by: Alexei Starovoitov 
> ---
> kernel/bpf/core.c | 22 ++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3f5bf1af0826..ddf11fdafd36 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>   *symbol_end   = addr + hdr->pages * PAGE_SIZE;
> }
> 
> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> {
>   const char *end = sym + KSYM_NAME_LEN;
> 
> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>   sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>   sym  = bin2hex(sym, prog->tag, sizeof(prog->tag));
>   if (prog->aux->name[0])
> - snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> + sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>   else
>   *sym = 0;
> + return sym;
> }
> 
> static __always_inline unsigned long
> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
> 
> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> {
> + unsigned long symbol_start, symbol_end;
> + char buf[KSYM_NAME_LEN], *sym;
> +
>   if (!bpf_prog_kallsyms_candidate(fp) ||
>   !capable(CAP_SYS_ADMIN))
>   return;
> 
> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> + sym = bpf_get_prog_name(fp, buf);
> + sym++; /* sym - buf is the length of the name including trailing 0 */
> + while (!IS_ALIGNED(sym - buf, sizeof(u64)))
> + *sym++ = 0;

nit: This logic feels a little weird to me. How about we wrap the extra logic
in a separate function:

> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)

where the return value is the u64 aligned size. 

Other than this 

Acked-by: Song Liu 

>   spin_lock_bh(&bpf_lock);
>   bpf_prog_ksym_node_add(fp->aux);
>   spin_unlock_bh(&bpf_lock);
> + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +  buf, sym - buf);
> }
> 
> void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> {
> + unsigned long symbol_start, symbol_end;
> + /* mmap_record.filename cannot be NULL and has to be u64 aligned */
> + char buf[sizeof(u64)] = {};
> +
>   if (!bpf_prog_kallsyms_candidate(fp))
>   return;
> 
>   spin_lock_bh(&bpf_lock);
>   bpf_prog_ksym_node_del(fp->aux);
>   spin_unlock_bh(&bpf_lock);
> + bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> + perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +  buf, sizeof(buf));
> }
> 
> static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
> -- 
> 2.17.1
>