Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Hi Arnaldo, On Wed, Oct 17, 2018 at 5:11 AM Arnaldo Carvalho de Melo wrote: > > Adding Alexey, Jiri and Namhyung as they worked/are working on > multithreading 'perf record'. I have read Alexey's work on enabling aio for perf-record (https://lkml.org/lkml/2018/10/15/169), but I feel it is not really related to this work. Did I miss anything here? For VIP events, I think we need more mmaps. Currently, the default setting uses one mmap per CPU. To capture VIP events, I think we need another mmap per CPU. The VIP events will be sent to the special mmap. Then, user space will process these events before the end of perf-record. Does this approach make sense? Thanks! Song
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On 10/17/18 2:31 PM, Arnaldo Carvalho de Melo wrote: > > Keep all that info in a file, as I described above. Or keep it for a > while, to give that thread in userspace time to get it and tell the > kernel that it can throw it away. Stashing by the kernel into a file is a huge headache, since the format of the file becomes kernel ABI. Plus why have it in two places (in a file and in normal kernel data structures)? > It may well be that most of the time the 'perf record' thread catching > those events picks that up and saves it in > /var/tmp/bcc/bpf_prog_BUILDID/ even before the program gets unloaded, > no? Whether perf user space stashes that info into perf.data as synthetic records or stashes it somewhere in /var/tmp/ sounds about equivalent to me. Both have their pros and cons. This is certainly a good topic to discuss further. But asking the kernel to keep JITed images and all relevant bpf data after a program has been unloaded sounds really scary to me. I'm struggling to think through the security implications of that. How long is the kernel supposed to keep that? Some timeout? As I explained already, the time it takes for perf to do a _single_ get_fd_by_id syscall when it sees RECORD_MMAP with prog_id is pretty instant. All other syscalls to grab the JITed image and everything else can be done later. The prog will not go away because perf will hold an fd. If the prog was somehow unloaded before perf could do get_fd_by_id, no one cares about such programs, since there is close to zero chance that this program was attached to anything and absolutely no chance that it ran.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Em Wed, Oct 17, 2018 at 07:08:37PM +, Alexei Starovoitov escreveu: > On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote: > > Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu: > >> On 10/17/18 8:09 AM, David Ahern wrote: > >>> On 10/16/18 11:43 PM, Song Liu wrote: > I agree that processing events while recording has significant overhead. > In this case, perf user space needs to know details about the jited > BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch > program > details via BPF_OBJ_GET_INFO_BY_FD). > > I have some idea on processing important data with relatively low > overhead. > Let me try to implement it. > > >>> > >>> As I understand it, you want this series: > >>> > >>> kernel: add event to perf buffer on bpf prog load > >>> > >>> userspace: perf reads the event and grabs information about the program > >>> from the fd > >>> > >>> Is that correct? > >>> > >>> Userspace is not awakened immediately when an event is added to the > >>> ring. It is awakened once the number of events crosses a watermark. That > >>> means there is an unknown - and potentially long - time window where the > >>> program can be unloaded before perf reads the event. > > > >>> So no matter what you do, expecting perf record to be able to process the > >>> event quickly is an unreasonable expectation. > > > >> yes... unless we go with the threaded model as Arnaldo suggested and use > >> single event as a watermark to wake up our perf thread. > >> In such a case there is still a race window between user space waking up > >> and doing a _single_ bpf_get_fd_from_id() call to hold that prog > >> and some other process trying to instantly unload the prog it > >> just loaded. 
> >> I think such race window is extremely tiny and if perf misses > >> those load/unload events it's a good thing, since there is no chance > >> that normal pmu event samples would be happening during prog execution. > >> The alternative approach with no race window at all is to burden kernel > >> RECORD_* events with _all_ information about bpf prog. Which is jited > >> addresses, jited image itself, info about all subprogs, info about line > >> info, all BTF data, etc. As I said earlier I'm strongly against such > >> RECORD_* bloating. > >> Instead we need to find a way to process new RECORD_BPF events with > >> single prog_id field in perf user space with minimal race > >> and threaded approach sounds like a win to me. > > There is another alternative, I think: put just a content based hash, > > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the > > validator does the jit, etc, it stashes the content that > > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by > > the kernel right after getting stuff from sys_bpf() and preparing it for > > use, then we know that in (start, end) we have blob foo with content id, > > that we will use to retrieve information that augments what we know with > > just (start, end, id) and allows annotation, etc. > > That stash space for jitted stuff gets garbage collected from time to > > time or is even completely disabled if the user is not interested in > > such augmentation, just like one can do disabling perf's ~/.debug/ > > directory hashed by build-id. > > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what > > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based > > cookie and we solve the other race we already have with kernel modules, > > DSOs, etc. > > I have mentioned this before, there were objections, perhaps this time I > > formulated in a different way that makes it more interesting? > that 'content based hash' we already have. 
It's called program tag. But that was calculated by whom? Userspace? It can't do that, it's the kernel that ultimately puts together, from what userspace gave it, what we need to do performance analysis, line numbers, etc. > and we already taught iovisor/bcc to stash that stuff into > /var/tmp/bcc/bpf_prog_TAG/ directory. > Unfortunately that approach didn't work. > JITed image only exists in the kernel. It's there only when That is why I said that _the_ kernel stashes that thing, not bcc/iovisor or perf, the kernel calculates the hash, and it also puts that into the PERF_RECORD_MMAP3, so the tool sees it and goes to get it from the place the kernel stashed it. > program is loaded and it's the one that matters the most for performance > analysis, since sample IPs are pointing into it. Agreed > Also the line info mapping that user space knows is not helping much > either, since verifier optimizes the instructions and then JIT > does more. The debug_info <-> JIT relationship must be preserved > by the kernel and returned to user space.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On 10/17/18 11:53 AM, Arnaldo Carvalho de Melo wrote: > Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu: >> On 10/17/18 8:09 AM, David Ahern wrote: >>> On 10/16/18 11:43 PM, Song Liu wrote: I agree that processing events while recording has significant overhead. In this case, perf user space needs to know details about the jited BPF program. It is impossible to pass all these details to user space through the relatively stable ring_buffer API. Therefore, some processing of the data is necessary (get bpf prog_id from ring buffer, and then fetch program details via BPF_OBJ_GET_INFO_BY_FD). I have some idea on processing important data with relatively low overhead. Let me try to implement it. >>> >>> As I understand it, you want this series: >>> >>> kernel: add event to perf buffer on bpf prog load >>> >>> userspace: perf reads the event and grabs information about the program >>> from the fd >>> >>> Is that correct? >>> >>> Userspace is not awakened immediately when an event is added to the >>> ring. It is awakened once the number of events crosses a watermark. That >>> means there is an unknown - and potentially long - time window where the >>> program can be unloaded before perf reads the event. > >>> So no matter what you do, expecting perf record to be able to process the >>> event quickly is an unreasonable expectation. > >> yes... unless we go with the threaded model as Arnaldo suggested and use >> single event as a watermark to wake up our perf thread. >> In such a case there is still a race window between user space waking up >> and doing a _single_ bpf_get_fd_from_id() call to hold that prog >> and some other process trying to instantly unload the prog it >> just loaded. >> I think such race window is extremely tiny and if perf misses >> those load/unload events it's a good thing, since there is no chance >> that normal pmu event samples would be happening during prog execution. 
> >> The alternative approach with no race window at all is to burden kernel >> RECORD_* events with _all_ information about bpf prog. Which is jited >> addresses, jited image itself, info about all subprogs, info about line >> info, all BTF data, etc. As I said earlier I'm strongly against such >> RECORD_* bloating. >> Instead we need to find a way to process new RECORD_BPF events with >> single prog_id field in perf user space with minimal race >> and threaded approach sounds like a win to me. > > There is another alternative, I think: put just a content based hash, > like a git commit id into a PERF_RECORD_MMAP3 new record, and when the > validator does the jit, etc, it stashes the content that > BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by > the kernel right after getting stuff from sys_bpf() and preparing it for > use, then we know that in (start, end) we have blob foo with content id, > that we will use to retrieve information that augments what we know with > just (start, end, id) and allows annotation, etc. > > That stash space for jitted stuff gets garbage collected from time to > time or is even completely disabled if the user is not interested in > such augmentation, just like one can do disabling perf's ~/.debug/ > directory hashed by build-id. > > I think with this we have no races, the PERF_RECORD_MMAP3 gets just what > is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based > cookie and we solve the other race we already have with kernel modules, > DSOs, etc. > > I have mentioned this before, there were objections, perhaps this time I > formulated in a different way that makes it more interesting? that 'content based hash' we already have. It's called program tag. and we already taught iovisor/bcc to stash that stuff into /var/tmp/bcc/bpf_prog_TAG/ directory. Unfortunately that approach didn't work. JITed image only exists in the kernel. 
It's there only when program is loaded and it's the one that matters the most for performance analysis, since sample IPs are pointing into it. Also the line info mapping that user space knows is not helping much either, since verifier optimizes the instructions and then JIT does more. The debug_info <-> JIT relationship must be preserved by the kernel and returned to user space. The only other non-race way is to keep all that info in the kernel after program is unloaded, but that is even worse than bloating RECORD*s
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Em Wed, Oct 17, 2018 at 04:36:08PM +, Alexei Starovoitov escreveu: > On 10/17/18 8:09 AM, David Ahern wrote: > > On 10/16/18 11:43 PM, Song Liu wrote: > >> I agree that processing events while recording has significant overhead. > >> In this case, perf user space needs to know details about the jited BPF > >> program. It is impossible to pass all these details to user space through > >> the relatively stable ring_buffer API. Therefore, some processing of the > >> data is necessary (get bpf prog_id from ring buffer, and then fetch program > >> details via BPF_OBJ_GET_INFO_BY_FD). > >> > >> I have some idea on processing important data with relatively low overhead. > >> Let me try to implement it. > >> > > > > As I understand it, you want this series: > > > > kernel: add event to perf buffer on bpf prog load > > > > userspace: perf reads the event and grabs information about the program > > from the fd > > > > Is that correct? > > > > Userspace is not awakened immediately when an event is added to the > > ring. It is awakened once the number of events crosses a watermark. That > > means there is an unknown - and potentially long - time window where the > > program can be unloaded before perf reads the event. > > So no matter what you do, expecting perf record to be able to process the > > event quickly is an unreasonable expectation. > yes... unless we go with the threaded model as Arnaldo suggested and use > single event as a watermark to wake up our perf thread. > In such a case there is still a race window between user space waking up > and doing a _single_ bpf_get_fd_from_id() call to hold that prog > and some other process trying to instantly unload the prog it > just loaded. > I think such race window is extremely tiny and if perf misses > those load/unload events it's a good thing, since there is no chance > that normal pmu event samples would be happening during prog execution. 
> The alternative approach with no race window at all is to burden kernel > RECORD_* events with _all_ information about bpf prog. Which is jited > addresses, jited image itself, info about all subprogs, info about line > info, all BTF data, etc. As I said earlier I'm strongly against such > RECORD_* bloating. > Instead we need to find a way to process new RECORD_BPF events with > single prog_id field in perf user space with minimal race > and threaded approach sounds like a win to me. There is another alternative, I think: put just a content based hash, like a git commit id into a PERF_RECORD_MMAP3 new record, and when the validator does the jit, etc, it stashes the content that BPF_OBJ_GET_INFO_BY_FD would get somewhere, some filesystem populated by the kernel right after getting stuff from sys_bpf() and preparing it for use, then we know that in (start, end) we have blob foo with content id, that we will use to retrieve information that augments what we know with just (start, end, id) and allows annotation, etc. That stash space for jitted stuff gets garbage collected from time to time or is even completely disabled if the user is not interested in such augmentation, just like one can do disabling perf's ~/.debug/ directory hashed by build-id. I think with this we have no races, the PERF_RECORD_MMAP3 gets just what is in PERF_RECORD_MMAP2 plus some extra 20 bytes for such content based cookie and we solve the other race we already have with kernel modules, DSOs, etc. I have mentioned this before, there were objections, perhaps this time I formulated in a different way that makes it more interesting? - Arnaldo
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On 10/17/18 8:09 AM, David Ahern wrote: > On 10/16/18 11:43 PM, Song Liu wrote: >> I agree that processing events while recording has significant overhead. >> In this case, perf user space needs to know details about the jited BPF >> program. It is impossible to pass all these details to user space through >> the relatively stable ring_buffer API. Therefore, some processing of the >> data is necessary (get bpf prog_id from ring buffer, and then fetch program >> details via BPF_OBJ_GET_INFO_BY_FD). >> >> I have some idea on processing important data with relatively low overhead. >> Let me try to implement it. >> > > As I understand it, you want this series: > > kernel: add event to perf buffer on bpf prog load > > userspace: perf reads the event and grabs information about the program > from the fd > > Is that correct? > > Userspace is not awakened immediately when an event is added to the > ring. It is awakened once the number of events crosses a watermark. That > means there is an unknown - and potentially long - time window where the > program can be unloaded before perf reads the event. > > So no matter what you do, expecting perf record to be able to process the > event quickly is an unreasonable expectation. yes... unless we go with the threaded model as Arnaldo suggested and use single event as a watermark to wake up our perf thread. In such a case there is still a race window between user space waking up and doing a _single_ bpf_get_fd_from_id() call to hold that prog and some other process trying to instantly unload the prog it just loaded. I think such race window is extremely tiny and if perf misses those load/unload events it's a good thing, since there is no chance that normal pmu event samples would be happening during prog execution. The alternative approach with no race window at all is to burden kernel RECORD_* events with _all_ information about bpf prog. 
Which is jited addresses, jited image itself, info about all subprogs, info about line info, all BTF data, etc. As I said earlier I'm strongly against such RECORD_* bloating. Instead we need to find a way to process new RECORD_BPF events with single prog_id field in perf user space with minimal race and threaded approach sounds like a win to me.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Hi David, On Wed, Oct 17, 2018 at 8:09 AM David Ahern wrote: > > On 10/16/18 11:43 PM, Song Liu wrote: > > I agree that processing events while recording has significant overhead. > > In this case, perf user space needs to know details about the jited BPF > > program. It is impossible to pass all these details to user space through > > the relatively stable ring_buffer API. Therefore, some processing of the > > data is necessary (get bpf prog_id from ring buffer, and then fetch program > > details via BPF_OBJ_GET_INFO_BY_FD). > > > > I have some idea on processing important data with relatively low overhead. > > Let me try to implement it. > > > > As I understand it, you want this series: > > kernel: add event to perf buffer on bpf prog load > > userspace: perf reads the event and grabs information about the program > from the fd > > Is that correct? Yes, this is correct. > > Userspace is not awakened immediately when an event is added to the > ring. It is awakened once the number of events crosses a watermark. That > means there is an unknown - and potentially long - time window where the > program can be unloaded before perf reads the event. > > So no matter what you do, expecting perf record to be able to process the > event quickly is an unreasonable expectation. In this case, we don't really need to gather the information immediately. We will lose information about some short-lived BPF programs. These BPF programs are less important for the performance analysis anyway. I guess the only missing case is when some BPF program gets loaded/unloaded many times, so each instance is short-lived, but the overall perf impact is significant. I think this case is not interesting either, as BPF programs should not be used like this (considering the kernel needs to translate and jit the program, unloading and reloading soon doesn't make any sense). Thanks, Song
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Wed, Oct 17, 2018 at 5:50 AM Arnaldo Carvalho de Melo wrote: > > Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu: > > Adding Alexey, Jiri and Namhyung as they worked/are working on > > multithreading 'perf record'. > > > > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern wrote: > > > > On 10/15/18 4:33 PM, Song Liu wrote: > > > > > I am working with Alexei on the idea of fetching BPF program > > > > > information via > > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > > > > > I found that perf will not process event until the end of perf-record: > > > > > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > > > .. 10 seconds later > > > > > [ perf record: Woken up 34 times to write data ] > > > > > machine__process_bpf_event: prog_id 6 loaded > > > > > machine__process_bpf_event: prog_id 6 unloaded > > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > > > > > In this example, the bpf program was loaded and then unloaded in > > > > > another terminal. When machine__process_bpf_event() processes > > > > > the load event, the bpf program is already unloaded. Therefore, > > > > > machine__process_bpf_event() will not be able to get information > > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > > > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > > > as soon as perf get the event from kernel. I looked around the perf > > > > > code for a while. But I haven't found a good example where some > > > > > events are processed before the end of perf-record. Could you > > > > > please help me with this? > > > > > > perf record does not process events as they are generated. 
Its sole job > > > > is pushing data from the maps to a file as fast as possible meaning in > > > > bulk based on current read and write locations. > > > > > > Adding code to process events will add significant overhead to the > > > > record command and will not really solve your race problem. > > > I agree that processing events while recording has significant overhead. > > > In this case, perf user space needs to know details about the jited BPF > > > program. It is impossible to pass all these details to user space through > > > the relatively stable ring_buffer API. Therefore, some processing of the > > > data is necessary (get bpf prog_id from ring buffer, and then fetch > > > program > > > details via BPF_OBJ_GET_INFO_BY_FD). > > > > > I have some idea on processing important data with relatively low > > > overhead. > > > Let me try to implement it. > > > > Well, you could have a separate thread processing just those kinds of > > events, associate it with a dummy event where you only ask for > > PERF_RECORD_BPF_EVENTs. > > > > Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY > > perf_event_attr: > > > > [root@seventh ~]# perf record -vv -e dummy sleep 01 > > > > perf_event_attr: > > type 1 > > size 112 > > config 0x9 > > { sample_period, sample_freq } 4000 > > sample_type IP|TID|TIME|PERIOD > > disabled 1 > > inherit 1 > > These you would have disabled, no need for > PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT > > mmap 1 > > comm 1 > > task 1 > > mmap2 1 > > comm_exec 1 > > Thanks Arnaldo! This looks better than my original idea (using POLLPRI to highlight special events). I will try to implement the BPF_EVENT in this direction. Song
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On 10/16/18 11:43 PM, Song Liu wrote: > I agree that processing events while recording has significant overhead. > In this case, perf user space needs to know details about the jited BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch program > details via BPF_OBJ_GET_INFO_BY_FD). > > I have some idea on processing important data with relatively low overhead. > Let me try to implement it. > As I understand it, you want this series: kernel: add event to perf buffer on bpf prog load userspace: perf reads the event and grabs information about the program from the fd Is that correct? Userspace is not awakened immediately when an event is added to the ring. It is awakened once the number of events crosses a watermark. That means there is an unknown - and potentially long - time window where the program can be unloaded before perf reads the event. So no matter what you do, expecting perf record to be able to process the event quickly is an unreasonable expectation.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Em Wed, Oct 17, 2018 at 09:11:40AM -0300, Arnaldo Carvalho de Melo escreveu: > Adding Alexey, Jiri and Namhyung as they worked/are working on > multithreading 'perf record'. > > Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > > On Tue, Oct 16, 2018 at 4:43 PM David Ahern wrote: > > > On 10/15/18 4:33 PM, Song Liu wrote: > > > > I am working with Alexei on the idea of fetching BPF program > > > > information via > > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > > > I found that perf will not process event until the end of perf-record: > > > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > > .. 10 seconds later > > > > [ perf record: Woken up 34 times to write data ] > > > > machine__process_bpf_event: prog_id 6 loaded > > > > machine__process_bpf_event: prog_id 6 unloaded > > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > > > In this example, the bpf program was loaded and then unloaded in > > > > another terminal. When machine__process_bpf_event() processes > > > > the load event, the bpf program is already unloaded. Therefore, > > > > machine__process_bpf_event() will not be able to get information > > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > > as soon as perf get the event from kernel. I looked around the perf > > > > code for a while. But I haven't found a good example where some > > > > events are processed before the end of perf-record. Could you > > > > please help me with this? > > > > perf record does not process events as they are generated. Its sole job > > > is pushing data from the maps to a file as fast as possible meaning in > > > bulk based on current read and write locations. 
> > > > Adding code to process events will add significant overhead to the > > > record command and will not really solve your race problem. > > > I agree that processing events while recording has significant overhead. > > In this case, perf user space needs to know details about the jited BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch program > details via BPF_OBJ_GET_INFO_BY_FD). > > > I have some idea on processing important data with relatively low overhead. > > Let me try to implement it. > > Well, you could have a separate thread processing just those kinds of > events, associate it with a dummy event where you only ask for > PERF_RECORD_BPF_EVENTs. > > Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY > perf_event_attr: > > [root@seventh ~]# perf record -vv -e dummy sleep 01 > > perf_event_attr: > type 1 > size 112 > config 0x9 > { sample_period, sample_freq } 4000 > sample_type IP|TID|TIME|PERIOD > disabled 1 > inherit 1 These you would have disabled, no need for PERF_RECORD_{MMAP*,COMM,FORK,EXIT} just PERF_RECORD_BPF_EVENT > mmap 1 > comm 1 > task 1 > mmap2 1 > comm_exec 1 > freq 1 > enable_on_exec 1 > sample_id_all 1 > exclude_guest 1 > > sys_perf_event_open: pid 12046 cpu 0 group_fd -1 flags 0x8 = 4 > sys_perf_event_open: pid 12046 cpu 1 group_fd -1 flags 0x8 = 5 > sys_perf_event_open: pid 12046 cpu 2 group_fd -1 flags 0x8 = 6 > sys_perf_event_open: pid 12046 cpu 3 group_fd -1 flags 0x8 = 8 > mmap size 528384B > perf event ring buffer mmapped per cpu > Synthesizing TSC conversion information > [ perf record: Woken up 1 times to write data ] > [ perf record: Captured and wrote 0.014 MB perf.data ] > [root@seventh ~]# > > [root@seventh ~]# perf evlist -v > dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, > sample_type: 
IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, > freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, > mmap2: 1, comm_exec: 1 > [root@seventh ~]# > > There is work ongoing in dumping one file per cpu and then, at post > processing time merging all those files to get ordering, so one more > file, for these VIP events, that require per-event processing would be > ordered at that time with all the other per-cpu files. > > - Arnaldo
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Adding Alexey, Jiri and Namhyung as they worked/are working on multithreading 'perf record'. Em Tue, Oct 16, 2018 at 11:43:11PM -0700, Song Liu escreveu: > On Tue, Oct 16, 2018 at 4:43 PM David Ahern wrote: > > On 10/15/18 4:33 PM, Song Liu wrote: > > > I am working with Alexei on the idea of fetching BPF program information > > > via > > > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > > > to perf_event_type, and dumped these events to perf event ring buffer. > > > I found that perf will not process event until the end of perf-record: > > > root@virt-test:~# ~/perf record -ag -- sleep 10 > > > .. 10 seconds later > > > [ perf record: Woken up 34 times to write data ] > > > machine__process_bpf_event: prog_id 6 loaded > > > machine__process_bpf_event: prog_id 6 unloaded > > > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > > In this example, the bpf program was loaded and then unloaded in > > > another terminal. When machine__process_bpf_event() processes > > > the load event, the bpf program is already unloaded. Therefore, > > > machine__process_bpf_event() will not be able to get information > > > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > > > as soon as perf get the event from kernel. I looked around the perf > > > code for a while. But I haven't found a good example where some > > > events are processed before the end of perf-record. Could you > > > please help me with this? > > perf record does not process events as they are generated. Its sole job > > is pushing data from the maps to a file as fast as possible meaning in > > bulk based on current read and write locations. > > Adding code to process events will add significant overhead to the > > record command and will not really solve your race problem. > I agree that processing events while recording has significant overhead. 
> In this case, perf user space needs to know details about the jited BPF > program. It is impossible to pass all these details to user space through > the relatively stable ring_buffer API. Therefore, some processing of the > data is necessary (get bpf prog_id from ring buffer, and then fetch program > details via BPF_OBJ_GET_INFO_BY_FD). > I have some idea on processing important data with relatively low overhead. > Let me try to implement it. Well, you could have a separate thread processing just those kinds of events, associate it with a dummy event where you only ask for PERF_RECORD_BPF_EVENTs. Here is how to set up the PERF_TYPE_SOFTWARE/PERF_COUNT_SW_DUMMY perf_event_attr: [root@seventh ~]# perf record -vv -e dummy sleep 01 perf_event_attr: type 1 size 112 config 0x9 { sample_period, sample_freq } 4000 sample_type IP|TID|TIME|PERIOD disabled 1 inherit 1 mmap 1 comm 1 freq 1 enable_on_exec 1 task 1 sample_id_all 1 exclude_guest 1 mmap2 1 comm_exec 1 sys_perf_event_open: pid 12046 cpu 0 group_fd -1 flags 0x8 = 4 sys_perf_event_open: pid 12046 cpu 1 group_fd -1 flags 0x8 = 5 sys_perf_event_open: pid 12046 cpu 2 group_fd -1 flags 0x8 = 6 sys_perf_event_open: pid 12046 cpu 3 group_fd -1 flags 0x8 = 8 mmap size 528384B perf event ring buffer mmapped per cpu Synthesizing TSC conversion information [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.014 MB perf.data ] [root@seventh ~]# [root@seventh ~]# perf evlist -v dummy: type: 1, size: 112, config: 0x9, { sample_period, sample_freq }: 4000, sample_type: IP|TID|TIME|PERIOD, disabled: 1, inherit: 1, mmap: 1, comm: 1, freq: 1, enable_on_exec: 1, task: 1, sample_id_all: 1, exclude_guest: 1, mmap2: 1, comm_exec: 1 [root@seventh ~]# There is work ongoing in dumping one file per cpu and then, at post processing time merging all those files to get ordering, so one more file, for these VIP events, that require per-event processing would be ordered at that time with all the other per-cpu files. 
- Arnaldo
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Hi David,

On Tue, Oct 16, 2018 at 4:43 PM David Ahern wrote:
>
> On 10/15/18 4:33 PM, Song Liu wrote:
> > I am working with Alexei on the idea of fetching BPF program information via
> > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT
> > to perf_event_type, and dumped these events to perf event ring buffer.
> >
> > I found that perf will not process event until the end of perf-record:
> >
> > root@virt-test:~# ~/perf record -ag -- sleep 10
> > .. 10 seconds later
> > [ perf record: Woken up 34 times to write data ]
> > machine__process_bpf_event: prog_id 6 loaded
> > machine__process_bpf_event: prog_id 6 unloaded
> > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ]
> >
> > In this example, the bpf program was loaded and then unloaded in
> > another terminal. When machine__process_bpf_event() processes
> > the load event, the bpf program is already unloaded. Therefore,
> > machine__process_bpf_event() will not be able to get information
> > about the program via BPF_OBJ_GET_INFO_BY_FD cmd.
> >
> > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD
> > as soon as perf gets the event from the kernel. I looked around the perf
> > code for a while. But I haven't found a good example where some
> > events are processed before the end of perf-record. Could you
> > please help me with this?
>
> perf record does not process events as they are generated. Its sole job
> is pushing data from the maps to a file as fast as possible meaning in
> bulk based on current read and write locations.
>
> Adding code to process events will add significant overhead to the
> record command and will not really solve your race problem.

Thanks for the comment. I agree that processing events while recording has significant overhead. In this case, perf user space needs to know details about the JITed BPF program. It is impossible to pass all these details to user space through the relatively stable ring_buffer API.
Therefore, some processing of the data is necessary (get the bpf prog_id from the ring buffer, and then fetch program details via BPF_OBJ_GET_INFO_BY_FD).

I have some ideas on processing important data with relatively low overhead. Let me try to implement it.

Thanks again,
Song
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On 10/15/18 4:33 PM, Song Liu wrote: > I am working with Alexei on the idea of fetching BPF program information via > BPF_OBJ_GET_INFO_BY_FD cmd. I added PERF_RECORD_BPF_EVENT > to perf_event_type, and dumped these events to perf event ring buffer. > > I found that perf will not process event until the end of perf-record: > > root@virt-test:~# ~/perf record -ag -- sleep 10 > .. 10 seconds later > [ perf record: Woken up 34 times to write data ] > machine__process_bpf_event: prog_id 6 loaded > machine__process_bpf_event: prog_id 6 unloaded > [ perf record: Captured and wrote 9.337 MB perf.data (93178 samples) ] > > In this example, the bpf program was loaded and then unloaded in > another terminal. When machine__process_bpf_event() processes > the load event, the bpf program is already unloaded. Therefore, > machine__process_bpf_event() will not be able to get information > about the program via BPF_OBJ_GET_INFO_BY_FD cmd. > > To solve this problem, we will need to run BPF_OBJ_GET_INFO_BY_FD > as soon as perf get the event from kernel. I looked around the perf > code for a while. But I haven't found a good example where some > events are processed before the end of perf-record. Could you > please help me with this? perf record does not process events as they are generated. Its sole job is pushing data from the maps to a file as fast as possible meaning in bulk based on current read and write locations. Adding code to process events will add significant overhead to the record command and will not really solve your race problem.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 3:15 PM Alexei Starovoitov wrote: > > On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote: > > > > > I have considered adding MUNMAP to match existing MMAP, but went > > > without it because I didn't want to introduce new bit in perf_event_attr > > > and emit these new events in a misbalanced conditional way for prog > > > load/unload. > > > Like old perf is asking kernel for mmap events via mmap bit, so prog load > > > events > > > > By prog load events you mean that old perf, having perf_event_attr.mmap = 1 > > || > > perf_event_attr.mmap2 = 1 will cause the new kernel to emit > > PERF_RECORD_MMAP records for the range of addresses that a BPF program > > is being loaded on, right? > > right. it would be weird when prog load events are there, but not unload. > > > > will be in perf.data, but old perf report won't recognize them anyway. > > > > Why not? It should lookup the symbol and find it in the rb_tree of maps, > > with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the > > BPF core, no? It'll be an unresolved symbol, but a resolved map. > > > > > Whereas new perf would certainly want to catch bpf events and will set > > > both mmap and mumap bits. > > > > new perf with your code will find a symbol, not a map, because your code > > catches a special case PERF_RECORD_MMAP and instead of creating a > > 'struct map' will create a 'struct symbol' and insert it in the kallsyms > > 'struct map', right? > > right. > bpf progs are more similar to kernel functions than to modules. > For modules it makes sense to create a new map and insert symbols into it. > For bpf JITed images there is no DSO to parse. > Single bpf elf file may contain multiple bpf progs and each prog may contain > multiple bpf functions. They will be loaded at different time and > will have different life time. 
> > > In theory the old perf should catch the PERF_RECORD_MMAP with a string > > in the filename part and insert a new map into the kernel mmap rb_tree, > > and then samples would be resolved to this map, but since there is no > > backing DSO with a symtab, it would stop at that, just stating that the > > map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly > > there is something in there that makes it ignore this PERF_RECORD_MMAP > > emitted by the BPF kernel code when loading a new program. > > In /proc/kcore there is already a section for module range. > Hence when perf processes bpf load/unload events the map is already created. > Therefore the patch 3 only searches for it and inserts new symbol into it. > > In that sense the reuse of RECORD_MMAP event for bpf progs is indeed > not exactly clean, since no new map is created. > It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events ? > > Such event potentially can be used for offline ksym resolution. > perf could process /proc/kallsyms during perf record and emit all of them > as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run > on a different server and still find the right symbols. > > I guess, we can do bpf specific events too and keep RECORD_MMAP as-is. > How about single PERF_RECORD_BPF event with internal flag for load/unload ? > > > Right, that is another unfortunate state of affairs, kernel module > > load/unload should already be supported, reported by the kernel via a > > proper PERF_RECORD_MODULE_LOAD/UNLOAD > > I agree with Peter here. It would nice, but low priority. > modules are mostly static. Loaded once and stay there. 
> > > There is another longstanding TODO list entry: PERF_RECORD_MMAP records > > should include a build-id, to avoid either userspace getting confused > > when there is an update of some mmap DSO, for long running sessions, for > > instance, or to have to scan the just recorded perf.data file for DSOs > > with samples to then read it from the file system (more races). > > > > Have you ever considered having a build-id for bpf objects that could be > > used here? > > build-id concept is not applicable to bpf. > bpf elf files on the disc don't have good correlation with what is > running in the kernel. bpf bytestream is converted and optimized > by the verifier. Then JITed. > So debug info left in .o file and original bpf bytestream in .o are > mostly useless. > For bpf programs we have 'program tag'. It is computed over original > bpf bytestream, so both kernel and user space can compute it. > In libbcc we use /var/tmp/bcc/bpf_prog_TAG/ directory to store original > source code of the program, so users looking at kernel stack traces > with bpf_prog_TAG can find the source. > It's similar to build-id, but not going to help perf to annotate > actual x86 instructions inside JITed image and show src code. > Since JIT runs in the kernel this problem cannot be solved by user space only. > It's a difficult problem and we have a plan to tackle that, > but it's step 2. A bunch of infra is needed on bpf side to > preserve the
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
>
> > I have considered adding MUNMAP to match existing MMAP, but went
> > without it because I didn't want to introduce new bit in perf_event_attr
> > and emit these new events in a misbalanced conditional way for prog
> > load/unload.
> > Like old perf is asking kernel for mmap events via mmap bit, so prog load
> > events
>
> By prog load events you mean that old perf, having perf_event_attr.mmap = 1 ||
> perf_event_attr.mmap2 = 1 will cause the new kernel to emit
> PERF_RECORD_MMAP records for the range of addresses that a BPF program
> is being loaded on, right?

right. it would be weird when prog load events are there, but not unload.

> > will be in perf.data, but old perf report won't recognize them anyway.
>
> Why not? It should look up the symbol and find it in the rb_tree of maps,
> with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the
> BPF core, no? It'll be an unresolved symbol, but a resolved map.
>
> > Whereas new perf would certainly want to catch bpf events and will set
> > both mmap and munmap bits.
>
> new perf with your code will find a symbol, not a map, because your code
> catches a special case PERF_RECORD_MMAP and instead of creating a
> 'struct map' will create a 'struct symbol' and insert it in the kallsyms
> 'struct map', right?

right. bpf progs are more similar to kernel functions than to modules. For modules it makes sense to create a new map and insert symbols into it. For bpf JITed images there is no DSO to parse. A single bpf elf file may contain multiple bpf progs and each prog may contain multiple bpf functions. They will be loaded at different times and will have different lifetimes.
> In theory the old perf should catch the PERF_RECORD_MMAP with a string
> in the filename part and insert a new map into the kernel mmap rb_tree,
> and then samples would be resolved to this map, but since there is no
> backing DSO with a symtab, it would stop at that, just stating that the
> map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly
> there is something in there that makes it ignore this PERF_RECORD_MMAP
> emitted by the BPF kernel code when loading a new program.

In /proc/kcore there is already a section for the module range. Hence when perf processes bpf load/unload events the map is already created. Therefore patch 3 only searches for it and inserts a new symbol into it.

In that sense the reuse of the RECORD_MMAP event for bpf progs is indeed not exactly clean, since no new map is created. It's probably better to introduce PERF_RECORD_[INSERT|ERASE]_KSYM events?

Such an event potentially can be used for offline ksym resolution. perf could process /proc/kallsyms during perf record and emit all of them as synthetic PERF_RECORD_INSERT_KSYM into perf.data, so perf report can run on a different server and still find the right symbols.

I guess we can do bpf-specific events too and keep RECORD_MMAP as-is. How about a single PERF_RECORD_BPF event with an internal flag for load/unload?

> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

I agree with Peter here. It would be nice, but low priority. modules are mostly static. Loaded once and stay there.

> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id, to avoid either userspace getting confused
> when there is an update of some mmap DSO, for long running sessions, for
> instance, or to have to scan the just recorded perf.data file for DSOs
> with samples to then read it from the file system (more races).
> Have you ever considered having a build-id for bpf objects that could be
> used here?

The build-id concept is not applicable to bpf. bpf elf files on disk don't have good correlation with what is running in the kernel. The bpf bytestream is converted and optimized by the verifier. Then JITed. So the debug info left in the .o file and the original bpf bytestream in the .o are mostly useless.

For bpf programs we have a 'program tag'. It is computed over the original bpf bytestream, so both kernel and user space can compute it. In libbcc we use the /var/tmp/bcc/bpf_prog_TAG/ directory to store the original source code of the program, so users looking at kernel stack traces with bpf_prog_TAG can find the source. It's similar to build-id, but not going to help perf annotate actual x86 instructions inside the JITed image and show src code.

Since the JIT runs in the kernel this problem cannot be solved by user space only. It's a difficult problem and we have a plan to tackle that, but it's step 2. A bunch of infra is needed on the bpf side to preserve the association during src_in_C -> original bpf insns -> translated bpf insns -> JITed asm. Then report it back to user space and then teach perf to properly annotate progs.

> Will the JITed code from some BPF bytecode be different if the same
>
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> There is another longstanding TODO list entry: PERF_RECORD_MMAP records
> should include a build-id

I thought the problem was that the kernel doesn't have the build-id in the first place. So it cannot hand them out.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Fri, Sep 21, 2018 at 09:25:00AM -0300, Arnaldo Carvalho de Melo wrote:
> > I consider synthetic perf events to be non-ABI. Meaning they're
> > emitted by perf user space into perf.data and there is a convention
> > on names, but it's not a kernel abi. Like RECORD_MMAP with
> > event.filename == "[module_name]" is an indication for perf report
> > to parse elf/build-id of dso==module_name.
> > There is no such support in the kernel. Kernel doesn't emit
> > such events for module load/unload. If in the future
> > we decide to extend kernel with such events they don't have
> > to match what user space perf does today.
>
> Right, that is another unfortunate state of affairs, kernel module
> load/unload should already be supported, reported by the kernel via a
> proper PERF_RECORD_MODULE_LOAD/UNLOAD

Just wondering, is anyone actually doing enough module loading for this to matter? (asks the CONFIG_MODULES=n guy). I thought that was all a relatively static affair; you boot, you get a few modules loaded for the present hardware, the end.

Anyway, no real objection, just wondering if it's worth it.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Em Thu, Sep 20, 2018 at 08:14:46PM -0700, Alexei Starovoitov escreveu: > On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote: > > On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote: > > > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due > > > to having to cope with munmapping parts of existing mmaps, etc. > > > > > > I'm still more in favour of introduce PERF_RECORD_MUNMAP, even if for > > > now it would be used just in this clean case for undoing a > > > PERF_RECORD_MMAP for a BPF program. > > > > > > The ABI is already complicated, starting to use something called > > > PERF_RECORD_MMAP for unmmaping by just using a NULL name... too clever, > > > I think. > > > > Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult > > part was getting the perf tool to dtrt for that use-case. But if we need > > unmap events, doing the unmap record now is the right thing. > > Thanks for the pointers! > The answer is a bit long. pls bear with me. Ditto with me :-) > I have considered adding MUNMAP to match existing MMAP, but went > without it because I didn't want to introduce new bit in perf_event_attr > and emit these new events in a misbalanced conditional way for prog > load/unload. > Like old perf is asking kernel for mmap events via mmap bit, so prog load > events By prog load events you mean that old perf, having perf_event_attr.mmap = 1 || perf_event_attr.mmap2 = 1 will cause the new kernel to emit PERF_RECORD_MMAP records for the range of addresses that a BPF program is being loaded on, right? > will be in perf.data, but old perf report won't recognize them anyway. Why not? It should lookup the symbol and find it in the rb_tree of maps, with a DSO name equal to what was in the PERF_RECORD_MMAP emitted by the BPF core, no? It'll be an unresolved symbol, but a resolved map. > Whereas new perf would certainly want to catch bpf events and will set > both mmap and mumap bits. 
new perf with your code will find a symbol, not a map, because your code catches a special case PERF_RECORD_MMAP and instead of creating a 'struct map' will create a 'struct symbol' and insert it in the kallsyms 'struct map', right?

> Then if I add MUNMAP event without new bit and emit MMAP/MUNMAP
> conditionally based on single mmap bit they will confuse old perf
> and it will print warning about 'unknown events'.

That is unfortunate and I'll turn that part into a pr_debug()

> Both situations are ugly, hence I went with reuse of MMAP event
> for both load/unload.

So, it's doubly odd, i.e. MMAP used for mmap() and for munmap(), and the effect in the tooling is not to create or remove a 'struct map', but to alter an existing symbol table for the kallsyms map.

> In such case old perf silently ignores them. Which is what I wanted.

In theory the old perf should catch the PERF_RECORD_MMAP with a string in the filename part and insert a new map into the kernel mmap rb_tree, and then samples would be resolved to this map, but since there is no backing DSO with a symtab, it would stop at that, just stating that the map is called NAME-OF-BPF-PROGRAM. This is all from memory, possibly there is something in there that makes it ignore this PERF_RECORD_MMAP emitted by the BPF kernel code when loading a new program.

> When we upgrade the kernel we cannot synchronize the kernel upgrade
> (or downgrade) with user space perf package upgrade.

sure

> Hence not confusing old perf is important.

Thanks for trying to achieve that, and it's a pity that that "unknown record" message is a pr_warning or pr_info and not a pr_debug().

> With new kernel new bpf mmap events get into perf.data and
> new perf picks them up.
>
> Few more considerations:
>
> I consider synthetic perf events to be non-ABI. Meaning they're
> emitted by perf user space into perf.data and there is a convention
> on names, but it's not a kernel abi. Like RECORD_MMAP with
> event.filename == "[module_name]" is an indication for perf report
> to parse elf/build-id of dso==module_name.
> There is no such support in the kernel. Kernel doesn't emit
> such events for module load/unload. If in the future
> we decide to extend kernel with such events they don't have
> to match what user space perf does today.

Right, that is another unfortunate state of affairs, kernel module load/unload should already be supported, reported by the kernel via a proper PERF_RECORD_MODULE_LOAD/UNLOAD

> Why is this important? To get to the next step.
> As Arnaldo pointed out this patch set is missing support for
> JITed prog annotations and displaying asm code.

Absolutely correct.

> This set only helps perf to reveal the names of bpf progs that _were_
> running at the time of perf record, but there is no way yet for
> perf report to show asm code of the prog that was running.
> In that sense bpf is drastically different from java, other jits
> and normal profiling.
> bpf JIT happens in the kernel and only kernel knows the mapping
>
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 03:56:51PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> > PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> > to having to cope with munmapping parts of existing mmaps, etc.
> >
> > I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
> > now it would be used just in this clean case for undoing a
> > PERF_RECORD_MMAP for a BPF program.
> >
> > The ABI is already complicated, starting to use something called
> > PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
> > I think.
>
> Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult
> part was getting the perf tool to dtrt for that use-case. But if we need
> unmap events, doing the unmap record now is the right thing.

Thanks for the pointers! The answer is a bit long. pls bear with me.

I have considered adding MUNMAP to match the existing MMAP, but went without it because I didn't want to introduce a new bit in perf_event_attr and emit these new events in a misbalanced conditional way for prog load/unload. Like old perf is asking the kernel for mmap events via the mmap bit, so prog load events will be in perf.data, but old perf report won't recognize them anyway. Whereas new perf would certainly want to catch bpf events and will set both mmap and munmap bits.

Then if I add a MUNMAP event without a new bit and emit MMAP/MUNMAP conditionally based on the single mmap bit they will confuse old perf and it will print warnings about 'unknown events'. Both situations are ugly, hence I went with reuse of the MMAP event for both load/unload. In such case old perf silently ignores them. Which is what I wanted.

When we upgrade the kernel we cannot synchronize the kernel upgrade (or downgrade) with the user space perf package upgrade. Hence not confusing old perf is important. With a new kernel, new bpf mmap events get into perf.data and new perf picks them up.
Few more considerations: I consider synthetic perf events to be non-ABI. Meaning they're emitted by perf user space into perf.data and there is a convention on names, but it's not a kernel abi. Like RECORD_MMAP with event.filename == "[module_name]" is an indication for perf report to parse elf/build-id of dso==module_name. There is no such support in the kernel. Kernel doesn't emit such events for module load/unload. If in the future we decide to extend kernel with such events they don't have to match what user space perf does today. Why this is important? To get to next step. As Arnaldo pointed out this patch set is missing support for JITed prog annotations and displaying asm code. Absolutely correct. This set only helps perf to reveal the names of bpf progs that _were_ running at the time of perf record, but there is no way yet for perf report to show asm code of the prog that was running. In that sense bpf is drastically different from java, other jits and normal profiling. bpf JIT happens in the kernel and only kernel knows the mapping between original source and JITed code. In addition there are bpf2bpf functions. In the future there will be bpf libraries, more type info, line number support, etc. I strongly believe perf RECORD_* events should NOT care about the development that happens on the bpf side. The only thing kernel will be telling user space is that bpf prog with prog_id=X was loaded. Then based on prog_id the 'perf record' phase can query the kernel for bpf related information. There is already a way to fetch JITed image based on prog_id. Then perf will emit synthetic RECORD_FOOBAR into perf.data that will contain bpf related info (like complete JITed image) and perf report can process it later and annotate things in UI. It may seem that there is a race here. Like when 'perf record' see 'bpf prog was loaded with prog_id=X' event it will ask the kernel about prog_id=X, but that prog could be unloaded already. 
In such case prog_id will not exist and perf record can ignore such event. So no race. The kernel doesn't need to pass all information about bpf prog to the user space via RECORD_*. Instead 'perf record' can emit synthetic events into perf.data. I was thinking to extend RECORD_MMAP with prog_id already (instead of passing kallsyms's bpf prog name in event->mmap.filename) but bpf functions don't have their own prog_id. Multiple bpf funcs with different JITed blobs are considered to be a part of single prog_id. So as a step one I'm only extending RECORD_MMAP with addr and kallsym name of bpf function/prog. As a step two the plan is to add notification mechanism for prog load/unload that will include prog_id and design new synthetic RECORD_* events in perf user space that will contain JITed code, line info, BTF, etc. TLDR: step 1 (this patch set) Single bpf prog_load can call multiple bpf_prog_kallsyms_add() -> RECORD_MMAP with addr+kallsym only Similarly unload calls multiple bpf_prog_kallsyms_del() -> RECORD_MMAP with addr only step 2 (future work)
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 10:25:45AM -0300, Arnaldo Carvalho de Melo wrote:
> PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due
> to having to cope with munmapping parts of existing mmaps, etc.
>
> I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for
> now it would be used just in this clean case for undoing a
> PERF_RECORD_MMAP for a BPF program.
>
> The ABI is already complicated, starting to use something called
> PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever,
> I think.

Agreed, the PERF_RECORD_MUNMAP patch was fairly trivial, the difficult part was getting the perf tool to dtrt for that use-case. But if we need unmap events, doing the unmap record now is the right thing.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +	unsigned long symbol_start, symbol_end;
> > +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +	char buf[sizeof(u64)] = {};
> > +
> >  	if (!bpf_prog_kallsyms_candidate(fp))
> >  		return;
> >
> >  	spin_lock_bh(&bpf_lock);
> >  	bpf_prog_ksym_node_del(fp->aux);
> >  	spin_unlock_bh(&bpf_lock);
> > +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +				 buf, sizeof(buf));
> >  }
>
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
>
> It feels a bit weird to start issuing unmap events for this.

FWIW: https://lkml.kernel.org/r/20170127130702.gi6...@twins.programming.kicks-ass.net
has talk of PERF_RECORD_MUNMAP
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
Em Thu, Sep 20, 2018 at 10:44:24AM +0200, Peter Zijlstra escreveu:
> On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
> >  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> >  {
> > +	unsigned long symbol_start, symbol_end;
> > +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> > +	char buf[sizeof(u64)] = {};
> > +
> >  	if (!bpf_prog_kallsyms_candidate(fp))
> >  		return;
> >
> >  	spin_lock_bh(&bpf_lock);
> >  	bpf_prog_ksym_node_del(fp->aux);
> >  	spin_unlock_bh(&bpf_lock);
> > +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> > +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> > +				 buf, sizeof(buf));
> >  }
>
> So perf doesn't normally issue unmap events.. We've talked about doing
> that, but so far it's never really been needed, I think.
>
> It feels a bit weird to start issuing unmap events for this.

For reference, this surfaced here: https://lkml.org/lkml/2017/1/27/452

The start of the thread, which involves postgresql, JIT, LLVM and perf, is here: https://lkml.org/lkml/2016/12/10/1

PeterZ provided a patch introducing PERF_RECORD_MUNMAP, went nowhere due to having to cope with munmapping parts of existing mmaps, etc.

I'm still more in favour of introducing PERF_RECORD_MUNMAP, even if for now it would be used just in this clean case for undoing a PERF_RECORD_MMAP for a BPF program.

The ABI is already complicated, starting to use something called PERF_RECORD_MMAP for unmapping by just using a NULL name... too clever, I think.

- Arnaldo
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
On Wed, Sep 19, 2018 at 03:39:34PM -0700, Alexei Starovoitov wrote:
>  void bpf_prog_kallsyms_del(struct bpf_prog *fp)
>  {
> +	unsigned long symbol_start, symbol_end;
> +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> +	char buf[sizeof(u64)] = {};
> +
>  	if (!bpf_prog_kallsyms_candidate(fp))
>  		return;
>
>  	spin_lock_bh(&bpf_lock);
>  	bpf_prog_ksym_node_del(fp->aux);
>  	spin_unlock_bh(&bpf_lock);
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sizeof(buf));
>  }

So perf doesn't normally issue unmap events.. We've talked about doing that, but so far it's never really been needed, I think.

It feels a bit weird to start issuing unmap events for this.
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
> On Sep 19, 2018, at 5:59 PM, Alexei Starovoitov wrote:
>
> On 9/19/18 4:44 PM, Song Liu wrote:
>>
>>
>>> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov wrote:
>>>
>>> use perf_event_mmap_bpf_prog() helper to notify user space
>>> about JITed bpf programs.
>>> Use RECORD_MMAP perf event to tell user space where JITed bpf program was loaded.
>>> Use empty program name as unload indication.
>>>
>>> Signed-off-by: Alexei Starovoitov
>>> ---
>>> kernel/bpf/core.c | 22 ++++++++++++++++++++--
>>> 1 file changed, 20 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>>> index 3f5bf1af0826..ddf11fdafd36 100644
>>> --- a/kernel/bpf/core.c
>>> +++ b/kernel/bpf/core.c
>>> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
>>> 	*symbol_end = addr + hdr->pages * PAGE_SIZE;
>>> }
>>>
>>> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> {
>>> 	const char *end = sym + KSYM_NAME_LEN;
>>>
>>> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
>>> 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
>>> 	sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
>>> 	if (prog->aux->name[0])
>>> -		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> +		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
>>> 	else
>>> 		*sym = 0;
>>> +	return sym;
>>> }
>>>
>>> static __always_inline unsigned long
>>>
>>> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>>>
>>> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
>>> {
>>> +	unsigned long symbol_start, symbol_end;
>>> +	char buf[KSYM_NAME_LEN], *sym;
>>> +
>>> 	if (!bpf_prog_kallsyms_candidate(fp) ||
>>> 	    !capable(CAP_SYS_ADMIN))
>>> 		return;
>>>
>>> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
>>> +	sym = bpf_get_prog_name(fp, buf);
>>> +	sym++; /* sym - buf is the length of the name including trailing 0 */
>>> +	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
>>> +		*sym++ = 0;
>>
>> nit: This logic feels a little weird to me. How about we wrap the extra logic
>> in a separate function:
>>
>> size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog fp, char *buf)
>
> probably bpf_get_prog_name_u64_padded() ?
> would be cleaner indeed.

bpf_get_prog_name_u64_padded does sound better.

> Will send v2 once Peter and Arnaldo provide their feedback.
>
> Thanks for reviewing!
Re: [PATCH bpf-next 2/3] bpf: emit RECORD_MMAP events for bpf prog load/unload
> On Sep 19, 2018, at 3:39 PM, Alexei Starovoitov wrote:
>
> use perf_event_mmap_bpf_prog() helper to notify user space
> about JITed bpf programs.
> Use RECORD_MMAP perf event to tell user space where JITed bpf program was
> loaded.
> Use empty program name as unload indication.
>
> Signed-off-by: Alexei Starovoitov
> ---
> kernel/bpf/core.c | 22 ++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 3f5bf1af0826..ddf11fdafd36 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -384,7 +384,7 @@ bpf_get_prog_addr_region(const struct bpf_prog *prog,
> 	*symbol_end = addr + hdr->pages * PAGE_SIZE;
> }
>
> -static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> +static char *bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> {
> 	const char *end = sym + KSYM_NAME_LEN;
>
> @@ -402,9 +402,10 @@ static void bpf_get_prog_name(const struct bpf_prog *prog, char *sym)
> 	sym += snprintf(sym, KSYM_NAME_LEN, "bpf_prog_");
> 	sym = bin2hex(sym, prog->tag, sizeof(prog->tag));
> 	if (prog->aux->name[0])
> -		snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> +		sym += snprintf(sym, (size_t)(end - sym), "_%s", prog->aux->name);
> 	else
> 		*sym = 0;
> +	return sym;
> }
>
> static __always_inline unsigned long
> @@ -480,23 +481,40 @@ static bool bpf_prog_kallsyms_verify_off(const struct bpf_prog *fp)
>
> void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> {
> +	unsigned long symbol_start, symbol_end;
> +	char buf[KSYM_NAME_LEN], *sym;
> +
> 	if (!bpf_prog_kallsyms_candidate(fp) ||
> 	    !capable(CAP_SYS_ADMIN))
> 		return;
>
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	sym = bpf_get_prog_name(fp, buf);
> +	sym++; /* sym - buf is the length of the name including trailing 0 */
> +	while (!IS_ALIGNED(sym - buf, sizeof(u64)))
> +		*sym++ = 0;

nit: This logic feels a little weird to me. How about we wrap the extra logic
in a separate function:

size_t bpf_get_prog_name_u64_aligned(const struct bpf_prog *fp, char *buf)

where the return value is the u64 aligned size. Other than this

Acked-by: Song Liu

> 	spin_lock_bh(&bpf_lock);
> 	bpf_prog_ksym_node_add(fp->aux);
> 	spin_unlock_bh(&bpf_lock);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sym - buf);
> }
>
> void bpf_prog_kallsyms_del(struct bpf_prog *fp)
> {
> +	unsigned long symbol_start, symbol_end;
> +	/* mmap_record.filename cannot be NULL and has to be u64 aligned */
> +	char buf[sizeof(u64)] = {};
> +
> 	if (!bpf_prog_kallsyms_candidate(fp))
> 		return;
>
> 	spin_lock_bh(&bpf_lock);
> 	bpf_prog_ksym_node_del(fp->aux);
> 	spin_unlock_bh(&bpf_lock);
> +	bpf_get_prog_addr_region(fp, &symbol_start, &symbol_end);
> +	perf_event_mmap_bpf_prog(symbol_start, symbol_end - symbol_start,
> +				 buf, sizeof(buf));
> }
>
> static struct bpf_prog *bpf_prog_kallsyms_find(unsigned long addr)
> --
> 2.17.1
>