Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Tue, Sep 11, 2018 at 04:42:09PM +0300, Alexey Budankov wrote: > Hi, > > On 11.09.2018 11:34, Jiri Olsa wrote: > > On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote: > >> > >> Hi Ingo, > >> > >> On 11.09.2018 9:35, Ingo Molnar wrote: > >>> > >>> * Alexey Budankov wrote: > >>> > It may sound too optimistic but glibc API is expected to be backward > compatible > and for POSIX AIO API part too. Internal implementation also tends to > evolve to > better option overtime, more probably basing on modern kernel > capabilities > mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html > >>> > >>> I'm not talking about compatibility, and I'm not just talking about > >>> glibc, perf works under > >>> other libcs as well - and let me phrase it in another way: basic event > >>> handling, threading, > >>> scheduling internals should be a *core competency* of a tracing/profiling > >>> tool. > >> > >> Well, the requirement of independence from some specific libc > >> implementation > >> as well as *core competency* design approach clarify a lot. Thanks! > >> > >>> > >>> I.e. we might end up using the exact same per event fd thread pool design > >>> that glibc uses > >>> currently. Or not. Having that internal and open coded to perf, like Jiri > >>> has started > >>> implementing it, allows people to experiment with it. > >> > >> My point here is that following some standardized programming models and > >> APIs > >> (like POSIX) in the tool code, even if the tool itself provides internal > >> open > >> coded implementation for the APIs, would simplify experimenting with the > >> tool > >> as well as lower barriers for new comers. Perf project could benefit from > >> that. > >> > >>> > >>> This isn't some GUI toolkit, this is at the essence of perf, and we are > >>> not very good on large > >>> systems right now, and I think the design should be open-coded threading, > >>> not relying on an > >>> (perf-)external AIO library to get it right. > >>> > >>> The glibc thread pool implementation of POSIX AIO is basically a > >>> fall-back > >>> implementation, for the case where there's no native KAIO interface to > >>> rely on. > >>> > Well, explicit threading in the tool for AIO, in the simplest case, > means > incorporating some POSIX API implementation into the tool, avoiding > code reuse in the first place. That tends to be error prone and costly. > >>> > >>> It's a core competency, we better do it right and not outsource it. > >> > >> Yep, makes sense. > > > > on the other hand, we are already trying to tie this up under perf_mmap > > object, which is what the threaded patchset operates on.. so I'm quite > > confident that with little effort we could make those 2 things live next > > to each other and let the user to decide which one to take and compare > > > > possibilities would be like: (not sure yet the last one makes sense, but > > still..) > > > > # perf record --threads=... ... > > # perf record --aio ... > > # perf record --threads=... --aio ... > > > > how about that? > > That might be an option. What is the semantics of --threads? that's my latest post on this: https://marc.info/?l=linux-kernel&m=151551213322861&w=2 working on repost ;-) jirka > > Be aware that when experimenting with serial trace writing on an 8-core > client machines running an HPC benchmark heavily utilizing all 8 cores > we noticed that single Perf tool thread contended with the benchmark > threads. 
> > That manifested like libiomp.so (Intel OpenMP implementation) functions > appearing among the top hotspots functions and this was indication of > imbalance induced by the tool during profiling. > > That's why we decided to first go with AIO approach, as it is posted, > and benefit from it the most thru multi AIO, prior turning to more > resource consuming multi-threading alternative. > > > > > I just rebased the thread patchset, will make some tests (it's been few > > months, > > so it needs some kicking/checking) and post it out hopefuly this week> > > jirka > >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi,

On 11.09.2018 17:19, Peter Zijlstra wrote:
> On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote:
>>> Well, explicit threading in the tool for AIO, in the simplest case, means
>>> incorporating some POSIX API implementation into the tool, avoiding
>>> code reuse in the first place. That tends to be error prone and costly.
>>
>> It's a core competency, we better do it right and not outsource it.
>>
>> Please take a look at Jiri's patches (once he re-posts them), I think it's a
>> very good starting point.
>
> There's another reason for doing custom per-cpu threads; it avoids
> bouncing the buffer memory around the machine. If the task doing the
> buffer reads is the exact same as the one doing the writes, there's less
> memory traffic on the interconnects.

Yeah, NUMA does matter. Memory locality, i.e. cache sizes and NUMA domains
for kernel/user buffer allocation, needs to be taken into account by an
effective solution. Luckily, data loss has not been observed when testing
matrix multiplication on 96-core dual-socket machines.

>
> Also, I think we can avoid the MFENCE in that case, but I'm not sure
> that one is hot enough to bother about on the perf reading side of
> things.

Yep, *FENCE may be costly in HW, especially at larger scale.

Thanks,
Alexey
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Tue, Sep 11, 2018 at 08:35:12AM +0200, Ingo Molnar wrote: > > Well, explicit threading in the tool for AIO, in the simplest case, means > > incorporating some POSIX API implementation into the tool, avoiding > > code reuse in the first place. That tends to be error prone and costly. > > It's a core competency, we better do it right and not outsource it. > > Please take a look at Jiri's patches (once he re-posts them), I think it's a > very good > starting point. There's another reason for doing custom per-cpu threads; it avoids bouncing the buffer memory around the machine. If the task doing the buffer reads is the exact same as the one doing the writes, there's less memory traffic on the interconnects. Also, I think we can avoid the MFENCE in that case, but I'm not sure that one is hot enough to bother about on the perf reading side of things.
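A minimal sketch of the per-CPU reader idea described above: one thread per CPU, pinned to the CPU whose ring buffer it drains, so consumer and producer share the same CPU and NUMA node. drain_stub() stands in for a hypothetical per-CPU buffer drain routine; this is an illustration, not perf code.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

struct reader_arg {
	int cpu;                 /* CPU whose ring buffer this thread drains */
	void (*drain)(int cpu);  /* hypothetical per-CPU drain routine */
};

static void drain_stub(int cpu)
{
	printf("draining ring buffer of cpu %d while running on cpu %d\n",
	       cpu, sched_getcpu());
}

static void *reader_thread(void *p)
{
	struct reader_arg *arg = p;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(arg->cpu, &set);
	/*
	 * Keep the consumer on the producer's CPU: reads and writes of the
	 * mmap'ed buffer then stay local and are not bounced across nodes.
	 */
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		fprintf(stderr, "affinity failed for cpu %d\n", arg->cpu);

	arg->drain(arg->cpu);
	return NULL;
}

int main(void)
{
	long cpu, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tids[256];
	struct reader_arg args[256];

	for (cpu = 0; cpu < ncpus && cpu < 256; cpu++) {
		args[cpu].cpu = cpu;
		args[cpu].drain = drain_stub;
		pthread_create(&tids[cpu], NULL, reader_thread, &args[cpu]);
	}
	for (cpu = 0; cpu < ncpus && cpu < 256; cpu++)
		pthread_join(tids[cpu], NULL);
	return 0;
}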
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi, On 11.09.2018 11:34, Jiri Olsa wrote: > On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote: >> >> Hi Ingo, >> >> On 11.09.2018 9:35, Ingo Molnar wrote: >>> >>> * Alexey Budankov wrote: >>> It may sound too optimistic but glibc API is expected to be backward compatible and for POSIX AIO API part too. Internal implementation also tends to evolve to better option overtime, more probably basing on modern kernel capabilities mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html >>> >>> I'm not talking about compatibility, and I'm not just talking about glibc, >>> perf works under >>> other libcs as well - and let me phrase it in another way: basic event >>> handling, threading, >>> scheduling internals should be a *core competency* of a tracing/profiling >>> tool. >> >> Well, the requirement of independence from some specific libc implementation >> as well as *core competency* design approach clarify a lot. Thanks! >> >>> >>> I.e. we might end up using the exact same per event fd thread pool design >>> that glibc uses >>> currently. Or not. Having that internal and open coded to perf, like Jiri >>> has started >>> implementing it, allows people to experiment with it. >> >> My point here is that following some standardized programming models and >> APIs >> (like POSIX) in the tool code, even if the tool itself provides internal >> open >> coded implementation for the APIs, would simplify experimenting with the >> tool >> as well as lower barriers for new comers. Perf project could benefit from >> that. >> >>> >>> This isn't some GUI toolkit, this is at the essence of perf, and we are not >>> very good on large >>> systems right now, and I think the design should be open-coded threading, >>> not relying on an >>> (perf-)external AIO library to get it right. >>> >>> The glibc thread pool implementation of POSIX AIO is basically a fall-back >>> implementation, for the case where there's no native KAIO interface to rely >>> on. >>> Well, explicit threading in the tool for AIO, in the simplest case, means incorporating some POSIX API implementation into the tool, avoiding code reuse in the first place. That tends to be error prone and costly. >>> >>> It's a core competency, we better do it right and not outsource it. >> >> Yep, makes sense. > > on the other hand, we are already trying to tie this up under perf_mmap > object, which is what the threaded patchset operates on.. so I'm quite > confident that with little effort we could make those 2 things live next > to each other and let the user to decide which one to take and compare > > possibilities would be like: (not sure yet the last one makes sense, but > still..) > > # perf record --threads=... ... > # perf record --aio ... > # perf record --threads=... --aio ... > > how about that? That might be an option. What is the semantics of --threads? Be aware that when experimenting with serial trace writing on an 8-core client machines running an HPC benchmark heavily utilizing all 8 cores we noticed that single Perf tool thread contended with the benchmark threads. That manifested like libiomp.so (Intel OpenMP implementation) functions appearing among the top hotspots functions and this was indication of imbalance induced by the tool during profiling. That's why we decided to first go with AIO approach, as it is posted, and benefit from it the most thru multi AIO, prior turning to more resource consuming multi-threading alternative. 
> > I just rebased the thread patchset, will make some tests (it's been few > months, > so it needs some kicking/checking) and post it out hopefuly this week> > jirka >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Tue, Sep 11, 2018 at 11:16:45AM +0300, Alexey Budankov wrote: > > Hi Ingo, > > On 11.09.2018 9:35, Ingo Molnar wrote: > > > > * Alexey Budankov wrote: > > > >> It may sound too optimistic but glibc API is expected to be backward > >> compatible > >> and for POSIX AIO API part too. Internal implementation also tends to > >> evolve to > >> better option overtime, more probably basing on modern kernel capabilities > >> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html > > > > I'm not talking about compatibility, and I'm not just talking about glibc, > > perf works under > > other libcs as well - and let me phrase it in another way: basic event > > handling, threading, > > scheduling internals should be a *core competency* of a tracing/profiling > > tool. > > Well, the requirement of independence from some specific libc implementation > as well as *core competency* design approach clarify a lot. Thanks! > > > > > I.e. we might end up using the exact same per event fd thread pool design > > that glibc uses > > currently. Or not. Having that internal and open coded to perf, like Jiri > > has started > > implementing it, allows people to experiment with it. > > My point here is that following some standardized programming models and APIs > (like POSIX) in the tool code, even if the tool itself provides internal open > coded implementation for the APIs, would simplify experimenting with the tool > as well as lower barriers for new comers. Perf project could benefit from > that. > > > > > This isn't some GUI toolkit, this is at the essence of perf, and we are not > > very good on large > > systems right now, and I think the design should be open-coded threading, > > not relying on an > > (perf-)external AIO library to get it right. > > > > The glibc thread pool implementation of POSIX AIO is basically a fall-back > > implementation, for the case where there's no native KAIO interface to rely > > on. > > > >> Well, explicit threading in the tool for AIO, in the simplest case, means > >> incorporating some POSIX API implementation into the tool, avoiding > >> code reuse in the first place. That tends to be error prone and costly. > > > > It's a core competency, we better do it right and not outsource it. > > Yep, makes sense. on the other hand, we are already trying to tie this up under perf_mmap object, which is what the threaded patchset operates on.. so I'm quite confident that with little effort we could make those 2 things live next to each other and let the user to decide which one to take and compare possibilities would be like: (not sure yet the last one makes sense, but still..) # perf record --threads=... ... # perf record --aio ... # perf record --threads=... --aio ... how about that? I just rebased the thread patchset, will make some tests (it's been few months, so it needs some kicking/checking) and post it out hopefuly this week jirka
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi Ingo, On 11.09.2018 9:35, Ingo Molnar wrote: > > * Alexey Budankov wrote: > >> It may sound too optimistic but glibc API is expected to be backward >> compatible >> and for POSIX AIO API part too. Internal implementation also tends to evolve >> to >> better option overtime, more probably basing on modern kernel capabilities >> mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html > > I'm not talking about compatibility, and I'm not just talking about glibc, > perf works under > other libcs as well - and let me phrase it in another way: basic event > handling, threading, > scheduling internals should be a *core competency* of a tracing/profiling > tool. Well, the requirement of independence from some specific libc implementation as well as *core competency* design approach clarify a lot. Thanks! > > I.e. we might end up using the exact same per event fd thread pool design > that glibc uses > currently. Or not. Having that internal and open coded to perf, like Jiri has > started > implementing it, allows people to experiment with it. My point here is that following some standardized programming models and APIs (like POSIX) in the tool code, even if the tool itself provides internal open coded implementation for the APIs, would simplify experimenting with the tool as well as lower barriers for new comers. Perf project could benefit from that. > > This isn't some GUI toolkit, this is at the essence of perf, and we are not > very good on large > systems right now, and I think the design should be open-coded threading, not > relying on an > (perf-)external AIO library to get it right. > > The glibc thread pool implementation of POSIX AIO is basically a fall-back > implementation, for the case where there's no native KAIO interface to rely > on. > >> Well, explicit threading in the tool for AIO, in the simplest case, means >> incorporating some POSIX API implementation into the tool, avoiding >> code reuse in the first place. That tends to be error prone and costly. > > It's a core competency, we better do it right and not outsource it. Yep, makes sense. Thanks! Alexey > > Please take a look at Jiri's patches (once he re-posts them), I think it's a > very good > starting point. > > Thanks, > > Ingo >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
* Alexey Budankov wrote: > It may sound too optimistic but glibc API is expected to be backward > compatible > and for POSIX AIO API part too. Internal implementation also tends to evolve > to > better option overtime, more probably basing on modern kernel capabilities > mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html I'm not talking about compatibility, and I'm not just talking about glibc, perf works under other libcs as well - and let me phrase it in another way: basic event handling, threading, scheduling internals should be a *core competency* of a tracing/profiling tool. I.e. we might end up using the exact same per event fd thread pool design that glibc uses currently. Or not. Having that internal and open coded to perf, like Jiri has started implementing it, allows people to experiment with it. This isn't some GUI toolkit, this is at the essence of perf, and we are not very good on large systems right now, and I think the design should be open-coded threading, not relying on an (perf-)external AIO library to get it right. The glibc thread pool implementation of POSIX AIO is basically a fall-back implementation, for the case where there's no native KAIO interface to rely on. > Well, explicit threading in the tool for AIO, in the simplest case, means > incorporating some POSIX API implementation into the tool, avoiding > code reuse in the first place. That tends to be error prone and costly. It's a core competency, we better do it right and not outsource it. Please take a look at Jiri's patches (once he re-posts them), I think it's a very good starting point. Thanks, Ingo
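For context on the per event fd thread pool design mentioned above, a rough open-coded sketch of such a pool follows: the record loop would enqueue ready data chunks instead of calling write() directly, and a fixed set of worker threads would drain the queue with pwrite(). queue_write(), pool_worker() and the queue sizes are illustrative assumptions, not perf or glibc code, and the sketch omits a full-queue check.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define POOL_SIZE 4
#define QUEUE_LEN 64

struct write_req {
	int    fd;      /* destination trace file descriptor */
	void  *buf;     /* data chunk copied out of the ring buffer */
	size_t size;
	off_t  off;
};

static struct write_req queue[QUEUE_LEN];
static unsigned q_head, q_tail;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_cond = PTHREAD_COND_INITIALIZER;

/* Producer side: the record loop enqueues a chunk instead of calling write(). */
static void queue_write(int fd, void *buf, size_t size, off_t off)
{
	pthread_mutex_lock(&q_lock);
	queue[q_tail++ % QUEUE_LEN] = (struct write_req){ fd, buf, size, off };
	pthread_cond_signal(&q_cond);
	pthread_mutex_unlock(&q_lock);
}

/* Consumer side: pool threads drain the queue with pwrite(). */
static void *pool_worker(void *arg)
{
	for (;;) {
		struct write_req req;

		pthread_mutex_lock(&q_lock);
		while (q_head == q_tail)
			pthread_cond_wait(&q_cond, &q_lock);
		req = queue[q_head++ % QUEUE_LEN];
		pthread_mutex_unlock(&q_lock);

		if (pwrite(req.fd, req.buf, req.size, req.off) < 0)
			perror("pwrite");
	}
	return NULL;
}

static void start_pool(void)
{
	pthread_t tid;

	for (int i = 0; i < POOL_SIZE; i++)
		pthread_create(&tid, NULL, pool_worker, NULL);
}

int main(void)
{
	static char chunk[4096];  /* stands in for a ready data chunk */
	int fd = open("/tmp/pool-test.data", O_WRONLY | O_CREAT, 0644);

	start_pool();
	queue_write(fd, chunk, sizeof(chunk), 0);
	sleep(1);                 /* give a worker time to drain the queue */
	return 0;
}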
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi, On 10.09.2018 16:58, Arnaldo Carvalho de Melo wrote: > Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu: >> * Alexey Budankov wrote: >>> On 10.09.2018 12:18, Ingo Molnar wrote: * Alexey Budankov wrote: > Currently in record mode the tool implements trace writing serially. > The algorithm loops over mapped per-cpu data buffers and stores > ready data chunks into a trace file using write() system call. > > At some circumstances the kernel may lack free space in a buffer > because the other buffer's half is not yet written to disk due to > some other buffer's data writing by the tool at the moment. > > Thus serial trace writing implementation may cause the kernel > to loose profiling data and that is what observed when profiling > highly parallel CPU bound workloads on machines with big number > of cores. Yay! I saw this frequently on a 120-CPU box (hw is broken now). > Data loss metrics is the ratio lost_time/elapsed_time where > lost_time is the sum of time intervals containing PERF_RECORD_LOST > records and elapsed_time is the elapsed application run time > under profiling. > > Applying asynchronous trace streaming thru Posix AIO API > (http://man7.org/linux/man-pages/man7/aio.7.html) > lowers data loss metrics value providing 2x improvement - > lowering 98% loss to almost 0%. Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back to threads anyway when there's no kernel AIO support (which there probably isn't for perf events). >>> >>> Explicit threading is surely an option but having more threads >>> in the tool that stream performance data is a considerable >>> design complication. >>> >>> Luckily, glibc AIO implementation is already based on pthreads, >>> but having a writing thread for every distinct fd only. >> >> My argument is, we don't want to rely on glibc's choices here. They might >> use a different threading design in the future, or it might differ between >> libc versions. >> >> The basic flow of tracing/profiling data is something we should control >> explicitly, >> via explicit threading. >> >> BTW., the usecase I was primarily concentrating on was a simpler one: 'perf >> record -a', not >> inherited workflow tracing. For system-wide profiling the ideal tracing >> setup is clean per-CPU >> separation, i.e. per CPU event fds, per CPU threads that read and then write >> into separate >> per-CPU files. > > My main request here is that we think about the 'perf top' and 'perf > trace' workflows as well when working on this, i.e. that we don't take > for granted that we'll have the perf.data files to work with. Made manual sanity checks of perf top and perf trace modes using the same matrix multiplication workload. The modes look working after applying the patch set. Regards, Alexey > > I.e. N threads, that periodically use that FINISHED_ROUND event to order > events and go on consuming. All of the objects already have refcounts > and locking to allow for things like decaying of samples to take care of > trowing away no longer needed objects (struct map, thread, dso, symbol > tables, etc) to trim memory usage. > > - Arnaldo >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi Ingo, On 10.09.2018 15:06, Ingo Molnar wrote: > > * Alexey Budankov wrote: > >> Hi Ingo, >> >> On 10.09.2018 12:18, Ingo Molnar wrote: >>> >>> * Alexey Budankov wrote: >>> Currently in record mode the tool implements trace writing serially. The algorithm loops over mapped per-cpu data buffers and stores ready data chunks into a trace file using write() system call. At some circumstances the kernel may lack free space in a buffer because the other buffer's half is not yet written to disk due to some other buffer's data writing by the tool at the moment. Thus serial trace writing implementation may cause the kernel to loose profiling data and that is what observed when profiling highly parallel CPU bound workloads on machines with big number of cores. >>> >>> Yay! I saw this frequently on a 120-CPU box (hw is broken now). >>> Data loss metrics is the ratio lost_time/elapsed_time where lost_time is the sum of time intervals containing PERF_RECORD_LOST records and elapsed_time is the elapsed application run time under profiling. Applying asynchronous trace streaming thru Posix AIO API (http://man7.org/linux/man-pages/man7/aio.7.html) lowers data loss metrics value providing 2x improvement - lowering 98% loss to almost 0%. >>> >>> Hm, instead of AIO why don't we use explicit threads instead? I think Posix >>> AIO will fall back >>> to threads anyway when there's no kernel AIO support (which there probably >>> isn't for perf >>> events). >> >> Explicit threading is surely an option but having more threads >> in the tool that stream performance data is a considerable >> design complication. >> >> Luckily, glibc AIO implementation is already based on pthreads, >> but having a writing thread for every distinct fd only. > > My argument is, we don't want to rely on glibc's choices here. They might > use a different threading design in the future, or it might differ between > libc versions.> > The basic flow of tracing/profiling data is something we should control > explicitly, > via explicit threading. It may sound too optimistic but glibc API is expected to be backward compatible and for POSIX AIO API part too. Internal implementation also tends to evolve to better option overtime, more probably basing on modern kernel capabilities mentioned here: http://man7.org/linux/man-pages/man2/io_submit.2.html Well, explicit threading in the tool for AIO, in the simplest case, means incorporating some POSIX API implementation into the tool, avoiding code reuse in the first place. That tends to be error prone and costly. Regards, Alexey > > BTW., the usecase I was primarily concentrating on was a simpler one: 'perf > record -a', not > inherited workflow tracing. For system-wide profiling the ideal tracing setup > is clean per-CPU > separation, i.e. per CPU event fds, per CPU threads that read and then write > into separate > per-CPU files. > > Thanks, > > Ingo >
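For reference, the io_submit(2) interface pointed to above is the kernel-native AIO path. A bare-bones sketch of its call sequence follows, using raw syscalls since glibc provides no wrappers; the file name is a placeholder, error handling is trimmed, and note that without O_DIRECT the kernel typically completes such requests synchronously.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long io_setup(unsigned nr, aio_context_t *ctx)
{
	return syscall(__NR_io_setup, nr, ctx);
}
static long io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp)
{
	return syscall(__NR_io_submit, ctx, nr, iocbpp);
}
static long io_getevents(aio_context_t ctx, long min_nr, long nr,
			 struct io_event *events, struct timespec *timeout)
{
	return syscall(__NR_io_getevents, ctx, min_nr, nr, events, timeout);
}

int main(void)
{
	aio_context_t ctx = 0;
	static char buf[4096] = "perf sample data\n";
	int fd = open("/tmp/aio-test.data", O_WRONLY | O_CREAT, 0644);
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;

	io_setup(8, &ctx);                  /* context able to hold 8 requests */

	memset(&cb, 0, sizeof(cb));
	cb.aio_lio_opcode = IOCB_CMD_PWRITE;
	cb.aio_fildes     = fd;
	cb.aio_buf        = (__u64)(unsigned long)buf;
	cb.aio_nbytes     = sizeof(buf);
	cb.aio_offset     = 0;

	io_submit(ctx, 1, cbs);             /* queue the write */
	io_getevents(ctx, 1, 1, &ev, NULL); /* wait for its completion */
	printf("aio write completed, res=%lld\n", (long long)ev.res);

	close(fd);
	syscall(__NR_io_destroy, ctx);
	return 0;
}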
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Em Mon, Sep 10, 2018 at 02:06:43PM +0200, Ingo Molnar escreveu: > * Alexey Budankov wrote: > > On 10.09.2018 12:18, Ingo Molnar wrote: > > > * Alexey Budankov wrote: > > >> Currently in record mode the tool implements trace writing serially. > > >> The algorithm loops over mapped per-cpu data buffers and stores > > >> ready data chunks into a trace file using write() system call. > > >> > > >> At some circumstances the kernel may lack free space in a buffer > > >> because the other buffer's half is not yet written to disk due to > > >> some other buffer's data writing by the tool at the moment. > > >> > > >> Thus serial trace writing implementation may cause the kernel > > >> to loose profiling data and that is what observed when profiling > > >> highly parallel CPU bound workloads on machines with big number > > >> of cores. > > > > > > Yay! I saw this frequently on a 120-CPU box (hw is broken now). > > > > > >> Data loss metrics is the ratio lost_time/elapsed_time where > > >> lost_time is the sum of time intervals containing PERF_RECORD_LOST > > >> records and elapsed_time is the elapsed application run time > > >> under profiling. > > >> > > >> Applying asynchronous trace streaming thru Posix AIO API > > >> (http://man7.org/linux/man-pages/man7/aio.7.html) > > >> lowers data loss metrics value providing 2x improvement - > > >> lowering 98% loss to almost 0%. > > > > > > Hm, instead of AIO why don't we use explicit threads instead? I think > > > Posix AIO will fall back > > > to threads anyway when there's no kernel AIO support (which there > > > probably isn't for perf > > > events). > > > > Explicit threading is surely an option but having more threads > > in the tool that stream performance data is a considerable > > design complication. > > > > Luckily, glibc AIO implementation is already based on pthreads, > > but having a writing thread for every distinct fd only. > > My argument is, we don't want to rely on glibc's choices here. They might > use a different threading design in the future, or it might differ between > libc versions. > > The basic flow of tracing/profiling data is something we should control > explicitly, > via explicit threading. > > BTW., the usecase I was primarily concentrating on was a simpler one: 'perf > record -a', not > inherited workflow tracing. For system-wide profiling the ideal tracing setup > is clean per-CPU > separation, i.e. per CPU event fds, per CPU threads that read and then write > into separate > per-CPU files. My main request here is that we think about the 'perf top' and 'perf trace' workflows as well when working on this, i.e. that we don't take for granted that we'll have the perf.data files to work with. I.e. N threads, that periodically use that FINISHED_ROUND event to order events and go on consuming. All of the objects already have refcounts and locking to allow for things like decaying of samples to take care of trowing away no longer needed objects (struct map, thread, dso, symbol tables, etc) to trim memory usage. - Arnaldo
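A rough sketch of the FINISHED_ROUND style ordering Arnaldo refers to, as an assumption about the mechanism rather than perf's actual code: events are buffered per round, and once a round boundary arrives, everything with a timestamp at or below the previous round's maximum can be sorted and delivered.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct sample { uint64_t time; /* decoded event payload would follow */ };

static int cmp_time(const void *a, const void *b)
{
	const struct sample *x = a, *y = b;
	return (x->time > y->time) - (x->time < y->time);
}

/* Deliver every buffered sample not newer than 'limit', in timestamp order. */
static size_t flush_round(struct sample *buf, size_t n, uint64_t limit)
{
	size_t i, kept = 0;

	qsort(buf, n, sizeof(*buf), cmp_time);
	for (i = 0; i < n; i++) {
		if (buf[i].time <= limit)
			printf("deliver sample @%llu\n",
			       (unsigned long long)buf[i].time);
		else
			buf[kept++] = buf[i];   /* belongs to the next round */
	}
	return kept;                            /* samples still buffered */
}

int main(void)
{
	struct sample buf[] = { { 30 }, { 10 }, { 50 }, { 20 } };
	size_t n = sizeof(buf) / sizeof(buf[0]);

	/* Suppose the previous round's maximum timestamp was 25. */
	n = flush_round(buf, n, 25);
	printf("%zu sample(s) carried over to the next round\n", n);
	return 0;
}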
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
* Alexey Budankov wrote: > Hi Ingo, > > On 10.09.2018 12:18, Ingo Molnar wrote: > > > > * Alexey Budankov wrote: > > > >> > >> Currently in record mode the tool implements trace writing serially. > >> The algorithm loops over mapped per-cpu data buffers and stores > >> ready data chunks into a trace file using write() system call. > >> > >> At some circumstances the kernel may lack free space in a buffer > >> because the other buffer's half is not yet written to disk due to > >> some other buffer's data writing by the tool at the moment. > >> > >> Thus serial trace writing implementation may cause the kernel > >> to loose profiling data and that is what observed when profiling > >> highly parallel CPU bound workloads on machines with big number > >> of cores. > > > > Yay! I saw this frequently on a 120-CPU box (hw is broken now). > > > >> Data loss metrics is the ratio lost_time/elapsed_time where > >> lost_time is the sum of time intervals containing PERF_RECORD_LOST > >> records and elapsed_time is the elapsed application run time > >> under profiling. > >> > >> Applying asynchronous trace streaming thru Posix AIO API > >> (http://man7.org/linux/man-pages/man7/aio.7.html) > >> lowers data loss metrics value providing 2x improvement - > >> lowering 98% loss to almost 0%. > > > > Hm, instead of AIO why don't we use explicit threads instead? I think Posix > > AIO will fall back > > to threads anyway when there's no kernel AIO support (which there probably > > isn't for perf > > events). > > Explicit threading is surely an option but having more threads > in the tool that stream performance data is a considerable > design complication. > > Luckily, glibc AIO implementation is already based on pthreads, > but having a writing thread for every distinct fd only. My argument is, we don't want to rely on glibc's choices here. They might use a different threading design in the future, or it might differ between libc versions. The basic flow of tracing/profiling data is something we should control explicitly, via explicit threading. BTW., the usecase I was primarily concentrating on was a simpler one: 'perf record -a', not inherited workflow tracing. For system-wide profiling the ideal tracing setup is clean per-CPU separation, i.e. per CPU event fds, per CPU threads that read and then write into separate per-CPU files. Thanks, Ingo
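A sketch of the clean per-CPU separation described above, showing only the fd setup: one perf_event_open() descriptor per CPU with pid == -1, each of which a dedicated thread could mmap and stream into its own per-CPU file. The event type and sample period are placeholder values.

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int perf_event_open_cpu(int cpu)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size          = sizeof(attr);
	attr.type          = PERF_TYPE_HARDWARE;
	attr.config        = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;

	/* pid == -1, cpu == n: system-wide sampling restricted to this CPU */
	return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
	int cpu, ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	for (cpu = 0; cpu < ncpus; cpu++) {
		int fd = perf_event_open_cpu(cpu);

		if (fd < 0)
			perror("perf_event_open");
		else
			printf("cpu %d -> fd %d (would feed perf.data.%d)\n",
			       cpu, fd, cpu);
	}
	return 0;
}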
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi, On 10.09.2018 13:23, Jiri Olsa wrote: > On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote: >> >> * Jiri Olsa wrote: >> >>> On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote: * Jiri Olsa wrote: >> Per-CPU threading the record session would have so many other advantages >> as well (scalability, >> etc.). >> >> Jiri did per-CPU recording patches a couple of months ago, not sure how >> usable they are at the >> moment? > > it's still usable, I can rebase it and post a branch pointer, > the problem is I haven't been able to find a case with a real > performance benefit yet.. ;-) > > perhaps because I haven't tried on server with really big cpu > numbers Maybe Alexey could pick up from there? Your concept looked fairly mature to me and I tried it on a big-CPU box back then and there were real improvements. >>> >>> too bad u did not share your results, it could have been already in ;-) >> >> Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have >> tested it on >> broke ... >> >>> let me rebase/repost once more and let's see >> >> Thanks! >> >>> I think we could benefit from both multiple threads event reading >>> and AIO writing for perf.data.. it could be merged together >> >> So instead of AIO writing perf.data, why not just turn perf.data into a >> directory structure >> with per CPU files? That would allow all sorts of neat future performance >> features such as > > that's basically what the multiple-thread record patchset does Re-posting part of my answer here... Please note that tool threads may contend, and actually do, with application threads, under heavy load when all CPU cores are utilized, and this may alter performance profile. So this or that tool design is also a matter of proper system balancing when profiling so that the gathered performance data would be actual. Thanks, Alexey > > jirka > >> mmap() or splice() based zero-copy. >> >> User-space post-processing can then read the files and put them into global >> order - or use the >> per CPU nature of them, which would be pretty useful too. >> >> Also note how well this works on NUMA as well, as the backing pages would be >> allocated in a >> NUMA-local fashion. >> >> I.e. the whole per-CPU threading would enable such a separation of the >> tracing/event streams >> and would allow true scalability. >> >> Thanks, >> >> Ingo >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Hi Ingo, On 10.09.2018 12:18, Ingo Molnar wrote: > > * Alexey Budankov wrote: > >> >> Currently in record mode the tool implements trace writing serially. >> The algorithm loops over mapped per-cpu data buffers and stores >> ready data chunks into a trace file using write() system call. >> >> At some circumstances the kernel may lack free space in a buffer >> because the other buffer's half is not yet written to disk due to >> some other buffer's data writing by the tool at the moment. >> >> Thus serial trace writing implementation may cause the kernel >> to loose profiling data and that is what observed when profiling >> highly parallel CPU bound workloads on machines with big number >> of cores. > > Yay! I saw this frequently on a 120-CPU box (hw is broken now). > >> Data loss metrics is the ratio lost_time/elapsed_time where >> lost_time is the sum of time intervals containing PERF_RECORD_LOST >> records and elapsed_time is the elapsed application run time >> under profiling. >> >> Applying asynchronous trace streaming thru Posix AIO API >> (http://man7.org/linux/man-pages/man7/aio.7.html) >> lowers data loss metrics value providing 2x improvement - >> lowering 98% loss to almost 0%. > > Hm, instead of AIO why don't we use explicit threads instead? I think Posix > AIO will fall back > to threads anyway when there's no kernel AIO support (which there probably > isn't for perf > events). Explicit threading is surely an option but having more threads in the tool that stream performance data is a considerable design complication. Luckily, glibc AIO implementation is already based on pthreads, but having a writing thread for every distinct fd only. > > Per-CPU threading the record session would have so many other advantages as > well (scalability, > etc.).> > Jiri did per-CPU recording patches a couple of months ago, not sure how > usable they are at the > moment? Tool threads may contend, and actually do, with application threads, under heavy load when all CPU cores are utilized, and this may alter performance profile. Thanks, Alexey > > Thanks, > > Ingo >
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Mon, Sep 10, 2018 at 12:13:25PM +0200, Ingo Molnar wrote: > > * Jiri Olsa wrote: > > > On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote: > > > > > > * Jiri Olsa wrote: > > > > > > > > Per-CPU threading the record session would have so many other > > > > > advantages as well (scalability, > > > > > etc.). > > > > > > > > > > Jiri did per-CPU recording patches a couple of months ago, not sure > > > > > how usable they are at the > > > > > moment? > > > > > > > > it's still usable, I can rebase it and post a branch pointer, > > > > the problem is I haven't been able to find a case with a real > > > > performance benefit yet.. ;-) > > > > > > > > perhaps because I haven't tried on server with really big cpu > > > > numbers > > > > > > Maybe Alexey could pick up from there? Your concept looked fairly mature > > > to me > > > and I tried it on a big-CPU box back then and there were real > > > improvements. > > > > too bad u did not share your results, it could have been already in ;-) > > Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have > tested it on > broke ... > > > let me rebase/repost once more and let's see > > Thanks! > > > I think we could benefit from both multiple threads event reading > > and AIO writing for perf.data.. it could be merged together > > So instead of AIO writing perf.data, why not just turn perf.data into a > directory structure > with per CPU files? That would allow all sorts of neat future performance > features such as that's basically what the multiple-thread record patchset does jirka > mmap() or splice() based zero-copy. > > User-space post-processing can then read the files and put them into global > order - or use the > per CPU nature of them, which would be pretty useful too. > > Also note how well this works on NUMA as well, as the backing pages would be > allocated in a > NUMA-local fashion. > > I.e. the whole per-CPU threading would enable such a separation of the > tracing/event streams > and would allow true scalability. > > Thanks, > > Ingo
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
* Jiri Olsa wrote: > On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote: > > > > * Jiri Olsa wrote: > > > > > > Per-CPU threading the record session would have so many other > > > > advantages as well (scalability, > > > > etc.). > > > > > > > > Jiri did per-CPU recording patches a couple of months ago, not sure how > > > > usable they are at the > > > > moment? > > > > > > it's still usable, I can rebase it and post a branch pointer, > > > the problem is I haven't been able to find a case with a real > > > performance benefit yet.. ;-) > > > > > > perhaps because I haven't tried on server with really big cpu > > > numbers > > > > Maybe Alexey could pick up from there? Your concept looked fairly mature to > > me > > and I tried it on a big-CPU box back then and there were real improvements. > > too bad u did not share your results, it could have been already in ;-) Yeah :-/ Had a proper round of testing on my TODO, then the big box I'd have tested it on broke ... > let me rebase/repost once more and let's see Thanks! > I think we could benefit from both multiple threads event reading > and AIO writing for perf.data.. it could be merged together So instead of AIO writing perf.data, why not just turn perf.data into a directory structure with per CPU files? That would allow all sorts of neat future performance features such as mmap() or splice() based zero-copy. User-space post-processing can then read the files and put them into global order - or use the per CPU nature of them, which would be pretty useful too. Also note how well this works on NUMA as well, as the backing pages would be allocated in a NUMA-local fashion. I.e. the whole per-CPU threading would enable such a separation of the tracing/event streams and would allow true scalability. Thanks, Ingo
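To illustrate the user-space post-processing step mentioned above, a small merge sketch follows: given one event stream per CPU (standing in for per-CPU files inside a perf.data directory), repeatedly pick the stream whose next event has the smallest timestamp to rebuild a global order. The timestamp arrays are synthetic; this is not perf code.

#include <stdint.h>
#include <stdio.h>

#define NR_CPUS   4
#define NR_EVENTS 3

int main(void)
{
	/* Synthetic per-CPU timestamp streams standing in for perf.data.<cpu> */
	uint64_t stream[NR_CPUS][NR_EVENTS] = {
		{ 10, 40, 70 }, { 20, 50, 80 }, { 15, 45, 75 }, { 30, 60, 90 },
	};
	unsigned pos[NR_CPUS] = { 0 };

	for (;;) {
		int cpu, best = -1;
		uint64_t best_time = UINT64_MAX;

		/* Pick the stream whose next event is the oldest. */
		for (cpu = 0; cpu < NR_CPUS; cpu++) {
			if (pos[cpu] < NR_EVENTS &&
			    stream[cpu][pos[cpu]] < best_time) {
				best_time = stream[cpu][pos[cpu]];
				best = cpu;
			}
		}
		if (best < 0)
			break;          /* all per-CPU streams drained */

		printf("t=%llu from cpu %d\n",
		       (unsigned long long)best_time, best);
		pos[best]++;
	}
	return 0;
}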
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Mon, Sep 10, 2018 at 12:03:03PM +0200, Ingo Molnar wrote: > > * Jiri Olsa wrote: > > > > Per-CPU threading the record session would have so many other advantages > > > as well (scalability, > > > etc.). > > > > > > Jiri did per-CPU recording patches a couple of months ago, not sure how > > > usable they are at the > > > moment? > > > > it's still usable, I can rebase it and post a branch pointer, > > the problem is I haven't been able to find a case with a real > > performance benefit yet.. ;-) > > > > perhaps because I haven't tried on server with really big cpu > > numbers > > Maybe Alexey could pick up from there? Your concept looked fairly mature to me > and I tried it on a big-CPU box back then and there were real improvements. too bad u did not share your results, it could have been already in ;-) let me rebase/repost once more and let's see I think we could benefit from both multiple threads event reading and AIO writing for perf.data.. it could be merged together jirka
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
* Jiri Olsa wrote: > > Per-CPU threading the record session would have so many other advantages as > > well (scalability, > > etc.). > > > > Jiri did per-CPU recording patches a couple of months ago, not sure how > > usable they are at the > > moment? > > it's still usable, I can rebase it and post a branch pointer, > the problem is I haven't been able to find a case with a real > performance benefit yet.. ;-) > > perhaps because I haven't tried on server with really big cpu > numbers Maybe Alexey could pick up from there? Your concept looked fairly mature to me and I tried it on a big-CPU box back then and there were real improvements. Thanks, Ingo
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On Mon, Sep 10, 2018 at 11:18:41AM +0200, Ingo Molnar wrote: > > * Alexey Budankov wrote: > > > > > Currently in record mode the tool implements trace writing serially. > > The algorithm loops over mapped per-cpu data buffers and stores > > ready data chunks into a trace file using write() system call. > > > > At some circumstances the kernel may lack free space in a buffer > > because the other buffer's half is not yet written to disk due to > > some other buffer's data writing by the tool at the moment. > > > > Thus serial trace writing implementation may cause the kernel > > to loose profiling data and that is what observed when profiling > > highly parallel CPU bound workloads on machines with big number > > of cores. > > Yay! I saw this frequently on a 120-CPU box (hw is broken now). > > > Data loss metrics is the ratio lost_time/elapsed_time where > > lost_time is the sum of time intervals containing PERF_RECORD_LOST > > records and elapsed_time is the elapsed application run time > > under profiling. > > > > Applying asynchronous trace streaming thru Posix AIO API > > (http://man7.org/linux/man-pages/man7/aio.7.html) > > lowers data loss metrics value providing 2x improvement - > > lowering 98% loss to almost 0%. > > Hm, instead of AIO why don't we use explicit threads instead? I think Posix > AIO will fall back > to threads anyway when there's no kernel AIO support (which there probably > isn't for perf > events). this patch adds the aoi for writing to the perf.data file, reading of events is unchanged > > Per-CPU threading the record session would have so many other advantages as > well (scalability, > etc.). > > Jiri did per-CPU recording patches a couple of months ago, not sure how > usable they are at the > moment? it's still usable, I can rebase it and post a branch pointer, the problem is I haven't been able to find a case with a real performance benefit yet.. ;-) perhaps because I haven't tried on server with really big cpu numbers jirka
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
* Alexey Budankov wrote: > > Currently in record mode the tool implements trace writing serially. > The algorithm loops over mapped per-cpu data buffers and stores > ready data chunks into a trace file using write() system call. > > At some circumstances the kernel may lack free space in a buffer > because the other buffer's half is not yet written to disk due to > some other buffer's data writing by the tool at the moment. > > Thus serial trace writing implementation may cause the kernel > to loose profiling data and that is what observed when profiling > highly parallel CPU bound workloads on machines with big number > of cores. Yay! I saw this frequently on a 120-CPU box (hw is broken now). > Data loss metrics is the ratio lost_time/elapsed_time where > lost_time is the sum of time intervals containing PERF_RECORD_LOST > records and elapsed_time is the elapsed application run time > under profiling. > > Applying asynchronous trace streaming thru Posix AIO API > (http://man7.org/linux/man-pages/man7/aio.7.html) > lowers data loss metrics value providing 2x improvement - > lowering 98% loss to almost 0%. Hm, instead of AIO why don't we use explicit threads instead? I think Posix AIO will fall back to threads anyway when there's no kernel AIO support (which there probably isn't for perf events). Per-CPU threading the record session would have so many other advantages as well (scalability, etc.). Jiri did per-CPU recording patches a couple of months ago, not sure how usable they are at the moment? Thanks, Ingo
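For readers following the AIO versus explicit-threads discussion, here is a minimal sketch of the POSIX AIO calls in question (aio_write() plus aio_error()/aio_return() completion polling), roughly the pattern the posted patches use for trace writing. The file name and buffer are placeholders, and the busy-wait loop stands in for useful work; link with -lrt on older glibc.

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[64 * 1024];          /* stands in for a ready data chunk */
	struct aiocb cb;
	int fd = open("/tmp/trace.data", O_WRONLY | O_CREAT, 0644);

	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = buf;
	cb.aio_nbytes = sizeof(buf);
	cb.aio_offset = 0;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* we poll for completion */

	if (aio_write(&cb)) {                /* queue the write, don't block */
		perror("aio_write");
		return 1;
	}

	while (aio_error(&cb) == EINPROGRESS)
		;                            /* real code would do useful work here */

	printf("wrote %zd bytes asynchronously\n", aio_return(&cb));
	close(fd);
	return 0;
}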
Re: [PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
On 07.09.2018 10:07, Alexey Budankov wrote:
> 
> Currently in record mode the tool implements trace writing serially.
> The algorithm loops over mapped per-cpu data buffers and stores
> ready data chunks into a trace file using write() system call.
> 
> At some circumstances the kernel may lack free space in a buffer
> because the other buffer's half is not yet written to disk due to
> some other buffer's data writing by the tool at the moment.
> 
> Thus serial trace writing implementation may cause the kernel
> to loose profiling data and that is what observed when profiling
> highly parallel CPU bound workloads on machines with big number
> of cores.
> 
> Experiment with profiling matrix multiplication code executing 128
> threads on Intel Xeon Phi (KNM) with 272 cores, like below,
> demonstrates data loss metrics value of 98%:
> 
> /usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
>     --call-graph dwarf,1024 --user-regs=IP,SP,BP \
>     --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
>     matrix.gcc
> 
> Data loss metrics is the ratio lost_time/elapsed_time where
> lost_time is the sum of time intervals containing PERF_RECORD_LOST
> records and elapsed_time is the elapsed application run time
> under profiling.
> 
> Applying asynchronous trace streaming thru Posix AIO API
> (http://man7.org/linux/man-pages/man7/aio.7.html)
> lowers data loss metrics value providing 2x improvement -
> lowering 98% loss to almost 0%.
> 
> ---
> Alexey Budankov (3):
>   perf util: map data buffer for preserving collected data
>   perf record: enable asynchronous trace writing
>   perf record: extend trace writing to multi AIO
> 
>  tools/perf/builtin-record.c | 166 ++--
>  tools/perf/perf.h           |   1 +
>  tools/perf/util/evlist.c    |   7 +-
>  tools/perf/util/evlist.h    |   3 +-
>  tools/perf/util/mmap.c      | 114 ++
>  tools/perf/util/mmap.h      |  11 ++-
>  6 files changed, 277 insertions(+), 25 deletions(-)

The whole thing for git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
perf/core repository follows:

 tools/perf/builtin-record.c | 165 ++--
 tools/perf/perf.h           |   1 +
 tools/perf/util/evlist.c    |   7 +-
 tools/perf/util/evlist.h    |   3 +-
 tools/perf/util/mmap.c      | 114 ++
 tools/perf/util/mmap.h      |  11 ++-
 6 files changed, 276 insertions(+), 25 deletions(-)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 9853552bcf16..7bb7947072e5 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -121,6 +121,112 @@ static int record__write(struct record *rec, void *bf, size_t size)
 	return 0;
 }
 
+static int record__aio_write(struct aiocb *cblock, int trace_fd,
+		void *buf, size_t size, off_t off)
+{
+	int rc;
+
+	cblock->aio_fildes = trace_fd;
+	cblock->aio_buf    = buf;
+	cblock->aio_nbytes = size;
+	cblock->aio_offset = off;
+	cblock->aio_sigevent.sigev_notify = SIGEV_NONE;
+
+	do {
+		rc = aio_write(cblock);
+		if (rc == 0) {
+			break;
+		} else if (errno != EAGAIN) {
+			cblock->aio_fildes = -1;
+			pr_err("failed to queue perf data, error: %m\n");
+			break;
+		}
+	} while (1);
+
+	return rc;
+}
+
+static int record__aio_complete(struct perf_mmap *md, struct aiocb *cblock)
+{
+	void *rem_buf;
+	off_t rem_off;
+	size_t rem_size;
+	int rc, aio_errno;
+	ssize_t aio_ret, written;
+
+	aio_errno = aio_error(cblock);
+	if (aio_errno == EINPROGRESS)
+		return 0;
+
+	written = aio_ret = aio_return(cblock);
+	if (aio_ret < 0) {
+		if (!(aio_errno == EINTR))
+			pr_err("failed to write perf data, error: %m\n");
+		written = 0;
+	}
+
+	rem_size = cblock->aio_nbytes - written;
+
+	if (rem_size == 0) {
+		cblock->aio_fildes = -1;
+		/*
+		 * md->refcount is incremented in perf_mmap__push() for
+		 * every enqueued aio write request so decrement it because
+		 * the request is now complete.
+		 */
+		perf_mmap__put(md);
+		rc = 1;
+	} else {
+		/*
+		 * aio write request may require restart with the
+		 * reminder if the kernel didn't write whole
+		 * chunk at once.
+		 */
+		rem_off = cblock->aio_offset + written;
+		rem_buf = (void *)(cblock->aio_buf + written);
+		record__aio_write(cblock, cblock->aio_fildes,
+				  re
[PATCH v8 0/3]: perf: reduce data loss when profiling highly parallel CPU bound workloads
Currently in record mode the tool implements trace writing serially.
The algorithm loops over mapped per-cpu data buffers and stores
ready data chunks into a trace file using the write() system call.

Under some circumstances the kernel may lack free space in a buffer
because the other half of the buffer is not yet written to disk while
the tool is busy writing some other buffer's data at the moment.

Thus the serial trace writing implementation may cause the kernel
to lose profiling data, and that is what is observed when profiling
highly parallel CPU bound workloads on machines with a large number
of cores.

An experiment profiling matrix multiplication code executing 128
threads on Intel Xeon Phi (KNM) with 272 cores, like below,
demonstrates a data loss metric value of 98%:

/usr/bin/time perf record -o /tmp/perf-ser.data -a -N -B -T -R -g \
    --call-graph dwarf,1024 --user-regs=IP,SP,BP \
    --switch-events -e cycles,instructions,ref-cycles,software/period=1,name=cs,config=0x3/Duk -- \
    matrix.gcc

The data loss metric is the ratio lost_time/elapsed_time, where
lost_time is the sum of the time intervals containing PERF_RECORD_LOST
records and elapsed_time is the elapsed application run time
under profiling.

Applying asynchronous trace streaming through the POSIX AIO API
(http://man7.org/linux/man-pages/man7/aio.7.html)
lowers the data loss metric value, providing a 2x improvement and
lowering the 98% loss to almost 0%.

---
Alexey Budankov (3):
  perf util: map data buffer for preserving collected data
  perf record: enable asynchronous trace writing
  perf record: extend trace writing to multi AIO

 tools/perf/builtin-record.c | 166 ++--
 tools/perf/perf.h           |   1 +
 tools/perf/util/evlist.c    |   7 +-
 tools/perf/util/evlist.h    |   3 +-
 tools/perf/util/mmap.c      | 114 ++
 tools/perf/util/mmap.h      |  11 ++-
 6 files changed, 277 insertions(+), 25 deletions(-)

---
Changes in v8:
- ran the whole thing through checkpatch.pl and corrected the reported
  issues, except lines longer than 80 symbols
- corrected comment alignment and formatting
- moved the multi AIO implementation into the 3rd patch in the series
- implemented explicit cblocks array allocation
- split the AIO completion check into a separate record__aio_complete()
- set the nr_cblocks default to 1 and the maximum allowed value to 4

Changes in v7:
- implemented handling of the record.aio setting from the perfconfig file

Changes in v6:
- adjusted setting of priorities for cblocks
- handled the errno == EAGAIN case from the aio_write() return

Changes in v5:
- resolved a livelock on perf record -e intel_pt// -- dd if=/dev/zero of=/dev/null count=10
- data loss metric decreased from 25% to 2x in the trialed configuration
- reshaped the layout of data structures
- implemented the --aio option
- avoided nanosleep() prior to calling aio_suspend()
- switched to per-cpu aio multi buffer record__aio_sync()
- record_mmap_read_sync() now does a global sync just before switching
  the trace file or stopping collection

Changes in v4:
- converted mmap()/munmap() to malloc()/free() for mmap->data buffer management
- converted void *bf to struct perf_mmap *md in signatures
- wrote a comment in perf_mmap__push() just before perf_mmap__get()
- wrote a comment in record__mmap_read_sync() on possible restarting of the
  aio_write() operation and releasing the perf_mmap object after all
- added perf_mmap__put() for the cases of failed aio_write()

Changes in v3:
- wrote comments about the nanosleep(0.5ms) call prior to aio_suspend()
  to cope with the intrusiveness of its implementation in glibc
- wrote comments about the rationale behind copying profiling data into
  the mmap->data buffer

Changes in v2:
- converted zalloc() to calloc() for allocation of the mmap_aio array
- fixed a typo and adjusted the fallback branch code
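As a worked example of the data loss metric defined above, the following sketch sums the lengths of the sampling intervals that contained PERF_RECORD_LOST records and divides by the elapsed time; the intervals are synthetic numbers for illustration, not tool output.

#include <stdio.h>

struct interval {
	double start, end;  /* seconds from profiling start */
	int    has_lost;    /* interval contained a PERF_RECORD_LOST record */
};

int main(void)
{
	struct interval iv[] = {
		{ 0.0, 1.0, 1 },
		{ 1.0, 2.0, 0 },
		{ 2.0, 3.0, 1 },
	};
	double lost_time = 0.0, elapsed_time = 3.0;

	for (unsigned i = 0; i < sizeof(iv) / sizeof(iv[0]); i++)
		if (iv[i].has_lost)
			lost_time += iv[i].end - iv[i].start;

	/* e.g. 2.0 / 3.0, about 67% data loss, for this synthetic run */
	printf("data loss: %.1f%%\n", 100.0 * lost_time / elapsed_time);
	return 0;
}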