On Fri, Mar 19, 2021 at 9:22 AM Song Liu <songliubrav...@fb.com> wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.m...@gmail.com> wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jo...@redhat.com> wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> >>>
> >>>
> >>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <a...@kernel.org> wrote:
> >>>>
> >>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >>>>> Hi Song,
> >>>>>
> >>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubrav...@fb.com> wrote:
> >>>>>>
> >>>>>> perf uses performance monitoring counters (PMCs) to monitor system
> >>>>>> performance. The PMCs are limited hardware resources. For example,
> >>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>>>>
> >>>>>> Modern data center systems use these PMCs in many different ways:
> >>>>>> system level monitoring, (maybe nested) container level monitoring,
> >>>>>> per process monitoring, profiling (in sample mode), etc. In some
> >>>>>> cases, there are more active perf_events than available hardware
> >>>>>> PMCs. To allow all perf_events to have a chance to run, it is
> >>>>>> necessary to do expensive time multiplexing of events.
> >>>>>>
> >>>>>> On the other hand, many monitoring tools count the common metrics
> >>>>>> (cycles, instructions). It is a waste to have multiple tools create
> >>>>>> multiple perf_events of "cycles" and occupy multiple PMCs.
> >>>>>
> >>>>> Right, it'd be really helpful when the PMCs are frequently or mostly
> >>>>> shared. But it'd also increase the overhead for uncontended cases, as
> >>>>> BPF programs need to run on every context switch. Depending on the
> >>>>> workload, it may cause a non-negligible performance impact. So users
> >>>>> should be aware of it.
> >>>>
> >>>> Would be interesting to, humm, measure both cases to have a firm number
> >>>> of the impact, how many instructions are added when sharing using
> >>>> --bpf-counters?
> >>>>
> >>>> I.e. compare the "expensive time multiplexing of events" with its
> >>>> avoidance by using --bpf-counters.
> >>>>
> >>>> Song, have you performed such measurements?
> >>>
> >>> I have some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 50000 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>> Total time: 10X.XXX [sec]
> >>>
> >>> I use the "Total time" as the measurement, so a smaller number is better.
> >>>
> >>> For each condition, I ran the command 5 times and took the median of
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat)               104.873 [sec]
> >>> # global
> >>> perf stat -a                          107.887 [sec]
> >>> perf stat -a --bpf-counters           106.071 [sec]
> >>> # per task
> >>> perf stat                             106.314 [sec]
> >>> perf stat --bpf-counters              105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5                    107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters     106.406 [sec]
> >>
> >> I can't see why it's actually faster than normal perf ;-)
> >> would be worth to find out
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing, while --bpf-counters
> doesn't need it.
Yep, so for uncontended cases, normal perf should be about the same as the
baseline (and faster than bperf). But for contended cases, bperf works
faster.

Thanks,
Namhyung
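
For reference, a sketch of how the contended comparison above could be
reproduced (the event list and workload are taken from Song's measurements;
a perf binary built with BPF skeleton support, e.g. BUILD_BPF_SKEL=1, is
assumed for the --bpf-counters run):

  # contended case: duplicated hardware events counted system-wide with
  # normal perf, so the kernel has to time-multiplex them on the PMCs
  perf stat -a -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles \
      -- ./perf bench sched messaging -g 40 -l 50000 -t

  # same events and workload, but shared through BPF-managed counters
  # (bperf), which avoids the multiplexing
  perf stat -a --bpf-counters \
      -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles \
      -- ./perf bench sched messaging -g 40 -l 50000 -t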