On Fri, Mar 19, 2021 at 9:22 AM Song Liu <songliubrav...@fb.com> wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.m...@gmail.com> wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jo...@redhat.com> wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> >>>
> >>>
> >>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <a...@kernel.org> wrote:
> >>>>
> >>>>> On Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim wrote:
> >>>>> Hi Song,
> >>>>>
> >>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubrav...@fb.com> wrote:
> >>>>>>
> >>>>>> perf uses performance monitoring counters (PMCs) to monitor system
> >>>>>> performance. The PMCs are limited hardware resources. For example,
> >>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>>>>
> >>>>>> Modern data center systems use these PMCs in many different ways:
> >>>>>> system level monitoring, (maybe nested) container level monitoring,
> >>>>>> per process monitoring, profiling (in sample mode), etc. In some
> >>>>>> cases, there are more active perf_events than available hardware
> >>>>>> PMCs. To allow all perf_events to have a chance to run, it is
> >>>>>> necessary to do expensive time multiplexing of events.
> >>>>>>
> >>>>>> On the other hand, many monitoring tools count the common metrics
> >>>>>> (cycles, instructions). It is a waste to have multiple tools create
> >>>>>> multiple perf_events of "cycles" and occupy multiple PMCs.
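
(Side note: the multiplexing is visible directly in normal perf stat
output.  When the requested events cannot all be scheduled on hardware
counters at the same time, perf stat annotates each count with the
fraction of time the event was actually counted.  A rough sketch with
made-up numbers, using the duplicated-event list from Song's
measurements below against an arbitrary target:

  $ perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles -a sleep 1
  ...
     2,000,000,000      cycles          (66.6%)
  ...

With --bpf-counters, the duplicated events are backed by one shared set
of perf_events managed by BPF programs, so the hardware counters do not
have to be time multiplexed between them.)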
> >>>>>
> >>>>> Right, it'd be really helpful when the PMCs are frequently or mostly
> >>>>> shared.  But it'd also increase the overhead for uncontended cases as
> >>>>> BPF programs need to run on every context switch.  Depending on the
> >>>>> workload, it may cause a non-negligible performance impact.  So users
> >>>>> should be aware of it.
> >>>>
> >>>> Would be interesting to, humm, measure both cases to have a firm
> >>>> number of the impact, how many instructions are added when sharing
> >>>> using --bpf-counters?
> >>>>
> >>>> I.e. compare the "expensive time multiplexing of events" with its
> >>>> avoidance by using --bpf-counters.
> >>>>
> >>> Song, have you performed such measurements?
> >>>
> >>> I have got some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 50000 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>>     Total time: 10X.XXX [sec]
> >>>
> >>>
> >>> I use the "Total time" as the measurement, so a smaller number is better.
> >>>
> >>> For each condition, I ran the command 5 times and took the median of
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat)              104.873 [sec]
> >>> # global
> >>> perf stat -a                         107.887 [sec]
> >>> perf stat -a --bpf-counters          106.071 [sec]
> >>> # per task
> >>> perf stat                            106.314 [sec]
> >>> perf stat --bpf-counters             105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5                   107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
> >>
> >> I can't see why it's actually faster than normal perf ;-)
> >> would be worth finding out
> >
> > Isn't this all about contended cases?
>
> Yeah, normal perf is doing time multiplexing, while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be about the same as
the baseline (and faster than bperf).  But for contended cases, bperf
is faster.
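
Just to make that comparison concrete, a rough sketch reusing the event
list and benchmark from Song's measurements above (the duplicated
hardware events are what create the contended case):

  # contended: duplicated events, normal perf has to time multiplex
  perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles \
      -- ./perf bench sched messaging -g 40 -l 50000 -t

  # same events, counted through shared BPF-managed perf_events
  perf stat --bpf-counters \
      -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles \
      -- ./perf bench sched messaging -g 40 -l 50000 -t

For the uncontended case, one would drop the duplicates (e.g. just
cycles,instructions), where plain perf stat should stay close to the
baseline.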

Thanks,
Namhyung
