Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-23 Thread Song Liu



> On Mar 23, 2021, at 2:10 PM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Fri, Mar 19, 2021 at 04:14:42PM +, Song Liu escreveu:
>>> On Mar 19, 2021, at 8:58 AM, Namhyung Kim  wrote:
>>> On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo  
>>> wrote:
 Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
>>> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
>>> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
>>> wrote:
 On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> perf stat -C 1,3,5  107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
> 
 I can't see why it's actually faster than normal perf ;-)
 would be worth to find out
> 
>>> Isn't this all about contended cases?
> 
>> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
>> doesn't need it.
> 
> Yep, so for uncontended cases, normal perf should be the same as the
> baseline (faster than the bperf).  But for contended cases, the bperf
> works faster.
> 
 The difference should be small enough that for people that use this in a
 machine where contention happens most of the time, setting a
 ~/.perfconfig to use it by default should be advantageous, i.e. no need
 to use --bpf-counters on the command line all the time.
> 
 So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
 a look again now but I want to have this merged on perf/core so that I
 can work on a new BPF SKEL to use this:
> 
>>> I have a concern for the per cpu target, but it can be done later, so
> 
>>> Acked-by: Namhyung Kim 
> 
 https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable
> 
>>> Interesting!  Actually I was thinking about the similar too. :)
>> 
>> Hi Namhyung, Jiri, and Arnaldo,
>> 
>> Thanks a lot for your kind review. 
>> 
>> Here is updated 3/3, where we use perf-bench instead of stressapptest.
> 
> I had to apply this updated 3/3 manually, as there was some munging; it's
> all now at:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core
> 
> Please take a look at the "Committer testing" section I added to the
> main patch, introducing bperf:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=tmp.perf/core&id=7fac83aaf2eecc9e7e7b72da694c49bb4ce7fdfc
> 
> And check if I made any mistake or if something else could be added.
> 
> It'll move to perf/core after my set of automated tests finishes.

Thanks Arnaldo! Looks great!

Song




Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-23 Thread Arnaldo Carvalho de Melo
Em Fri, Mar 19, 2021 at 04:14:42PM +, Song Liu escreveu:
> > On Mar 19, 2021, at 8:58 AM, Namhyung Kim  wrote:
> > On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo  
> > wrote:
> >> Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> >>> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
> > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
> > wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> >>> perf stat -C 1,3,5  107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]

> >> I can't see why it's actually faster than normal perf ;-)
> >> would be worth to find out

> > Isn't this all about contended cases?

>  Yeah, the normal perf is doing time multiplexing; while --bpf-counters
>  doesn't need it.

> >>> Yep, so for uncontended cases, normal perf should be the same as the
> >>> baseline (faster than the bperf).  But for contended cases, the bperf
> >>> works faster.

> >> The difference should be small enough that for people that use this in a
> >> machine where contention happens most of the time, setting a
> >> ~/.perfconfig to use it by default should be advantageous, i.e. no need
> >> to use --bpf-counters on the command line all the time.

> >> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
> >> a look again now but I want to have this merged on perf/core so that I
> >> can work on a new BPF SKEL to use this:

> > I have a concern for the per cpu target, but it can be done later, so

> > Acked-by: Namhyung Kim 

> >> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

> > Interesting!  Actually I was thinking about the similar too. :)
> 
> Hi Namhyung, Jiri, and Arnaldo,
> 
> Thanks a lot for your kind review. 
> 
> Here is updated 3/3, where we use perf-bench instead of stressapptest.

I had to apply this updated 3/3 manually, as there was some munging; it's
all now at:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core

Please take a look at the "Committer testing" section I added to the
main patch, introducing bperf:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=tmp.perf/core&id=7fac83aaf2eecc9e7e7b72da694c49bb4ce7fdfc

And check if I made any mistake or if something else could be added.

It'll move to perf/core after my set of automated tests finishes.

- Arnaldo


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-19 Thread Song Liu



> On Mar 19, 2021, at 8:58 AM, Namhyung Kim  wrote:
> 
> Hi Arnaldo,
> 
> On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo
>  wrote:
>> 
>> Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
>>> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
> wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>>> perf stat -C 1,3,5  107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>> 
>> I can't see why it's actually faster than normal perf ;-)
>> would be worth to find out
>> 
> Isn't this all about contended cases?
>> 
 Yeah, the normal perf is doing time multiplexing; while --bpf-counters
 doesn't need it.
>> 
>>> Yep, so for uncontended cases, normal perf should be the same as the
>>> baseline (faster than the bperf).  But for contended cases, the bperf
>>> works faster.
>> 
>> The difference should be small enough that for people that use this in a
>> machine where contention happens most of the time, setting a
>> ~/.perfconfig to use it by default should be advantageous, i.e. no need
>> to use --bpf-counters on the command line all the time.
>> 
>> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
>> a look again now but I want to have this merged on perf/core so that I
>> can work on a new BPF SKEL to use this:
> 
> I have a concern for the per cpu target, but it can be done later, so
> 
> Acked-by: Namhyung Kim 
> 
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable
> 
> Interesting!  Actually I was thinking about the similar too. :)

Hi Namhyung, Jiri, and Arnaldo,

Thanks a lot for your kind review. 

Here is updated 3/3, where we use perf-bench instead of stressapptest.

Thanks,
Song


From cc79d161be9c9d24198f7e35b50058a6e15076fd Mon Sep 17 00:00:00 2001
From: Song Liu 
Date: Tue, 16 Mar 2021 00:19:53 -0700
Subject: [PATCH v3 3/3] perf-test: add a test for perf-stat --bpf-counters
 option

Add a test to compare the output of perf-stat with and without the
--bpf-counters option. If the difference is more than 10%, the test is
considered failed.

Signed-off-by: Song Liu 
---
 tools/perf/tests/shell/stat_bpf_counters.sh | 31 +
 1 file changed, 31 insertions(+)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh

diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh b/tools/perf/tests/shell/stat_bpf_counters.sh
new file mode 100755
index 0..7aabf177ce8d1
--- /dev/null
+++ b/tools/perf/tests/shell/stat_bpf_counters.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+# perf stat --bpf-counters test
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+# check whether $2 is within +/- 10% of $1
+compare_number()
+{
+   first_num=$1
+   second_num=$2
+
+   # upper bound is first_num * 110%
+   upper=$(( $first_num + $first_num / 10 ))
+   # lower bound is first_num * 90%
+   lower=$(( $first_num - $first_num / 10 ))
+
+   if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
+   echo "The difference between $first_num and $second_num are greater than 10%."
+   exit 1
+   fi
+}
+
+# skip if --bpf-counters is not supported
+perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
+
+base_cycles=$(perf stat --no-big-num -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+
+compare_number $base_cycles $bpf_cycles
+exit 0
--
2.30.2
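
(For completeness, a sketch of how the new shell test can be invoked once the
series is applied; 'perf test' matches shell tests by their description line,
and BUILD_BPF_SKEL=1 may be needed to build in BPF counter support, depending
on the tree.)

  cd tools/perf
  make BUILD_BPF_SKEL=1
  ./perf test -v 'perf stat --bpf-counters test'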




Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-19 Thread Namhyung Kim
Hi Arnaldo,

On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo
 wrote:
>
> Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> > On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
> > > > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> > > > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
> > > > wrote:
> > > >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> > > >>> perf stat -C 1,3,5  107.063 [sec]
> > > >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>
> > >> I can't see why it's actually faster than normal perf ;-)
> > > >> would be worth to find out
>
> > > > Isn't this all about contended cases?
>
> > > Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> > > doesn't need it.
>
> > Yep, so for uncontended cases, normal perf should be the same as the
> > baseline (faster than the bperf).  But for contended cases, the bperf
> > works faster.
>
> The difference should be small enough that for people that use this in a
> machine where contention happens most of the time, setting a
> ~/.perfconfig to use it by default should be advantageous, i.e. no need
> to use --bpf-counters on the command line all the time.
>
> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
> a look again now but I want to have this merged on perf/core so that I
> can work on a new BPF SKEL to use this:

I have a concern for the per cpu target, but it can be done later, so

Acked-by: Namhyung Kim 

>
> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

Interesting!  Actually I was thinking about the similar too. :)

Thanks,
Namhyung


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-19 Thread Arnaldo Carvalho de Melo
Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
> > > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> > > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  
> > > wrote:
> > >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> > >>> perf stat -C 1,3,5  107.063 [sec]
> > >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]

> > >> I can't see why it's actually faster than normal perf ;-)
> > >> would be worth to find out

> > > Isn't this all about contended cases?

> > Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> > doesn't need it.

> Yep, so for uncontended cases, normal perf should be the same as the
> baseline (faster than the bperf).  But for contended cases, the bperf
> works faster.

The difference should be small enough that for people that use this in a
machine where contention happens most of the time, setting a
~/.perfconfig to use it by default should be advantageous, i.e. no need
to use --bpf-counters on the command line all the time.
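
(A sketch of what such a ~/.perfconfig default could look like; the
stat.bpf-counter-events key shown below is an assumption for illustration; it
is not introduced by this v2 series, so treat the exact key name as
hypothetical.)

  # hypothetical ~/.perfconfig entry (key name is an assumption, not part
  # of this series): use BPF-shared counters by default for these events
  [stat]
          bpf-counter-events = cycles,instructions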

So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
a look again now but I want to have this merged on perf/core so that I
can work on a new BPF SKEL to use this:

https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

:-)

- Arnaldo


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Namhyung Kim
On Fri, Mar 19, 2021 at 9:22 AM Song Liu  wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> >>>
> >>>
>  On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo
> >>  wrote:
> 
>  Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> > Hi Song,
> >
> > On Wed, Mar 17, 2021 at 6:18 AM Song Liu 
> >> wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor
> >> system
> >> performance. The PMCs are limited hardware resources. For
> >> example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level
> >> monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some
> >> cases,
> >> there are more active perf_events than available hardware PMCs.
> >> To allow
> >> all perf_events to have a chance to run, it is necessary to do
> >> expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics
> >> (cycles,
> >> instructions). It is a waste to have multiple tools create
> >> multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >
> > Right, it'd be really helpful when the PMCs are frequently or
> >> mostly shared.
> > But it'd also increase the overhead for uncontended cases as BPF
> >> programs
> > need to run on every context switch.  Depending on the workload,
> >> it may
> > cause a non-negligible performance impact.  So users should be
> >> aware of it.
> 
>  Would be interesting to, humm, measure both cases to have a firm
> >> number
>  of the impact, how many instructions are added when sharing using
>  --bpf-counters?
> 
>  I.e. compare the "expensive time multiplexing of events" with its
>  avoidance by using --bpf-counters.
> 
 Song, have you performed such measurements?
> >>>
> >>> I have got some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 5 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>> Total time: 10X.XXX [sec]
> >>>
> >>>
> >>> I use the "Total time" as measurement, so smaller number is better.
> >>>
> >>> For each condition, I run the command 5 times, and took the median of
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat)              104.873 [sec]
> >>> # global
> >>> perf stat -a                         107.887 [sec]
> >>> perf stat -a --bpf-counters          106.071 [sec]
> >>> # per task
> >>> perf stat                            106.314 [sec]
> >>> perf stat --bpf-counters             105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5                   107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
> >>
> >> I can't see why it's actually faster than normal perf ;-)
> >> would be worth to find out
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be the same as the
baseline (faster than the bperf).  But for contended cases, the bperf
works faster.

Thanks,
Namhyung


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 18, 2021, at 5:09 PM, Arnaldo  wrote:
> 
> 
> 
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>>> 
>>> 
 On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo
>>  wrote:
 
 Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> Hi Song,
> 
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu 
>> wrote:
>> 
>> perf uses performance monitoring counters (PMCs) to monitor
>> system
>> performance. The PMCs are limited hardware resources. For
>> example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level
>> monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some
>> cases,
>> there are more active perf_events than available hardware PMCs.
>> To allow
>> all perf_events to have a chance to run, it is necessary to do
>> expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics
>> (cycles,
>> instructions). It is a waste to have multiple tools create
>> multiple
>> perf_events of "cycles" and occupy multiple PMCs.
> 
> Right, it'd be really helpful when the PMCs are frequently or
>> mostly shared.
> But it'd also increase the overhead for uncontended cases as BPF
>> programs
> need to run on every context switch.  Depending on the workload,
>> it may
> cause a non-negligible performance impact.  So users should be
>> aware of it.
 
 Would be interesting to, humm, measure both cases to have a firm
>> number
 of the impact, how many instructions are added when sharing using
 --bpf-counters?
 
 I.e. compare the "expensive time multiplexing of events" with its
 avoidance by using --bpf-counters.
 
 Song, have you performed such measurements?
>>> 
>>> I have got some measurements with perf-bench-sched-messaging:
>>> 
>>> The system: x86_64 with 23 cores (46 HT)
>>> 
>>> The perf-stat command:
>>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
>>> 
>>> The benchmark command and output:
>>> ./perf bench sched messaging -g 40 -l 5 -t
>>> # Running 'sched/messaging' benchmark:
>>> # 20 sender and receiver threads per group
>>> # 40 groups == 1600 threads run
>>> Total time: 10X.XXX [sec]
>>> 
>>> 
>>> I use the "Total time" as measurement, so smaller number is better. 
>>> 
>>> For each condition, I run the command 5 times, and took the median of
>>> "Total time".
>>> 
>>> Baseline (no perf-stat)              104.873 [sec]
>>> # global
>>> perf stat -a                         107.887 [sec]
>>> perf stat -a --bpf-counters          106.071 [sec]
>>> # per task
>>> perf stat                            106.314 [sec]
>>> perf stat --bpf-counters             105.965 [sec]
>>> # per cpu
>>> perf stat -C 1,3,5                   107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
>> 
>> I can't see why it's actually faster than normal perf ;-)
>> would be worth to find out
> 
> Isn't this all about contended cases?

Yeah, the normal perf is doing time multiplexing; while --bpf-counters 
doesn't need it. 

Thanks,
Song



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Arnaldo



On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa  wrote:
>On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
>> 
>> 
>> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo
> wrote:
>> > 
>> > Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>> >> Hi Song,
>> >> 
>> >> On Wed, Mar 17, 2021 at 6:18 AM Song Liu 
>wrote:
>> >>> 
>> >>> perf uses performance monitoring counters (PMCs) to monitor
>system
>> >>> performance. The PMCs are limited hardware resources. For
>example,
>> >>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> >>> 
>> >>> Modern data center systems use these PMCs in many different ways:
>> >>> system level monitoring, (maybe nested) container level
>monitoring, per
>> >>> process monitoring, profiling (in sample mode), etc. In some
>cases,
>> >>> there are more active perf_events than available hardware PMCs.
>To allow
>> >>> all perf_events to have a chance to run, it is necessary to do
>expensive
>> >>> time multiplexing of events.
>> >>> 
>> >>> On the other hand, many monitoring tools count the common metrics
>(cycles,
>> >>> instructions). It is a waste to have multiple tools create
>multiple
>> >>> perf_events of "cycles" and occupy multiple PMCs.
>> >> 
>> >> Right, it'd be really helpful when the PMCs are frequently or
>mostly shared.
>> >> But it'd also increase the overhead for uncontended cases as BPF
>programs
>> >> need to run on every context switch.  Depending on the workload,
>it may
>> >> cause a non-negligible performance impact.  So users should be
>aware of it.
>> > 
>> > Would be interesting to, humm, measure both cases to have a firm
>number
>> > of the impact, how many instructions are added when sharing using
>> > --bpf-counters?
>> > 
>> > I.e. compare the "expensive time multiplexing of events" with its
>> > avoidance by using --bpf-counters.
>> > 
>> > Song, have you performed such measurements?
>> 
>> I have got some measurements with perf-bench-sched-messaging:
>> 
>> The system: x86_64 with 23 cores (46 HT)
>> 
>> The perf-stat command:
>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
>> 
>> The benchmark command and output:
>> ./perf bench sched messaging -g 40 -l 5 -t
>> # Running 'sched/messaging' benchmark:
>> # 20 sender and receiver threads per group
>> # 40 groups == 1600 threads run
>>  Total time: 10X.XXX [sec]
>> 
>> 
>> I use the "Total time" as measurement, so smaller number is better. 
>> 
>> For each condition, I run the command 5 times, and took the median of
>> "Total time".
>> 
>> Baseline (no perf-stat)              104.873 [sec]
>> # global
>> perf stat -a                         107.887 [sec]
>> perf stat -a --bpf-counters          106.071 [sec]
>> # per task
>> perf stat                            106.314 [sec]
>> perf stat --bpf-counters             105.965 [sec]
>> # per cpu
>> perf stat -C 1,3,5                   107.063 [sec]
>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
>
>I can't see why it's actually faster than normal perf ;-)
>would be worth to find out

Isn't this all about contended cases?

>
>jirka
>
>> 
>> From the data, --bpf-counters is slightly better than the regular
>event
>> for all targets. I noticed that the results are not very stable.
>There 
>> are a couple 108.xx runs in some of the conditions (w/ and w/o 
>> --bpf-counters).
>> 
>> 
>> I also measured the average runtime of the BPF programs, with 
>> 
>>  sysctl kernel.bpf_stats_enabled=1
>> 
>> For each event, if we have one leader and two followers, the total
>run 
>> time is about 340ns. IOW, 340ns for two perf-stat reading
>instructions, 
>> 340ns for two perf-stat reading cycles, etc. 
>> 
>> Thanks,
>> Song
>> 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Jiri Olsa
On Thu, Mar 18, 2021 at 03:52:51AM +, Song Liu wrote:
> 
> 
> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  
> > wrote:
> > 
> > Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >> Hi Song,
> >> 
> >> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> >>> 
> >>> perf uses performance monitoring counters (PMCs) to monitor system
> >>> performance. The PMCs are limited hardware resources. For example,
> >>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>> 
> >>> Modern data center systems use these PMCs in many different ways:
> >>> system level monitoring, (maybe nested) container level monitoring, per
> >>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>> there are more active perf_events than available hardware PMCs. To allow
> >>> all perf_events to have a chance to run, it is necessary to do expensive
> >>> time multiplexing of events.
> >>> 
> >>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>> instructions). It is a waste to have multiple tools create multiple
> >>> perf_events of "cycles" and occupy multiple PMCs.
> >> 
> >> Right, it'd be really helpful when the PMCs are frequently or mostly 
> >> shared.
> >> But it'd also increase the overhead for uncontended cases as BPF programs
> >> need to run on every context switch.  Depending on the workload, it may
> >> cause a non-negligible performance impact.  So users should be aware of it.
> > 
> > Would be interesting to, humm, measure both cases to have a firm number
> > of the impact, how many instructions are added when sharing using
> > --bpf-counters?
> > 
> > I.e. compare the "expensive time multiplexing of events" with its
> > avoidance by using --bpf-counters.
> > 
> > Song, have you performed such measurements?
> 
> I have got some measurements with perf-bench-sched-messaging:
> 
> The system: x86_64 with 23 cores (46 HT)
> 
> The perf-stat command:
> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
> 
> 
> The benchmark command and output:
> ./perf bench sched messaging -g 40 -l 5 -t
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 40 groups == 1600 threads run
>  Total time: 10X.XXX [sec]
> 
> 
> I use the "Total time" as measurement, so smaller number is better. 
> 
> For each condition, I run the command 5 times, and took the median of 
> "Total time". 
> 
> Baseline (no perf-stat)              104.873 [sec]
> # global
> perf stat -a                         107.887 [sec]
> perf stat -a --bpf-counters          106.071 [sec]
> # per task
> perf stat                            106.314 [sec]
> perf stat --bpf-counters             105.965 [sec]
> # per cpu
> perf stat -C 1,3,5                   107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]

I can't see why it's actually faster than normal perf ;-)
would be worth to find out

jirka

> 
> From the data, --bpf-counters is slightly better than the regular event
> for all targets. I noticed that the results are not very stable. There 
> are a couple 108.xx runs in some of the conditions (w/ and w/o 
> --bpf-counters).
> 
> 
> I also measured the average runtime of the BPF programs, with 
> 
>   sysctl kernel.bpf_stats_enabled=1
> 
> For each event, if we have one leader and two followers, the total run 
> time is about 340ns. IOW, 340ns for two perf-stat reading instructions, 
> 340ns for two perf-stat reading cycles, etc. 
> 
> Thanks,
> Song
> 



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-18 Thread Song Liu



> On Mar 17, 2021, at 9:32 PM, Namhyung Kim  wrote:
> 
> On Thu, Mar 18, 2021 at 12:52 PM Song Liu  wrote:
>> 
>> 
>> 
>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  
>>> wrote:
>>> 
>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
 Hi Song,
 
 On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> 
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> 
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
> 
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.
 
 Right, it'd be really helpful when the PMCs are frequently or mostly 
 shared.
 But it'd also increase the overhead for uncontended cases as BPF programs
 need to run on every context switch.  Depending on the workload, it may
 cause a non-negligible performance impact.  So users should be aware of it.
>>> 
>>> Would be interesting to, humm, measure both cases to have a firm number
>>> of the impact, how many instructions are added when sharing using
>>> --bpf-counters?
>>> 
>>> I.e. compare the "expensive time multiplexing of events" with its
>>> avoidance by using --bpf-counters.
>>> 
>>> Song, have you performed such measurements?
>> 
>> I have got some measurements with perf-bench-sched-messaging:
>> 
>> The system: x86_64 with 23 cores (46 HT)
>> 
>> The perf-stat command:
>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
>> 
>> 
>> The benchmark command and output:
>> ./perf bench sched messaging -g 40 -l 5 -t
>> # Running 'sched/messaging' benchmark:
>> # 20 sender and receiver threads per group
>> # 40 groups == 1600 threads run
>> Total time: 10X.XXX [sec]
>> 
>> 
>> I use the "Total time" as measurement, so smaller number is better.
>> 
>> For each condition, I run the command 5 times, and took the median of
>> "Total time".
>> 
>> Baseline (no perf-stat)              104.873 [sec]
>> # global
>> perf stat -a                         107.887 [sec]
>> perf stat -a --bpf-counters          106.071 [sec]
>> # per task
>> perf stat                            106.314 [sec]
>> perf stat --bpf-counters             105.965 [sec]
>> # per cpu
>> perf stat -C 1,3,5                   107.063 [sec]
>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
>> 
>> From the data, --bpf-counters is slightly better than the regular event
>> for all targets. I noticed that the results are not very stable. There
>> are a couple 108.xx runs in some of the conditions (w/ and w/o
>> --bpf-counters).
> 
> Hmm.. so this result is when multiplexing happened, right?
> I wondered how/why the regular perf stat is slower..

I should have made this clearer. This is the case where regular perf-stat does
time multiplexing (2x ref-cycles on Intel). OTOH, bpf-counters enables
sharing, so there is no time multiplexing. IOW, this compares the overhead of
BPF against the overhead of time multiplexing.
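
(A quick way to see just that effect, as a sketch: ref-cycles on Intel can only
run on one fixed counter, so asking for it twice forces multiplexing unless the
two events are shared; the commands below are illustrative.)

  # plain run: the two ref-cycles events take turns on the single fixed
  # counter, and perf stat reports a below-100% enabled/running ratio
  perf stat -e ref-cycles,ref-cycles -a -- sleep 2

  # with --bpf-counters both logical events read one shared perf_event,
  # so no multiplexing is needed
  perf stat --bpf-counters -e ref-cycles,ref-cycles -a -- sleep 2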

Thanks,
Song

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Namhyung Kim
On Thu, Mar 18, 2021 at 12:52 PM Song Liu  wrote:
>
>
>
> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  
> > wrote:
> >
> > Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >> Hi Song,
> >>
> >> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> >>>
> >>> perf uses performance monitoring counters (PMCs) to monitor system
> >>> performance. The PMCs are limited hardware resources. For example,
> >>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>
> >>> Modern data center systems use these PMCs in many different ways:
> >>> system level monitoring, (maybe nested) container level monitoring, per
> >>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>> there are more active perf_events than available hardware PMCs. To allow
> >>> all perf_events to have a chance to run, it is necessary to do expensive
> >>> time multiplexing of events.
> >>>
> >>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>> instructions). It is a waste to have multiple tools create multiple
> >>> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> Right, it'd be really helpful when the PMCs are frequently or mostly 
> >> shared.
> >> But it'd also increase the overhead for uncontended cases as BPF programs
> >> need to run on every context switch.  Depending on the workload, it may
> >> cause a non-negligible performance impact.  So users should be aware of it.
> >
> > Would be interesting to, humm, measure both cases to have a firm number
> > of the impact, how many instructions are added when sharing using
> > --bpf-counters?
> >
> > I.e. compare the "expensive time multiplexing of events" with its
> > avoidance by using --bpf-counters.
> >
> > Song, have you performed such measurements?
>
> I have got some measurements with perf-bench-sched-messaging:
>
> The system: x86_64 with 23 cores (46 HT)
>
> The perf-stat command:
> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>
> 
>
> The benchmark command and output:
> ./perf bench sched messaging -g 40 -l 5 -t
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 40 groups == 1600 threads run
>  Total time: 10X.XXX [sec]
>
>
> I use the "Total time" as measurement, so smaller number is better.
>
> For each condition, I run the command 5 times, and took the median of
> "Total time".
>
> Baseline (no perf-stat)              104.873 [sec]
> # global
> perf stat -a                         107.887 [sec]
> perf stat -a --bpf-counters          106.071 [sec]
> # per task
> perf stat                            106.314 [sec]
> perf stat --bpf-counters             105.965 [sec]
> # per cpu
> perf stat -C 1,3,5                   107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
>
> From the data, --bpf-counters is slightly better than the regular event
> for all targets. I noticed that the results are not very stable. There
> are a couple 108.xx runs in some of the conditions (w/ and w/o
> --bpf-counters).

Hmm.. so this result is when multiplexing happened, right?
I wondered how/why the regular perf stat is slower..

Thanks,
Namhyung

>
>
> I also measured the average runtime of the BPF programs, with
>
> sysctl kernel.bpf_stats_enabled=1
>
> For each event, if we have one leader and two followers, the total run
> time is about 340ns. IOW, 340ns for two perf-stat reading instructions,
> 340ns for two perf-stat reading cycles, etc.
>
> Thanks,
> Song


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Song Liu



> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>> Hi Song,
>> 
>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>>> 
>>> perf uses performance monitoring counters (PMCs) to monitor system
>>> performance. The PMCs are limited hardware resources. For example,
>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>> 
>>> Modern data center systems use these PMCs in many different ways:
>>> system level monitoring, (maybe nested) container level monitoring, per
>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>> there are more active perf_events than available hardware PMCs. To allow
>>> all perf_events to have a chance to run, it is necessary to do expensive
>>> time multiplexing of events.
>>> 
>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>> instructions). It is a waste to have multiple tools create multiple
>>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
>> But it'd also increase the overhead for uncontended cases as BPF programs
>> need to run on every context switch.  Depending on the workload, it may
>> cause a non-negligible performance impact.  So users should be aware of it.
> 
> Would be interesting to, humm, measure both cases to have a firm number
> of the impact, how many instructions are added when sharing using
> --bpf-counters?
> 
> I.e. compare the "expensive time multiplexing of events" with its
> avoidance by using --bpf-counters.
> 
> Song, have you performed such measurements?

I have got some measurements with perf-bench-sched-messaging:

The system: x86_64 with 23 cores (46 HT)

The perf-stat command:
perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <benchmark etc.>


The benchmark command and output:
./perf bench sched messaging -g 40 -l 5 -t
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 40 groups == 1600 threads run
 Total time: 10X.XXX [sec]


I use the "Total time" as measurement, so smaller number is better. 

For each condition, I run the command 5 times, and took the median of 
"Total time". 

Baseline (no perf-stat)              104.873 [sec]
# global
perf stat -a                         107.887 [sec]
perf stat -a --bpf-counters          106.071 [sec]
# per task
perf stat                            106.314 [sec]
perf stat --bpf-counters             105.965 [sec]
# per cpu
perf stat -C 1,3,5                   107.063 [sec]
perf stat -C 1,3,5 --bpf-counters    106.406 [sec]

From the data, --bpf-counters is slightly better than the regular event
for all targets. I noticed that the results are not very stable. There 
are a couple 108.xx runs in some of the conditions (w/ and w/o 
--bpf-counters).
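
(For reference, a rough sketch of how such a median-of-5 measurement can be
scripted; the system-wide condition is shown and the pipeline is illustrative,
not the exact commands used for the numbers above.)

  # run one condition 5 times and print the median "Total time"
  for i in 1 2 3 4 5; do
          perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles -a -- \
                  ./perf bench sched messaging -g 40 -l 5 -t 2>&1 |
                  awk '/Total time/ {print $3}'
  done | sort -n | sed -n '3p'    # 3rd of 5 sorted values = the median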


I also measured the average runtime of the BPF programs, with 

sysctl kernel.bpf_stats_enabled=1

For each event, if we have one leader and two followers, the total run 
time is about 340ns. IOW, 340ns for two perf-stat reading instructions, 
340ns for two perf-stat reading cycles, etc. 
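
(As a sketch of how such per-program runtimes can be read back: with
kernel.bpf_stats_enabled=1 the kernel accumulates run_time_ns and run_cnt per
BPF program, which bpftool prints, so the average cost per invocation is their
ratio; a bpftool recent enough to show these fields is assumed.)

  sysctl kernel.bpf_stats_enabled=1   # enable run-time accounting

  # while perf stat --bpf-counters is running, compute average ns per
  # invocation for each loaded program from run_time_ns / run_cnt
  bpftool prog show | awk '/run_time_ns/ {
          for (i = 1; i <= NF; i++) {
                  if ($i == "run_time_ns") t = $(i + 1)
                  if ($i == "run_cnt") c = $(i + 1)
          }
          if (c > 0) printf "%s avg %.0f ns/run\n", $1, t / c
  }'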

Thanks,
Song

Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Arnaldo Carvalho de Melo
Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> Hi Song,
> 
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> >
> > perf uses performance monitoring counters (PMCs) to monitor system
> > performance. The PMCs are limited hardware resources. For example,
> > Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >
> > Modern data center systems use these PMCs in many different ways:
> > system level monitoring, (maybe nested) container level monitoring, per
> > process monitoring, profiling (in sample mode), etc. In some cases,
> > there are more active perf_events than available hardware PMCs. To allow
> > all perf_events to have a chance to run, it is necessary to do expensive
> > time multiplexing of events.
> >
> > On the other hand, many monitoring tools count the common metrics (cycles,
> > instructions). It is a waste to have multiple tools create multiple
> > perf_events of "cycles" and occupy multiple PMCs.
> 
> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> But it'd also increase the overhead for uncontended cases as BPF programs
> need to run on every context switch.  Depending on the workload, it may
> cause a non-negligible performance impact.  So users should be aware of it.

Would be interesting to, humm, measure both cases to have a firm number
of the impact, how many instructions are added when sharing using
--bpf-counters?

I.e. compare the "expensive time multiplexing of events" with its
avoidance by using --bpf-counters.

Song, have you performed such measurements?

- Arnaldo
 
> Thanks,
> Namhyung
> 
> >
> > bperf tries to reduce such wastes by allowing multiple perf_events of
> > "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> > of having each perf-stat session to read its own perf_events, bperf uses
> > BPF programs to read the perf_events and aggregate readings to BPF maps.
> > Then, the perf-stat session(s) reads the values from these BPF maps.
> >
> > Changes v1 => v2:
> >   1. Add documentation.
> >   2. Add a shell test.
> >   3. Rename options, default path of the atto-map, and some variables.
> >   4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
> >  to after enable_counters().
> >   5. Make perf_cpu_map for all cpus a global variable.
> >   6. Use sysfs__mountpoint() for default attr-map path.
> >   7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
> >   8. Add flag "enabled" to the follower program. Then move follower attach
> >  to bperf__load() and simplify bperf__enable().
> >
> > Song Liu (3):
> >   perf-stat: introduce bperf, share hardware PMCs with BPF
> >   perf-stat: measure t0 and ref_time after enable_counters()
> >   perf-test: add a test for perf-stat --bpf-counters option
> >
> >  tools/perf/Documentation/perf-stat.txt|  11 +
> >  tools/perf/Makefile.perf  |   1 +
> >  tools/perf/builtin-stat.c |  20 +-
> >  tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
> >  tools/perf/util/bpf_counter.c | 519 +-
> >  tools/perf/util/bpf_skel/bperf.h  |  14 +
> >  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
> >  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
> >  tools/perf/util/bpf_skel/bperf_u.h|  14 +
> >  tools/perf/util/evsel.h   |  20 +-
> >  tools/perf/util/target.h  |   4 +-
> >  11 files changed, 742 insertions(+), 10 deletions(-)
> >  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
> >  create mode 100644 tools/perf/util/bpf_skel/bperf.h
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_u.h
> >
> > --
> > 2.30.2

-- 

- Arnaldo


Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-17 Thread Jiri Olsa
On Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim wrote:
> Hi Song,
> 
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
> >
> > perf uses performance monitoring counters (PMCs) to monitor system
> > performance. The PMCs are limited hardware resources. For example,
> > Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >
> > Modern data center systems use these PMCs in many different ways:
> > system level monitoring, (maybe nested) container level monitoring, per
> > process monitoring, profiling (in sample mode), etc. In some cases,
> > there are more active perf_events than available hardware PMCs. To allow
> > all perf_events to have a chance to run, it is necessary to do expensive
> > time multiplexing of events.
> >
> > On the other hand, many monitoring tools count the common metrics (cycles,
> > instructions). It is a waste to have multiple tools create multiple
> > perf_events of "cycles" and occupy multiple PMCs.
> 
> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> But it'd also increase the overhead for uncontended cases as BPF programs
> need to run on every context switch.  Depending on the workload, it may
> cause a non-negligible performance impact.  So users should be aware of it.

right, let's get get some idea of how bad that actualy is

Song,
could you please get some numbers from runnning for example
'perf bench sched messaging ...' with both normal and bpf
mode perf stat? for all supported target options

thanks,
jirka



Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-16 Thread Namhyung Kim
Hi Song,

On Wed, Mar 17, 2021 at 6:18 AM Song Liu  wrote:
>
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
>
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.

Right, it'd be really helpful when the PMCs are frequently or mostly shared.
But it'd also increase the overhead for uncontended cases as BPF programs
need to run on every context switch.  Depending on the workload, it may
cause a non-negligible performance impact.  So users should be aware of it.

Thanks,
Namhyung

>
> bperf tries to reduce such wastes by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session to read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
>
> Changes v1 => v2:
>   1. Add documentation.
>   2. Add a shell test.
>   3. Rename options, default path of the atto-map, and some variables.
>   4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
>  to after enable_counters().
>   5. Make perf_cpu_map for all cpus a global variable.
>   6. Use sysfs__mountpoint() for default attr-map path.
>   7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
>   8. Add flag "enabled" to the follower program. Then move follower attach
>  to bperf__load() and simplify bperf__enable().
>
> Song Liu (3):
>   perf-stat: introduce bperf, share hardware PMCs with BPF
>   perf-stat: measure t0 and ref_time after enable_counters()
>   perf-test: add a test for perf-stat --bpf-counters option
>
>  tools/perf/Documentation/perf-stat.txt|  11 +
>  tools/perf/Makefile.perf  |   1 +
>  tools/perf/builtin-stat.c |  20 +-
>  tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
>  tools/perf/util/bpf_counter.c | 519 +-
>  tools/perf/util/bpf_skel/bperf.h  |  14 +
>  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
>  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>  tools/perf/util/bpf_skel/bperf_u.h|  14 +
>  tools/perf/util/evsel.h   |  20 +-
>  tools/perf/util/target.h  |   4 +-
>  11 files changed, 742 insertions(+), 10 deletions(-)
>  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
>  create mode 100644 tools/perf/util/bpf_skel/bperf.h
>  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_u.h
>
> --
> 2.30.2


[PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

2021-03-16 Thread Song Liu
perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such wastes by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session to read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.
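
As a concrete sketch of the sharing (assuming the --bpf-counters option from
this series is built in; the workload, event and session count below are
arbitrary), several concurrent sessions counting the same hardware event
attach followers to one shared leader instead of each pinning its own PMC:

  # start a few concurrent system-wide sessions counting the same event;
  # with --bpf-counters they read one shared perf_event per CPU through
  # BPF maps, so adding sessions does not add multiplexing
  for i in 1 2 3; do
          perf stat --bpf-counters -e cycles -a -- sleep 5 &
  done
  wait

  # without --bpf-counters each session opens its own "cycles" event and
  # occupies a hardware counter; with enough sessions, or longer event
  # lists, this is what forces time multiplexing
  for i in 1 2 3; do
          perf stat -e cycles -a -- sleep 5 &
  done
  wait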

Changes v1 => v2:
  1. Add documentation.
  2. Add a shell test.
  3. Rename options, default path of the atto-map, and some variables.
  4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
 to after enable_counters().
  5. Make perf_cpu_map for all cpus a global variable.
  6. Use sysfs__mountpoint() for default attr-map path.
  7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
  8. Add flag "enabled" to the follower program. Then move follower attach
 to bperf__load() and simplify bperf__enable().

Song Liu (3):
  perf-stat: introduce bperf, share hardware PMCs with BPF
  perf-stat: measure t0 and ref_time after enable_counters()
  perf-test: add a test for perf-stat --bpf-counters option

 tools/perf/Documentation/perf-stat.txt|  11 +
 tools/perf/Makefile.perf  |   1 +
 tools/perf/builtin-stat.c |  20 +-
 tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
 tools/perf/util/bpf_counter.c | 519 +-
 tools/perf/util/bpf_skel/bperf.h  |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/bpf_skel/bperf_u.h|  14 +
 tools/perf/util/evsel.h   |  20 +-
 tools/perf/util/target.h  |   4 +-
 11 files changed, 742 insertions(+), 10 deletions(-)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_u.h

--
2.30.2