Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 23, 2021, at 2:10 PM, Arnaldo Carvalho de Melo wrote:
>
> Em Fri, Mar 19, 2021 at 04:14:42PM +0000, Song Liu escreveu:
>> [...]
>> Here is updated 3/3, where we use perf-bench instead of stressapptest.
>
> I had to apply this updated 3/3 manually, as there was some munging, its
> all now at:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core
>
> Please take a look at the "Committer testing" section I added to the
> main patch, introducing bperf:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=tmp.perf/core&id=7fac83aaf2eecc9e7e7b72da694c49bb4ce7fdfc
>
> And check if I made any mistake or if something else could be added.
>
> It'll move to perf/core after my set of automated tests finishes.

Thanks Arnaldo! Looks great!

Song
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
Em Fri, Mar 19, 2021 at 04:14:42PM +0000, Song Liu escreveu:
> > On Mar 19, 2021, at 8:58 AM, Namhyung Kim wrote:
> > [...]
> > I have a concern for the per cpu target, but it can be done later, so
> >
> > Acked-by: Namhyung Kim
>
> Hi Namhyung, Jiri, and Arnaldo,
>
> Thanks a lot for your kind review.
>
> Here is updated 3/3, where we use perf-bench instead of stressapptest.

I had to apply this updated 3/3 manually, as there was some munging, its
all now at:

  https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.perf/core

Please take a look at the "Committer testing" section I added to the
main patch, introducing bperf:

  https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/?h=tmp.perf/core&id=7fac83aaf2eecc9e7e7b72da694c49bb4ce7fdfc

And check if I made any mistake or if something else could be added.

It'll move to perf/core after my set of automated tests finishes.

- Arnaldo
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 19, 2021, at 8:58 AM, Namhyung Kim wrote:
>
> Hi Arnaldo,
>
> On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo wrote:
>> [...]
>> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
>> a look again now but I want to have this merged on perf/core so that I
>> can work on a new BPF SKEL to use this:
>
> I have a concern for the per cpu target, but it can be done later, so
>
> Acked-by: Namhyung Kim
>
>> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable
>
> Interesting! Actually I was thinking about the similar too. :)

Hi Namhyung, Jiri, and Arnaldo,

Thanks a lot for your kind review.

Here is updated 3/3, where we use perf-bench instead of stressapptest.
Thanks,
Song

From cc79d161be9c9d24198f7e35b50058a6e15076fd Mon Sep 17 00:00:00 2001
From: Song Liu
Date: Tue, 16 Mar 2021 00:19:53 -0700
Subject: [PATCH v3 3/3] perf-test: add a test for perf-stat --bpf-counters
 option

Add a test to compare the output of perf-stat with and without option
--bpf-counters. If the difference is more than 10%, the test is
considered as failed.

Signed-off-by: Song Liu
---
 tools/perf/tests/shell/stat_bpf_counters.sh | 31 +++++++++++++++++++++++
 1 file changed, 31 insertions(+)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh

diff --git a/tools/perf/tests/shell/stat_bpf_counters.sh b/tools/perf/tests/shell/stat_bpf_counters.sh
new file mode 100755
index 0000000000000..7aabf177ce8d1
--- /dev/null
+++ b/tools/perf/tests/shell/stat_bpf_counters.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+# perf stat --bpf-counters test
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+# check whether $2 is within +/- 10% of $1
+compare_number()
+{
+	first_num=$1
+	second_num=$2
+
+	# upper bound is first_num * 110%
+	upper=$(( $first_num + $first_num / 10 ))
+	# lower bound is first_num * 90%
+	lower=$(( $first_num - $first_num / 10 ))
+
+	if [ $second_num -gt $upper ] || [ $second_num -lt $lower ]; then
+		echo "The difference between $first_num and $second_num are greater than 10%."
+		exit 1
+	fi
+}
+
+# skip if --bpf-counters is not supported
+perf stat --bpf-counters true > /dev/null 2>&1 || exit 2
+
+base_cycles=$(perf stat --no-big-num -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+bpf_cycles=$(perf stat --no-big-num --bpf-counters -e cycles -- perf bench sched messaging -g 1 -l 100 -t 2>&1 | awk '/cycles/ {print $1}')
+
+compare_number $base_cycles $bpf_cycles
+exit 0
-- 
2.30.2
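For reference, the 10% tolerance logic from the test above can be exercised on its own. A standalone sketch (not part of the patch; the sample inputs are stand-in numbers, not measurements from the thread):

```shell
#!/bin/sh
# Standalone sketch of the patch's compare_number() check: succeed when
# $2 is within +/- 10% of $1, using the same integer arithmetic.
compare_number()
{
	first_num=$1
	second_num=$2
	upper=$(( first_num + first_num / 10 ))   # ~110% of first_num
	lower=$(( first_num - first_num / 10 ))   # ~90% of first_num
	if [ "$second_num" -gt "$upper" ] || [ "$second_num" -lt "$lower" ]; then
		echo "differ by more than 10%"
		return 1
	fi
	echo "within 10%"
}

compare_number 104873 120000   # prints "differ by more than 10%"
compare_number 104873 106406   # prints "within 10%"
```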
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
Hi Arnaldo,

On Sat, Mar 20, 2021 at 12:35 AM Arnaldo Carvalho de Melo wrote:
> [...]
> The difference should be small enough that for people that use this in a
> machine where contention happens most of the time, setting a
> ~/.perfconfig to use it by default should be advantageous, i.e. no need
> to use --bpf-counters on the command line all the time.
>
> So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
> a look again now but I want to have this merged on perf/core so that I
> can work on a new BPF SKEL to use this:

I have a concern for the per cpu target, but it can be done later, so

Acked-by: Namhyung Kim

> https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

Interesting! Actually I was thinking about the similar too. :)

Thanks,
Namhyung
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
Em Fri, Mar 19, 2021 at 09:54:59AM +0900, Namhyung Kim escreveu:
> On Fri, Mar 19, 2021 at 9:22 AM Song Liu wrote:
> > [...]
> > Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> > doesn't need it.
>
> Yep, so for uncontended cases, normal perf should be the same as the
> baseline (faster than the bperf). But for contended cases, the bperf
> works faster.

The difference should be small enough that for people that use this in a
machine where contention happens most of the time, setting a
~/.perfconfig to use it by default should be advantageous, i.e. no need
to use --bpf-counters on the command line all the time.

So, Namhyung, can I take that as an Acked-by or a Reviewed-by? I'll take
a look again now but I want to have this merged on perf/core so that I
can work on a new BPF SKEL to use this:

  https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=tmp.bpf/bpf_perf_enable

:-)

- Arnaldo
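[A sketch of what that ~/.perfconfig default could look like. The `[stat]` section and `bpf-counter-events` key below are assumptions for illustration — perf later gained a config variable along these lines, but confirm the supported keys with `perf config --list` on your build:]

```ini
# ~/.perfconfig -- illustrative sketch; the key name is an assumption,
# verify with `perf config --list` on your perf build.
[stat]
	bpf-counter-events = cycles,instructions
```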
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On Fri, Mar 19, 2021 at 9:22 AM Song Liu wrote:
> > On Mar 18, 2021, at 5:09 PM, Arnaldo wrote:
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> >>> [...]
> >>> # per cpu
> >>> perf stat -C 1,3,5                 107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters  106.406 [sec]
> >>
> >> I can't see why it's actualy faster than normal perf ;-)
> >> would be worth to find out
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be the same as the
baseline (faster than the bperf). But for contended cases, the bperf
works faster.

Thanks,
Namhyung
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 18, 2021, at 5:09 PM, Arnaldo wrote:
>
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
>>> [...]
>>> # per cpu
>>> perf stat -C 1,3,5                 107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters  106.406 [sec]
>>
>> I can't see why it's actualy faster than normal perf ;-)
>> would be worth to find out
>
> Isn't this all about contended cases?

Yeah, the normal perf is doing time multiplexing; while --bpf-counters
doesn't need it.

Thanks,
Song
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa wrote:
> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
>> [...]
>> For each condition, I run the command 5 times, and took the median of
>> "Total time".
>>
>> Baseline (no perf-stat)             104.873 [sec]
>> # global
>> perf stat -a                        107.887 [sec]
>> perf stat -a --bpf-counters         106.071 [sec]
>> # per task
>> perf stat                           106.314 [sec]
>> perf stat --bpf-counters            105.965 [sec]
>> # per cpu
>> perf stat -C 1,3,5                  107.063 [sec]
>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>
> I can't see why it's actualy faster than normal perf ;-)
> would be worth to find out

Isn't this all about contended cases?

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo wrote:
> > [...]
> > I.e. compare the "expensive time multiplexing of events" with its
> > avoidance by using --bpf-counters.
> >
> > Song, have you perfmormed such measurements?
>
> I have got some measurements with perf-bench-sched-messaging:
>
> The system: x86_64 with 23 cores (46 HT)
>
> [...]
>
> For each condition, I run the command 5 times, and took the median of
> "Total time".
>
> Baseline (no perf-stat)             104.873 [sec]
> # global
> perf stat -a                        107.887 [sec]
> perf stat -a --bpf-counters         106.071 [sec]
> # per task
> perf stat                           106.314 [sec]
> perf stat --bpf-counters            105.965 [sec]
> # per cpu
> perf stat -C 1,3,5                  107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]

I can't see why it's actualy faster than normal perf ;-)
would be worth to find out

jirka

> From the data, --bpf-counters is slightly better than the regular event
> for all targets. I noticed that the results are not very stable. There
> are a couple 108.xx runs in some of the conditions (w/ and w/o
> --bpf-counters).
>
> I also measured the average runtime of the BPF programs, with
>
>   sysctl kernel.bpf_stats_enabled=1
>
> For each event, if we have one leader and two followers, the total run
> time is about 340ns. IOW, 340ns for two perf-stat reading instructions,
> 340ns for two perf-stat reading cycles, etc.
>
> Thanks,
> Song
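[Song's methodology quoted above — "run the command 5 times, and took the median of Total time" — is easy to script. A minimal sketch of the median step; the five sample values below are illustrative timings, not measurements from the thread:]

```shell
#!/bin/sh
# Pick the median of an odd number of numeric arguments, as in
# "run 5 times, take the median of Total time".
median_of()
{
	printf '%s\n' "$@" | sort -n | awk -v n=$# 'NR == (n + 1) / 2'
}

# e.g. five "Total time" samples from ./perf bench sched messaging -g 40 -l 5 -t
median_of 107.063 106.406 108.120 106.900 107.300   # prints 107.063
```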
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 17, 2021, at 9:32 PM, Namhyung Kim wrote:
>
> On Thu, Mar 18, 2021 at 12:52 PM Song Liu wrote:
>> [...]
>> From the data, --bpf-counters is slightly better than the regular event
>> for all targets. I noticed that the results are not very stable. There
>> are a couple 108.xx runs in some of the conditions (w/ and w/o
>> --bpf-counters).
>
> Hmm.. so this result is when multiplexing happened, right?
> I wondered how/why the regular perf stat is slower..

I should have made this more clear. This is when regular perf-stat does
time multiplexing (2x ref-cycles on Intel). OTOH, bpf-counters enables
sharing, so there is no time multiplexing. IOW, this is overhead of BPF
vs. overhead of time multiplexing.

Thanks,
Song
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On Thu, Mar 18, 2021 at 12:52 PM Song Liu wrote:
> > On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo wrote:
> > [...]
> > Song, have you perfmormed such measurements?
>
> I have got some measurements with perf-bench-sched-messaging:
>
> [...]
>
> Baseline (no perf-stat)             104.873 [sec]
> # global
> perf stat -a                        107.887 [sec]
> perf stat -a --bpf-counters         106.071 [sec]
> # per task
> perf stat                           106.314 [sec]
> perf stat --bpf-counters            105.965 [sec]
> # per cpu
> perf stat -C 1,3,5                  107.063 [sec]
> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
>
> From the data, --bpf-counters is slightly better than the regular event
> for all targets. I noticed that the results are not very stable. There
> are a couple 108.xx runs in some of the conditions (w/ and w/o
> --bpf-counters).

Hmm.. so this result is when multiplexing happened, right?
I wondered how/why the regular perf stat is slower..

Thanks,
Namhyung

> I also measured the average runtime of the BPF programs, with
>
>   sysctl kernel.bpf_stats_enabled=1
>
> For each event, if we have one leader and two followers, the total run
> time is about 340ns. IOW, 340ns for two perf-stat reading instructions,
> 340ns for two perf-stat reading cycles, etc.
>
> Thanks,
> Song
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo wrote:
>
> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>> Hi Song,
>>
>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu wrote:
>>>
>>> perf uses performance monitoring counters (PMCs) to monitor system
>>> performance. The PMCs are limited hardware resources. For example,
>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>
>>> Modern data center systems use these PMCs in many different ways:
>>> system level monitoring, (maybe nested) container level monitoring, per
>>> process monitoring, profiling (in sample mode), etc. In some cases,
>>> there are more active perf_events than available hardware PMCs. To allow
>>> all perf_events to have a chance to run, it is necessary to do expensive
>>> time multiplexing of events.
>>>
>>> On the other hand, many monitoring tools count the common metrics (cycles,
>>> instructions). It is a waste to have multiple tools create multiple
>>> perf_events of "cycles" and occupy multiple PMCs.
>>
>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
>> But it'd also increase the overhead for uncontended cases as BPF programs
>> need to run on every context switch. Depending on the workload, it may
>> cause a non-negligible performance impact. So users should be aware of it.
>
> Would be interesting to, humm, measure both cases to have a firm number
> of the impact, how many instructions are added when sharing using
> --bpf-counters?
>
> I.e. compare the "expensive time multiplexing of events" with its
> avoidance by using --bpf-counters.
>
> Song, have you performed such measurements?

I have got some measurements with perf-bench-sched-messaging:

The system: x86_64 with 23 cores (46 HT)

The perf-stat command:
perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles

The benchmark command and output:
./perf bench sched messaging -g 40 -l 5 -t
# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 40 groups == 1600 threads run
Total time: 10X.XXX [sec]

I use the "Total time" as the measurement, so a smaller number is better.
For each condition, I ran the command 5 times and took the median of
"Total time".

Baseline (no perf-stat)              104.873 [sec]
# global
perf stat -a                         107.887 [sec]
perf stat -a --bpf-counters          106.071 [sec]
# per task
perf stat                            106.314 [sec]
perf stat --bpf-counters             105.965 [sec]
# per cpu
perf stat -C 1,3,5                   107.063 [sec]
perf stat -C 1,3,5 --bpf-counters    106.406 [sec]

From the data, --bpf-counters is slightly better than the regular event
for all targets. I noticed that the results are not very stable. There
are a couple of 108.xx runs in some of the conditions (w/ and w/o
--bpf-counters).

I also measured the average runtime of the BPF programs, with

  sysctl kernel.bpf_stats_enabled=1

For each event, if we have one leader and two followers, the total run
time is about 340ns. IOW, 340ns for two perf-stat reading instructions,
340ns for two perf-stat reading cycles, etc.

Thanks,
Song
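[Editor's note: the methodology above (five runs per condition, median of
"Total time") can be sketched as a small script. The perf invocations are
commented out since they depend on a built perf tree, and `median_of_five`
is a hypothetical helper, not part of the patch set:]

```shell
# Hypothetical helper: median of five numeric samples.
median_of_five() {
    printf '%s\n' "$@" | sort -n | sed -n '3p'
}

# For each condition, something like:
# for i in 1 2 3 4 5; do
#     ./perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles \
#         -- ./perf bench sched messaging -g 40 -l 5 -t 2>&1 |
#         awk '/^Total time:/ { print $3 }'
# done
# ...then feed the five collected samples to median_of_five.

median_of_five 104.873 107.887 106.071 106.314 105.965   # prints 106.071
```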
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> Hi Song,
>
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu wrote:
> >
> > perf uses performance monitoring counters (PMCs) to monitor system
> > performance. The PMCs are limited hardware resources. For example,
> > Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >
> > Modern data center systems use these PMCs in many different ways:
> > system level monitoring, (maybe nested) container level monitoring, per
> > process monitoring, profiling (in sample mode), etc. In some cases,
> > there are more active perf_events than available hardware PMCs. To allow
> > all perf_events to have a chance to run, it is necessary to do expensive
> > time multiplexing of events.
> >
> > On the other hand, many monitoring tools count the common metrics (cycles,
> > instructions). It is a waste to have multiple tools create multiple
> > perf_events of "cycles" and occupy multiple PMCs.
>
> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> But it'd also increase the overhead for uncontended cases as BPF programs
> need to run on every context switch. Depending on the workload, it may
> cause a non-negligible performance impact. So users should be aware of it.

Would be interesting to, humm, measure both cases to have a firm number
of the impact, how many instructions are added when sharing using
--bpf-counters?

I.e. compare the "expensive time multiplexing of events" with its
avoidance by using --bpf-counters.

Song, have you performed such measurements?

- Arnaldo

> Thanks,
> Namhyung
>
> >
> > bperf tries to reduce such waste by allowing multiple perf_events of
> > "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> > of having each perf-stat session read its own perf_events, bperf uses
> > BPF programs to read the perf_events and aggregate readings to BPF maps.
> > Then, the perf-stat session(s) reads the values from these BPF maps.
> >
> > Changes v1 => v2:
> > 1. Add documentation.
> > 2. Add a shell test.
> > 3. Rename options, default path of the attr-map, and some variables.
> > 4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
> >    to after enable_counters().
> > 5. Make perf_cpu_map for all cpus a global variable.
> > 6. Use sysfs__mountpoint() for default attr-map path.
> > 7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
> > 8. Add flag "enabled" to the follower program. Then move follower attach
> >    to bperf__load() and simplify bperf__enable().
> >
> > Song Liu (3):
> >   perf-stat: introduce bperf, share hardware PMCs with BPF
> >   perf-stat: measure t0 and ref_time after enable_counters()
> >   perf-test: add a test for perf-stat --bpf-counters option
> >
> >  tools/perf/Documentation/perf-stat.txt        |  11 +
> >  tools/perf/Makefile.perf                      |   1 +
> >  tools/perf/builtin-stat.c                     |  20 +-
> >  tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
> >  tools/perf/util/bpf_counter.c                 | 519 +-
> >  tools/perf/util/bpf_skel/bperf.h              |  14 +
> >  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
> >  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
> >  tools/perf/util/bpf_skel/bperf_u.h            |  14 +
> >  tools/perf/util/evsel.h                       |  20 +-
> >  tools/perf/util/target.h                      |   4 +-
> >  11 files changed, 742 insertions(+), 10 deletions(-)
> >  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
> >  create mode 100644 tools/perf/util/bpf_skel/bperf.h
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
> >  create mode 100644 tools/perf/util/bpf_skel/bperf_u.h
> >
> > --
> > 2.30.2

--
- Arnaldo
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim wrote:
> Hi Song,
>
> On Wed, Mar 17, 2021 at 6:18 AM Song Liu wrote:
> >
> > perf uses performance monitoring counters (PMCs) to monitor system
> > performance. The PMCs are limited hardware resources. For example,
> > Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >
> > Modern data center systems use these PMCs in many different ways:
> > system level monitoring, (maybe nested) container level monitoring, per
> > process monitoring, profiling (in sample mode), etc. In some cases,
> > there are more active perf_events than available hardware PMCs. To allow
> > all perf_events to have a chance to run, it is necessary to do expensive
> > time multiplexing of events.
> >
> > On the other hand, many monitoring tools count the common metrics (cycles,
> > instructions). It is a waste to have multiple tools create multiple
> > perf_events of "cycles" and occupy multiple PMCs.
>
> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> But it'd also increase the overhead for uncontended cases as BPF programs
> need to run on every context switch. Depending on the workload, it may
> cause a non-negligible performance impact. So users should be aware of it.

right, let's get some idea of how bad that actually is

Song, could you please get some numbers from running for example
'perf bench sched messaging ...' with both normal and bpf mode
perf stat? for all supported target options

thanks,
jirka
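[Editor's note: Jiri's request, normal vs. bpf mode for every supported
target, amounts to a small run matrix. A sketch under the assumption that
the three targets are system-wide, per-task, and per-cpu; `gen_runs` is a
hypothetical generator that only prints the commands to execute:]

```shell
# Hypothetical generator for the six measurement runs: three targets
# (system-wide, per-task, per-cpu) x two modes (regular, --bpf-counters).
gen_runs() {
    for target in "-a" "" "-C 1,3,5"; do
        for mode in "" "--bpf-counters"; do
            echo "perf stat $target $mode -- perf bench sched messaging -g 40 -l 5 -t"
        done
    done
}
gen_runs
```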
Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
Hi Song,

On Wed, Mar 17, 2021 at 6:18 AM Song Liu wrote:
>
> perf uses performance monitoring counters (PMCs) to monitor system
> performance. The PMCs are limited hardware resources. For example,
> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>
> Modern data center systems use these PMCs in many different ways:
> system level monitoring, (maybe nested) container level monitoring, per
> process monitoring, profiling (in sample mode), etc. In some cases,
> there are more active perf_events than available hardware PMCs. To allow
> all perf_events to have a chance to run, it is necessary to do expensive
> time multiplexing of events.
>
> On the other hand, many monitoring tools count the common metrics (cycles,
> instructions). It is a waste to have multiple tools create multiple
> perf_events of "cycles" and occupy multiple PMCs.

Right, it'd be really helpful when the PMCs are frequently or mostly shared.
But it'd also increase the overhead for uncontended cases as BPF programs
need to run on every context switch. Depending on the workload, it may
cause a non-negligible performance impact. So users should be aware of it.

Thanks,
Namhyung

> bperf tries to reduce such waste by allowing multiple perf_events of
> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> of having each perf-stat session read its own perf_events, bperf uses
> BPF programs to read the perf_events and aggregate readings to BPF maps.
> Then, the perf-stat session(s) reads the values from these BPF maps.
>
> Changes v1 => v2:
> 1. Add documentation.
> 2. Add a shell test.
> 3. Rename options, default path of the attr-map, and some variables.
> 4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
>    to after enable_counters().
> 5. Make perf_cpu_map for all cpus a global variable.
> 6. Use sysfs__mountpoint() for default attr-map path.
> 7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
> 8. Add flag "enabled" to the follower program. Then move follower attach
>    to bperf__load() and simplify bperf__enable().
>
> Song Liu (3):
>   perf-stat: introduce bperf, share hardware PMCs with BPF
>   perf-stat: measure t0 and ref_time after enable_counters()
>   perf-test: add a test for perf-stat --bpf-counters option
>
>  tools/perf/Documentation/perf-stat.txt        |  11 +
>  tools/perf/Makefile.perf                      |   1 +
>  tools/perf/builtin-stat.c                     |  20 +-
>  tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
>  tools/perf/util/bpf_counter.c                 | 519 +-
>  tools/perf/util/bpf_skel/bperf.h              |  14 +
>  tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
>  tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>  tools/perf/util/bpf_skel/bperf_u.h            |  14 +
>  tools/perf/util/evsel.h                       |  20 +-
>  tools/perf/util/target.h                      |   4 +-
>  11 files changed, 742 insertions(+), 10 deletions(-)
>  create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
>  create mode 100644 tools/perf/util/bpf_skel/bperf.h
>  create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>  create mode 100644 tools/perf/util/bpf_skel/bperf_u.h
>
> --
> 2.30.2
[PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
perf uses performance monitoring counters (PMCs) to monitor system
performance. The PMCs are limited hardware resources. For example,
Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.

Modern data center systems use these PMCs in many different ways:
system level monitoring, (maybe nested) container level monitoring, per
process monitoring, profiling (in sample mode), etc. In some cases,
there are more active perf_events than available hardware PMCs. To allow
all perf_events to have a chance to run, it is necessary to do expensive
time multiplexing of events.

On the other hand, many monitoring tools count the common metrics (cycles,
instructions). It is a waste to have multiple tools create multiple
perf_events of "cycles" and occupy multiple PMCs.

bperf tries to reduce such waste by allowing multiple perf_events of
"cycles" or "instructions" (at different scopes) to share PMUs. Instead
of having each perf-stat session read its own perf_events, bperf uses
BPF programs to read the perf_events and aggregate readings to BPF maps.
Then, the perf-stat session(s) reads the values from these BPF maps.

Changes v1 => v2:
1. Add documentation.
2. Add a shell test.
3. Rename options, default path of the attr-map, and some variables.
4. Add a separate patch that moves clock_gettime() in __run_perf_stat()
   to after enable_counters().
5. Make perf_cpu_map for all cpus a global variable.
6. Use sysfs__mountpoint() for default attr-map path.
7. Use cpu__max_cpu() instead of libbpf_num_possible_cpus().
8. Add flag "enabled" to the follower program. Then move follower attach
   to bperf__load() and simplify bperf__enable().

Song Liu (3):
  perf-stat: introduce bperf, share hardware PMCs with BPF
  perf-stat: measure t0 and ref_time after enable_counters()
  perf-test: add a test for perf-stat --bpf-counters option

 tools/perf/Documentation/perf-stat.txt        |  11 +
 tools/perf/Makefile.perf                      |   1 +
 tools/perf/builtin-stat.c                     |  20 +-
 tools/perf/tests/shell/stat_bpf_counters.sh   |  34 ++
 tools/perf/util/bpf_counter.c                 | 519 +-
 tools/perf/util/bpf_skel/bperf.h              |  14 +
 tools/perf/util/bpf_skel/bperf_follower.bpf.c |  69 +++
 tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
 tools/perf/util/bpf_skel/bperf_u.h            |  14 +
 tools/perf/util/evsel.h                       |  20 +-
 tools/perf/util/target.h                      |   4 +-
 11 files changed, 742 insertions(+), 10 deletions(-)
 create mode 100755 tools/perf/tests/shell/stat_bpf_counters.sh
 create mode 100644 tools/perf/util/bpf_skel/bperf.h
 create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
 create mode 100644 tools/perf/util/bpf_skel/bperf_u.h

--
2.30.2
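[Editor's note: the shell test mentioned in the cover letter needs some way
to decide that counts from a --bpf-counters run agree with a regular run. A
sketch of that kind of comparison, where the `within_tolerance` helper and
the 20% threshold are assumptions for illustration, not the actual logic of
stat_bpf_counters.sh:]

```shell
# Hypothetical check: accept two event counts (e.g. cycles with and
# without --bpf-counters) if they differ by at most ~20% of the first.
within_tolerance() {
    a=$1; b=$2
    diff=$(( a > b ? a - b : b - a ))
    [ $(( diff * 5 )) -le "$a" ]
}

within_tolerance 1000000 1100000 && echo close || echo far   # prints close
```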