Re: [RFCv2 00/48] perf tools: Add threads to record command

Alexey Budankov Fri, 21 Sep 2018 05:16:56 -0700

Hello Jiri,

On 21.09.2018 9:13, Alexey Budankov wrote:
> Hello Jiri,
> 
> On 14.09.2018 12:37, Alexey Budankov wrote:
>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>
>>> SNIP
>>>
>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>> and there are probably more limitations which I haven't spotted
>>>>>> yet.
>>>>>>
>>>>>> So far I tested on laptop:
>>>>>>   http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>
>>>>>> and a one bigger server:
>>>>>>   http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>
>>>>>> I can see decrease in recorded LOST events, but both the benchmark
>>>>>> and the monitoring must be carefully configured wrt:
>>>>>>   - number of events (frequency)
>>>>>>   - size of the memory maps
>>>>>>   - size of events (callchains)
>>>>>>   - final perf.data size
>>>>>>
>>>>>> It's also available in:
>>>>>>   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>>   perf/record_threads
>>>>>>
>>>>>> thoughts? ;-) thanks
>>>>>> jirka
>>>>>
>>>>> It is preferable to split into smaller pieces that bring 
>>>>> some improvement proved by metrics numbers and ready for 
>>>>> merging and upstream. Do we have more metrics than the 
>>>>> data loss from trace AIO patches?
>>>>
>>>> well the primary focus is to get more events in,
>>>> so the LOST metric is the main one
>>>
>>> actualy I was hoping, could you please run it through the same
>>> tests as you do for AIO code on some huge server? 
>>
>> Yeah, I will, but it takes some time.
> 
> Here it is:
> 
> Hardware:
> cat /proc/cpuinfo
> processor     : 271
> vendor_id     : GenuineIntel
> cpu family    : 6
> model         : 133
> model name    : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
> stepping      : 0
> microcode     : 0xe
> cpu MHz               : 1064.235
> cache size    : 1024 KB
> physical id   : 0
> siblings      : 272
> core id               : 73
> cpu cores     : 68
> apicid                : 295
> initial apicid        : 295
> fpu           : yes
> fpu_exception : yes
> cpuid level   : 13
> wp            : yes
> flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 
> ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer 
> aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault 
> epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 
> smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm 
> ida arat pln pts avx512_vpopcntdq avx512_4vnniw avx512_4fmaps
> bugs          : cpu_meltdown spectre_v1 spectre_v2
> bogomips      : 2594.07
> clflush size  : 64
> cache_alignment       : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
> 
> uname -a
> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> cat /proc/sys/kernel/perf_event_paranoid
> 0
> 
> cat /proc/sys/kernel/perf_event_mlock_kb 
> 516
> 
> cat /proc/sys/kernel/perf_event_max_sample_rate 
> 3000
> 
> cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
> 
> Metrics:
> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
> data loss (%)        : paused_time / elapsed_time_under_profiling
> LOST events          : stat from perf report --stats
> SAMPLE events        : stat from perf report --stats
> perf.data size (B)   : size of trace file on disk
> 
> Events:
> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
> 
> =================================================
> 
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record 
> --threads=T \
>       -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>         -e cpu/period=P,event=0x3c/Duk,\
>            cpu/period=P,umask=0x3/Duk,\
>            cpu/period=P,event=0xc0/Duk,\
>            cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>            cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>            cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>          --clockid=monotonic_raw -- ./matrix.(icc|gcc)
> 
> Workload: matrix multiplication in 256 threads
> 
> /usr/bin/time ./matrix.icc
> Addr of buf1 = 0x7ff9faa73010
> Offs of buf1 = 0x7ff9faa73180
> Addr of buf2 = 0x7ff9f8a72010
> Offs of buf2 = 0x7ff9f8a721c0
> Addr of buf3 = 0x7ff9f6a71010
> Offs of buf3 = 0x7ff9f6a71100
> Addr of buf4 = 0x7ff9f4a70010
> Offs of buf4 = 0x7ff9f4a70140
> Threads #: 256 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Freq = 0.997720 GHz
> Execution time = 9.061 seconds
> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata 
> 100448maxresident)k
> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
> 
> T : 272
>         P (period, ms)       : 0.1
>       runtime overhead (%) : 45x ~ 323.54 / 7.12
>       data loss (%)        : 96
>       LOST events          : 323662
>       SAMPLE events        : 31885479
>         perf.data size (GiB) : 42
> 
>       P (period, ms)       : 0.25
>       runtime overhead (%) : 25x ~ 180.76 / 7.12
>       data loss (%)        : 69 
>       LOST events          : 10636
>       SAMPLE events        : 18692998
>         perf.data size (GiB) : 23.5
> 
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 16x ~ 119.49 / 7.12
>       data loss (%)        : 1
>       LOST events          : 6
>       SAMPLE events        : 11178524
>         perf.data size (GiB) : 14
> 
> T : 128
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 15x ~ 111.98 / 7.12
>       data loss (%)        : 62
>       LOST events          : 2825
>       SAMPLE events        : 11267247
>         perf.data size (GiB) : 15
> 
> T : 64
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 14x ~ 101.55 / 7.12
>       data loss (%)        : 67
>       LOST events          : 5155
>       SAMPLE events        : 10966297
>         perf.data size (GiB) : 13.7
> 
> Workload: matrix multiplication in 128 threads
> 
> /usr/bin/time ./matrix.gcc
> Addr of buf1 = 0x7f072e630010
> Offs of buf1 = 0x7f072e630180
> Addr of buf2 = 0x7f072c62f010
> Offs of buf2 = 0x7f072c62f1c0
> Addr of buf3 = 0x7f072a62e010
> Offs of buf3 = 0x7f072a62e100
> Addr of buf4 = 0x7f072862d010
> Offs of buf4 = 0x7f072862d140
> Threads #: 128 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 6.639 seconds
> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 
> 100756maxresident)k
> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
> 
> T : 272
>         P (period, ms)       : 0.1
>       runtime overhead (%) : 29x ~ 198.81 / 6.81
>       data loss (%)        : 21
>       LOST events          : 2502
>       SAMPLE events        : 22481062
>         perf.data size (GiB) : 27.6
> 
>       P (period, ms)       : 0.25
>       runtime overhead (%) : 13x ~ 88.47 / 6.81
>       data loss (%)        : 0
>       LOST events          : 0
>       SAMPLE events        : 9572787
>         perf.data size (GiB) : 11.3
> 
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 10x ~ 67.11 / 6.81
>       data loss (%)        : 1
>       LOST events          : 137
>       SAMPLE events        : 6985930
>         perf.data size (GiB) : 8
> 
> T : 128
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 9.5x ~ 64.33 / 6.81
>       data loss (%)        : 1
>       LOST events          : 3
>       SAMPLE events        : 6666903
>         perf.data size (GiB) : 7.8
> 
> T : 64
>       P (period, ms)       : 0.25
>       runtime overhead (%) : 17x ~ 114.27 / 6.81
>       data loss (%)        : 2
>       LOST events          : 52
>       SAMPLE events        : 12643645
>         perf.data size (GiB) : 15.5
> 
>       P (period, ms)       : 0.35 
>       runtime overhead (%) : 10x ~ 68.60 / 6.81
>       data loss (%)        : 1
>       LOST events          : 93
>       SAMPLE events        : 7164368
>         perf.data size (GiB) : 8.5


and this is for AIO and serial:

Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
        -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
         --clockid=monotonic_raw -- ./matrix.(icc|gcc)

Workload: matrix multiplication in 256 threads

 N : 512
        P (period, ms)       : 2.5
        runtime overhead (%) : 2.7x ~ 19.21 / 7.12
        data loss (%)        : 42
        LOST events          : 1600
        SAMPLE events        : 1235928
        perf.data size (GiB) : 1.5
 
 N : 272
        P (period, ms)       : 1.5
        runtime overhead (%) : 2.5x ~ 18.09 / 7.12
        data loss (%)        : 89
        LOST events          : 3457
        SAMPLE events        : 1222143
        perf.data size (GiB) : 1.5

        P (period, ms)       : 2
        runtime overhead (%) : 2.5x ~ 17.93 / 7.12
        data loss (%)        : 65
        LOST events          : 2496
        SAMPLE events        : 1240754
        perf.data size (GiB) : 1.5

        P (period, ms)       : 2.5
        runtime overhead (%) : 2.5x ~ 17.87 / 7.12
        data loss (%)        : 44
        LOST events          : 1621
        SAMPLE events        : 1221949
        perf.data size (GiB) : 1.5

        P (period, ms)       : 3
        runtime overhead (%) : 2.5x ~ 18.43 / 7.12
        data loss (%)        : 12
        LOST events          : 350
        SAMPLE events        : 1117972
        perf.data size (GiB) : 1.3
 
 N : 128
        P (period, ms)       : 3
        runtime overhead (%) : 2.4x ~ 17.08 / 7.12
        data loss (%)        : 11
        LOST events          : 335
        SAMPLE events        : 1116832
        perf.data size (GiB) : 1.3
 
 N : 64
        P (period, ms)       : 3
        runtime overhead (%) : 2.2x ~ 16.03 / 7.12
        data loss (%)        : 11
        LOST events          : 329
        SAMPLE events        : 1108205
        perf.data size (GiB) : 1.3
 
Workload: matrix multiplication in 128 threads

 N : 512
        P (period, ms)       : 1
        runtime overhead (%) : 3.5x ~ 23.72 / 6.81
        data loss (%)        : 18
        LOST events          : 1043
        SAMPLE events        : 2015306
        perf.data size (GiB) : 2.3

 N : 272
        P (period, ms)       : 0.5
        runtime overhead (%) : 3x ~ 22.72 / 6.81
        data loss (%)        : 90
        LOST events          : 5842
        SAMPLE events        : 2205937
        perf.data size (GiB) : 2.5

        P (period, ms)       : 1
        runtime overhead (%) : 3x ~ 22.79 / 6.81
        data loss (%)        : 11
        LOST events          : 481
        SAMPLE events        : 2017099
        perf.data size (GiB) : 2.5
 
        P (period, ms)       : 1.5
        runtime overhead (%) : 3x ~ 19.93 / 6.81
        data loss (%)        : 5
        LOST events          : 190
        SAMPLE events        : 1308692
        perf.data size (GiB) : 1.5
 
        P (period, ms)       : 2
        runtime overhead (%) : 3x ~ 18.95 / 6.81
        data loss (%)        : 0
        LOST events          : 0
        SAMPLE events        : 1010769
        perf.data size (GiB) : 1.2
 
 N : 128
        P (period, ms)       : 1.5
        runtime overhead (%) : 3x ~ 19.08 / 6.81
        data loss (%)        : 6
        LOST events          : 220
        SAMPLE events        : 1322240
        perf.data size (GiB) : 1.5
 
 N : 64
        P (period, ms)       : 1.5
        runtime overhead (%) : 3x ~ 19.43 / 6.81
        data loss (%)        : 3
        LOST events          : 130
        SAMPLE events        : 1386521
        perf.data size (GiB) : 1.6

=================================================

Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
        -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
        -e cpu/period=P,event=0x3c/Duk,\
           cpu/period=P,umask=0x3/Duk,\
           cpu/period=P,event=0xc0/Duk,\
           cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
           cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
         --clockid=monotonic_raw -- ./matrix.(icc|gcc)

Workload: matrix multiplication in 256 threads

        P (period, ms)       : 7.5
        runtime overhead (%) : 1.6x ~ 11.6 / 7.12
        data loss (%)        : 1
        LOST events          : 1
        SAMPLE events        : 451062
        perf.data size (GiB) : 0.5

Workload: matrix multiplication in 128 threads

        P (period, ms)       : 3
        runtime overhead (%) : 1.8x ~ 12.58 / 6.81
        data loss (%)        : 9
        LOST events          : 147
        SAMPLE events        : 673299
        perf.data size (GiB) : 0.8

Thanks,
Alexey

> 
> Thanks,
> Alexey
> 
>>
>>>
>>> thanks,
>>> jirka
>>>
>>
>

Re: [RFCv2 00/48] perf tools: Add threads to record command

Reply via email to