Hello Jiri, On 21.09.2018 9:13, Alexey Budankov wrote: > Hello Jiri, > > On 14.09.2018 12:37, Alexey Budankov wrote: >> On 14.09.2018 11:28, Jiri Olsa wrote: >>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote: >>> >>> SNIP >>> >>>>>> The threaded monitoring currently can't monitor backward maps >>>>>> and there are probably more limitations which I haven't spotted >>>>>> yet. >>>>>> >>>>>> So far I tested on laptop: >>>>>> http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt >>>>>> >>>>>> and a one bigger server: >>>>>> http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt >>>>>> >>>>>> I can see decrease in recorded LOST events, but both the benchmark >>>>>> and the monitoring must be carefully configured wrt: >>>>>> - number of events (frequency) >>>>>> - size of the memory maps >>>>>> - size of events (callchains) >>>>>> - final perf.data size >>>>>> >>>>>> It's also available in: >>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git >>>>>> perf/record_threads >>>>>> >>>>>> thoughts? ;-) thanks >>>>>> jirka >>>>> >>>>> It is preferable to split into smaller pieces that bring >>>>> some improvement proved by metrics numbers and ready for >>>>> merging and upstream. Do we have more metrics than the >>>>> data loss from trace AIO patches? >>>> >>>> well the primary focus is to get more events in, >>>> so the LOST metric is the main one >>> >>> actualy I was hoping, could you please run it through the same >>> tests as you do for AIO code on some huge server? >> >> Yeah, I will, but it takes some time. > > Here it is: > > Hardware: > cat /proc/cpuinfo > processor : 271 > vendor_id : GenuineIntel > cpu family : 6 > model : 133 > model name : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz > stepping : 0 > microcode : 0xe > cpu MHz : 1064.235 > cache size : 1024 KB > physical id : 0 > siblings : 272 > core id : 73 > cpu cores : 68 > apicid : 295 > initial apicid : 295 > fpu : yes > fpu_exception : yes > cpuid level : 13 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov > pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb > rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology > nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 > ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer > aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault > epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 > smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm > ida arat pln pts avx512_vpopcntdq avx512_4vnniw avx512_4fmaps > bugs : cpu_meltdown spectre_v1 spectre_v2 > bogomips : 2594.07 > clflush size : 64 > cache_alignment : 64 > address sizes : 46 bits physical, 48 bits virtual > power management: > > uname -a > Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64 > x86_64 x86_64 GNU/Linux > > cat /proc/sys/kernel/perf_event_paranoid > 0 > > cat /proc/sys/kernel/perf_event_mlock_kb > 516 > > cat /proc/sys/kernel/perf_event_max_sample_rate > 3000 > > cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.5 (Maipo) > > Metrics: > runtime overhead (%) : elapsed_time_under_profiling / elapsed_time > data loss (%) : paused_time / elapsed_time_under_profiling > LOST events : stat from perf report --stats > SAMPLE events : stat from perf report --stats > perf.data size (B) : size of trace file on disk > > Events: > cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD > cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC > cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY > cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL > cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD > cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD > > ================================================= > > Command: > /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record > --threads=T \ > -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \ > -e cpu/period=P,event=0x3c/Duk,\ > cpu/period=P,umask=0x3/Duk,\ > cpu/period=P,event=0xc0/Duk,\ > cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\ > cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\ > cpu/period=0x4e20,event=0xc2,umask=0x40/uk \ > --clockid=monotonic_raw -- ./matrix.(icc|gcc) > > Workload: matrix multiplication in 256 threads > > /usr/bin/time ./matrix.icc > Addr of buf1 = 0x7ff9faa73010 > Offs of buf1 = 0x7ff9faa73180 > Addr of buf2 = 0x7ff9f8a72010 > Offs of buf2 = 0x7ff9f8a721c0 > Addr of buf3 = 0x7ff9f6a71010 > Offs of buf3 = 0x7ff9f6a71100 > Addr of buf4 = 0x7ff9f4a70010 > Offs of buf4 = 0x7ff9f4a70140 > Threads #: 256 Pthreads > Matrix size: 2048 > Using multiply kernel: multiply1 > Freq = 0.997720 GHz > Execution time = 9.061 seconds > 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata > 100448maxresident)k > 96inputs+0outputs (1major+33839minor)pagefaults 0swaps > > T : 272 > P (period, ms) : 0.1 > runtime overhead (%) : 45x ~ 323.54 / 7.12 > data loss (%) : 96 > LOST events : 323662 > SAMPLE events : 31885479 > perf.data size (GiB) : 42 > > P (period, ms) : 0.25 > runtime overhead (%) : 25x ~ 180.76 / 7.12 > data loss (%) : 69 > LOST events : 10636 > SAMPLE events : 18692998 > perf.data size (GiB) : 23.5 > > P (period, ms) : 0.35 > runtime overhead (%) : 16x ~ 119.49 / 7.12 > data loss (%) : 1 > LOST events : 6 > SAMPLE events : 11178524 > perf.data size (GiB) : 14 > > T : 128 > P (period, ms) : 0.35 > runtime overhead (%) : 15x ~ 111.98 / 7.12 > data loss (%) : 62 > LOST events : 2825 > SAMPLE events : 11267247 > perf.data size (GiB) : 15 > > T : 64 > P (period, ms) : 0.35 > runtime overhead (%) : 14x ~ 101.55 / 7.12 > data loss (%) : 67 > LOST events : 5155 > SAMPLE events : 10966297 > perf.data size (GiB) : 13.7 > > Workload: matrix multiplication in 128 threads > > /usr/bin/time ./matrix.gcc > Addr of buf1 = 0x7f072e630010 > Offs of buf1 = 0x7f072e630180 > Addr of buf2 = 0x7f072c62f010 > Offs of buf2 = 0x7f072c62f1c0 > Addr of buf3 = 0x7f072a62e010 > Offs of buf3 = 0x7f072a62e100 > Addr of buf4 = 0x7f072862d010 > Offs of buf4 = 0x7f072862d140 > Threads #: 128 Pthreads > Matrix size: 2048 > Using multiply kernel: multiply1 > Execution time = 6.639 seconds > 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata > 100756maxresident)k > 88inputs+0outputs (0major+139898minor)pagefaults 0swaps > > T : 272 > P (period, ms) : 0.1 > runtime overhead (%) : 29x ~ 198.81 / 6.81 > data loss (%) : 21 > LOST events : 2502 > SAMPLE events : 22481062 > perf.data size (GiB) : 27.6 > > P (period, ms) : 0.25 > runtime overhead (%) : 13x ~ 88.47 / 6.81 > data loss (%) : 0 > LOST events : 0 > SAMPLE events : 9572787 > perf.data size (GiB) : 11.3 > > P (period, ms) : 0.35 > runtime overhead (%) : 10x ~ 67.11 / 6.81 > data loss (%) : 1 > LOST events : 137 > SAMPLE events : 6985930 > perf.data size (GiB) : 8 > > T : 128 > P (period, ms) : 0.35 > runtime overhead (%) : 9.5x ~ 64.33 / 6.81 > data loss (%) : 1 > LOST events : 3 > SAMPLE events : 6666903 > perf.data size (GiB) : 7.8 > > T : 64 > P (period, ms) : 0.25 > runtime overhead (%) : 17x ~ 114.27 / 6.81 > data loss (%) : 2 > LOST events : 52 > SAMPLE events : 12643645 > perf.data size (GiB) : 15.5 > > P (period, ms) : 0.35 > runtime overhead (%) : 10x ~ 68.60 / 6.81 > data loss (%) : 1 > LOST events : 93 > SAMPLE events : 7164368 > perf.data size (GiB) : 8.5
and this is for AIO and serial: Command: /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \ -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \ -e cpu/period=P,event=0x3c/Duk,\ cpu/period=P,umask=0x3/Duk,\ cpu/period=P,event=0xc0/Duk,\ cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\ cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\ cpu/period=0x4e20,event=0xc2,umask=0x40/uk \ --clockid=monotonic_raw -- ./matrix.(icc|gcc) Workload: matrix multiplication in 256 threads N : 512 P (period, ms) : 2.5 runtime overhead (%) : 2.7x ~ 19.21 / 7.12 data loss (%) : 42 LOST events : 1600 SAMPLE events : 1235928 perf.data size (GiB) : 1.5 N : 272 P (period, ms) : 1.5 runtime overhead (%) : 2.5x ~ 18.09 / 7.12 data loss (%) : 89 LOST events : 3457 SAMPLE events : 1222143 perf.data size (GiB) : 1.5 P (period, ms) : 2 runtime overhead (%) : 2.5x ~ 17.93 / 7.12 data loss (%) : 65 LOST events : 2496 SAMPLE events : 1240754 perf.data size (GiB) : 1.5 P (period, ms) : 2.5 runtime overhead (%) : 2.5x ~ 17.87 / 7.12 data loss (%) : 44 LOST events : 1621 SAMPLE events : 1221949 perf.data size (GiB) : 1.5 P (period, ms) : 3 runtime overhead (%) : 2.5x ~ 18.43 / 7.12 data loss (%) : 12 LOST events : 350 SAMPLE events : 1117972 perf.data size (GiB) : 1.3 N : 128 P (period, ms) : 3 runtime overhead (%) : 2.4x ~ 17.08 / 7.12 data loss (%) : 11 LOST events : 335 SAMPLE events : 1116832 perf.data size (GiB) : 1.3 N : 64 P (period, ms) : 3 runtime overhead (%) : 2.2x ~ 16.03 / 7.12 data loss (%) : 11 LOST events : 329 SAMPLE events : 1108205 perf.data size (GiB) : 1.3 Workload: matrix multiplication in 128 threads N : 512 P (period, ms) : 1 runtime overhead (%) : 3.5x ~ 23.72 / 6.81 data loss (%) : 18 LOST events : 1043 SAMPLE events : 2015306 perf.data size (GiB) : 2.3 N : 272 P (period, ms) : 0.5 runtime overhead (%) : 3x ~ 22.72 / 6.81 data loss (%) : 90 LOST events : 5842 SAMPLE events : 2205937 perf.data size (GiB) : 2.5 P (period, ms) : 1 runtime overhead (%) : 3x ~ 22.79 / 6.81 data loss (%) : 11 LOST events : 481 SAMPLE events : 2017099 perf.data size (GiB) : 2.5 P (period, ms) : 1.5 runtime overhead (%) : 3x ~ 19.93 / 6.81 data loss (%) : 5 LOST events : 190 SAMPLE events : 1308692 perf.data size (GiB) : 1.5 P (period, ms) : 2 runtime overhead (%) : 3x ~ 18.95 / 6.81 data loss (%) : 0 LOST events : 0 SAMPLE events : 1010769 perf.data size (GiB) : 1.2 N : 128 P (period, ms) : 1.5 runtime overhead (%) : 3x ~ 19.08 / 6.81 data loss (%) : 6 LOST events : 220 SAMPLE events : 1322240 perf.data size (GiB) : 1.5 N : 64 P (period, ms) : 1.5 runtime overhead (%) : 3x ~ 19.43 / 6.81 data loss (%) : 3 LOST events : 130 SAMPLE events : 1386521 perf.data size (GiB) : 1.6 ================================================= Command: /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \ -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \ -e cpu/period=P,event=0x3c/Duk,\ cpu/period=P,umask=0x3/Duk,\ cpu/period=P,event=0xc0/Duk,\ cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\ cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\ cpu/period=0x4e20,event=0xc2,umask=0x40/uk \ --clockid=monotonic_raw -- ./matrix.(icc|gcc) Workload: matrix multiplication in 256 threads P (period, ms) : 7.5 runtime overhead (%) : 1.6x ~ 11.6 / 7.12 data loss (%) : 1 LOST events : 1 SAMPLE events : 451062 perf.data size (GiB) : 0.5 Workload: matrix multiplication in 128 threads P (period, ms) : 3 runtime overhead (%) : 1.8x ~ 12.58 / 6.81 data loss (%) : 9 LOST events : 147 SAMPLE events : 673299 perf.data size (GiB) : 0.8 Thanks, Alexey > > Thanks, > Alexey > >> >>> >>> thanks, >>> jirka >>> >> >