Hi,
This is a gentle reminder regarding the patch set below.
Thanks,
Alexey
On 18.03.2019 20:36, Alexey Budankov wrote:
>
> The patch set implements runtime trace compression (-z option) in
> record mode and trace auto decompression in report and inject modes.
> Streaming Zstd API [1] is used for compression and decompression of
> data that come from kernel mmaped data buffers.
>
> Usage of implemented -z,--compression_level=n option provides ~3-5x
> avg. trace file size reduction on variety of tested workloads what
> saves storage space on larger server systems where trace file size
> can easily reach several tens or even hundreds of GiBs, especially
> when profiling with dwarf-based stacks and tracing of context switches.
> Default option value is 1 (fastest compression).
>
> Implemented --mmap-flush option can be used to specify minimal size
> of data chunk that is extracted from mmaped kernel buffer to store
> into a trace. The option is independent from -z setting and doesn't
> vary with compression level. The default option value is 1 byte what
> means every time trace writing thread finds some new data in the
> mmaped buffer the data is extracted, possibly compressed and written
> to a trace. The option serves two purposes the first one is to increase
> the compression ratio of trace data and the second one is to avoid
> live-lock self tool process monitoring in system wide (-a) profiling
> mode. Profiling in system wide mode with compression (-a -z) can
> additionally induce data into the kernel buffers along with the data
> from monitored processes. If performance data rate and volume from
> the monitored processes is high then trace streaming and compression
> activity in the tool is also high. It can lead to subtle live-lock
> effect of endless activity when compression of single new byte from
> some of mmaped kernel buffer induces the next single byte at some
> mmaped buffer. So perf tool thread never stops on polling event file
> descriptors. Varying data chunk size to be extracted from mmap buffers
> allows avoiding live-locking self monitoring in system wide mode and
> makes mmap buffers polling loop manageable. Possible usage examples:
>
> $ tools/perf/perf record -z -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio -z -e cycles -- matrix.gcc
> $ tools/perf/perf record -z --mmap-flush 1024 -e cycles -- matrix.gcc
> $ tools/perf/perf record --aio -z --mmap-flush 1K -e cycles -- matrix.gcc
>
> Runtime compression overhead has been measured for serial and AIO
> trace writing modes when profiling matrix multiplication workload:
>
> -------------------------------------------------------------
> | SERIAL | AIO-1 |
> ----|-----------------------------|-----------------------------|
> |-z | OVH(x) | ratio(x) size(MiB) | OVH(x) | ratio(x) size(MiB) |
> |---|--------|--------------------|--------|--------------------|
> | 0 | 1,00 | 1,000 179,424 | 1,00 | 1,000 187,527 |
> | 1 | 1,04 | 8,427 181,148 | 1,01 | 8,474 188,562 |
> | 2 | 1,07 | 8,055 186,953 | 1,03 | 7,912 191,773 |
> | 3 | 1,04 | 8,283 181,908 | 1,03 | 8,220 191,078 |
> | 5 | 1,09 | 8,101 187,705 | 1,05 | 7,780 190,065 |
> | 8 | 1,05 | 9,217 179,191 | 1,12 | 6,111 193,024 |
> -----------------------------------------------------------------
>
> OVH = (Execution time with -z N) / (Execution time with -z 0)
>
> ratio - compression ratio
> size - number of bytes that was compressed
>
> size ~= trace file x ratio
>
> See complete description of measurement conditions with details below.
>
> Introduced compression functionality can be disabled or configured from
> the command line using NO_LIBZSTD and LIBZSTD_DIR defines:
>
> $ make -C tools/perf NO_LIBZSTD=1 clean all
> $ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
>
> If your system has some version of the zstd package preinstalled then
> the build system finds and uses it during the build. Auto detection
> feature status is reported just before compilation starts, as usual.
> If you still prefer to compile with some other version of zstd you have
> capability to refer the compilation to that version using LIBZSTD_DIR
> define.
>
> See 'perf test' results below for enabled and disabled (NO_LIBZSTD=1)
> feature configurations.
>
> ---
> Alexey Budankov (12):
> feature: implement libzstd check, LIBZSTD_DIR and NO_LIBZSTD defines
> perf record: implement --mmap-flush=<number> option
> perf session: define bytes_transferred and bytes_compressed metrics
> perf record: implement COMPRESSED event record and its attributes
> perf mmap: implement dedicated memory buffer for data compression
> perf util: introduce Zstd streaming based compression API
> perf record: implement compression for serial trace streaming
> perf record: implement compression for AIO trace streaming
> perf record: implement -z,--compression_level[=<n>] option
> perf report: implement record trace decompression
> perf inject: enable COMPRESSED records decompression
> perf tests: implement Zstd comp/decomp integration test
>
> tools/build/Makefile.feature | 6 +-
> tools/build/feature/Makefile | 6 +-
> tools/build/feature/test-all.c | 5 +
> tools/build/feature/test-libzstd.c | 12 +
> tools/perf/Documentation/perf-record.txt | 17 ++
> .../Documentation/perf.data-file-format.txt | 24 ++
> tools/perf/Makefile.config | 20 ++
> tools/perf/Makefile.perf | 3 +
> tools/perf/builtin-inject.c | 4 +
> tools/perf/builtin-record.c | 285 +++++++++++++++---
> tools/perf/builtin-report.c | 5 +-
> tools/perf/builtin-version.c | 2 +
> tools/perf/perf.h | 2 +
> .../tests/shell/record+zstd_comp_decomp.sh | 35 +++
> tools/perf/util/Build | 2 +
> tools/perf/util/compress.h | 54 ++++
> tools/perf/util/env.h | 11 +
> tools/perf/util/event.c | 1 +
> tools/perf/util/event.h | 7 +
> tools/perf/util/evlist.c | 8 +-
> tools/perf/util/evlist.h | 3 +-
> tools/perf/util/header.c | 55 +++-
> tools/perf/util/header.h | 1 +
> tools/perf/util/mmap.c | 106 ++-----
> tools/perf/util/mmap.h | 17 +-
> tools/perf/util/session.c | 129 +++++++-
> tools/perf/util/session.h | 14 +
> tools/perf/util/tool.h | 2 +
> tools/perf/util/zstd.c | 111 +++++++
> 29 files changed, 813 insertions(+), 134 deletions(-)
> create mode 100644 tools/build/feature/test-libzstd.c
> create mode 100755 tools/perf/tests/shell/record+zstd_comp_decomp.sh
> create mode 100644 tools/perf/util/zstd.c
>
> ---
> Changes in v10:
> - separated decomp list deallocation into perf_session__release_decomp_events
> - extended the test with suggested decompression validation
>
> Changes in v9:
> - fixed issue with improper max COMPRESSED record size calculation
> - moved up calculation of ratio variable in 03/12
> - made minor corrections in changelogs
> - corrected several checkpatch.pl warnings and errors
>
> Changes in v8:
> - avoid using -f for --mmap-flush option
> - move stubs to compress.h and avoid unconditional compiling of zstd.c
> - fixed silent interruption for perf record collection
> - implemented -z 1 as default
>
> Changes in v7:
> - rebased to Arnaldo's perf/core tip
> - implemented B/K/M/G suffixes for -f option
> - reworked record__mmap_read_evlist() to replace perf_mmap__aio_push()
> by perf_mmap__push() in AIO case
> - extended "[ perf record: Captured ... ]" message with compression statistics
> - extended changelog for v5 06/10
> - used PERF_SAMPLE_MAX_SIZE for compressed record size calculations
> - renamed record__zstd_compress to zstd_compress and
> record__process_comp_header to process_comp_header
> - separated nr_cblocks_max applying
>
> Changes in v6:
> - extended docs with description of PERF_RECORD_COMPRESSED record and
> HEADER_COMPRESSED feature layouts
>
> Changes in v5:
> - implemented perf version --build-options extension for aio and zstd - see
> TESTING below
> - adjusted commit message and perf-record.txt content for -f option
> - fixed build errors in case of NO_AIO=1 and NO_LIBZSTD=1
>
> Changes in v4:
> - implemented integration tests
> - adjusted zstd_ stub functions
> - rebased on tip of Arnaldo's perf/core
>
> Changes in v3:
> - moved -f,--mmap-flush option implementation into a separate patch
> - moved definition and printing of bytes_transferred and bytes_compressed
> into a separate patch
> - moved COMPRESSED feature into a separate patch
> - added versioning and stored COMPRESSED feature attributes as u32
> - implemented dedicated memory buffer for compression in case of serial
> streaming
> - moved low level Zstd based compression functions into
> util/{compress.h,zstd.c}
> - made compress function to be a param of __push(), __aio_push() functions
> - enabled perf inject to decompress COMPRESSED records
> - measured compression overhead for serial and AIO streaming using
> basic matrix multiplication workload on 8 core skylake
>
> Changes in v2:
> - moved compression/decompression code to session layer
> - enabled allocation aio data buffers for compression
> - enabled trace compression for serial trace streaming
>
> ---
> [1] https://github.com/facebook/zstd
>
> ---
> OVERHEAD MEASUREMENTS:
>
> uname -a
> Linux localhost 4.20.7-200.fc29.x86_64 #1 SMP Wed Feb 6 19:16:42 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux
>
> cat /proc/cpuinfo
> processor : 7
> vendor_id : GenuineIntel
> cpu family : 6
> model : 94
> model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> stepping : 3
> microcode : 0xc6
> cpu MHz : 4021.884
> cache size : 8192 KB
> physical id : 0
> siblings : 8
> core id : 3
> cpu cores : 4
> apicid : 7
> initial apicid : 7
> fpu : yes
> fpu_exception : yes
> cpuid level : 22
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl
> xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64
> monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2
> x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
> 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow
> vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2
> erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec
> xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
> flush_l1d
> bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> bogomips : 8016.00
> clflush size : 64
> cache_alignment : 64
> address sizes : 39 bits physical, 48 bits virtual
> power management:
>
> -----------------------------------------------------------------
> #!/bin/bash -xv
>
> echo 0 > /proc/sys/kernel/perf_event_paranoid
> + echo 0
> cat /proc/sys/kernel/perf_event_paranoid
> + cat /proc/sys/kernel/perf_event_paranoid
> 0
>
> echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> + echo performance
> + tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
> performance
>
> for i in 0 1 2 3 5 8
> do
> /usr/bin/time tools/perf/perf record -z $i -F 25000 -N -B -T -R -e cycles
> -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 0 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fe36de5c010
> Offs of buf1 = 0x7fe36de5c180
> Addr of buf2 = 0x7fe36be5b010
> Offs of buf2 = 0x7fe36be5b1c0
> Addr of buf3 = 0x7fe369e5a010
> Offs of buf3 = 0x7fe369e5a100
> Addr of buf4 = 0x7fe367e59010
> Offs of buf4 = 0x7fe367e59140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 16.949 seconds
> [ perf record: Woken up 309 times to write data ]
> [ perf record: Captured and wrote 179.424 MB perf.data ]
> 133.67user 0.35system 0:17.08elapsed 784%CPU (0avgtext+0avgdata
> 100580maxresident)k
> 0inputs+367480outputs (0major+34737minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 1 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fcaec334010
> Offs of buf1 = 0x7fcaec334180
> Addr of buf2 = 0x7fcaea333010
> Offs of buf2 = 0x7fcaea3331c0
> Addr of buf3 = 0x7fcae8332010
> Offs of buf3 = 0x7fcae8332100
> Addr of buf4 = 0x7fcae6331010
> Offs of buf4 = 0x7fcae6331140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.608 seconds
> [ perf record: Woken up 595 times to write data ]
> [ perf record: Compressed 181.148 MB to 21.497 MB, ratio is 8.427 ]
> [ perf record: Captured and wrote 21.527 MB perf.data ]
> 135.69user 0.24system 0:17.73elapsed 766%CPU (0avgtext+0avgdata
> 100500maxresident)k
> 0inputs+44112outputs (0major+35033minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 2 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f1336f8d010
> Offs of buf1 = 0x7f1336f8d180
> Addr of buf2 = 0x7f1334f8c010
> Offs of buf2 = 0x7f1334f8c1c0
> Addr of buf3 = 0x7f1332f8b010
> Offs of buf3 = 0x7f1332f8b100
> Addr of buf4 = 0x7f1330f8a010
> Offs of buf4 = 0x7f1330f8a140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.175 seconds
> [ perf record: Woken up 521 times to write data ]
> [ perf record: Compressed 186.953 MB to 23.210 MB, ratio is 8.055 ]
> [ perf record: Captured and wrote 23.239 MB perf.data ]
> 140.21user 0.25system 0:18.32elapsed 766%CPU (0avgtext+0avgdata
> 100560maxresident)k
> 0inputs+47608outputs (0major+35263minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 3 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f97060e3010
> Offs of buf1 = 0x7f97060e3180
> Addr of buf2 = 0x7f97040e2010
> Offs of buf2 = 0x7f97040e21c0
> Addr of buf3 = 0x7f97020e1010
> Offs of buf3 = 0x7f97020e1100
> Addr of buf4 = 0x7f97000e0010
> Offs of buf4 = 0x7f97000e0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.688 seconds
> [ perf record: Woken up 485 times to write data ]
> [ perf record: Compressed 181.908 MB to 21.962 MB, ratio is 8.283 ]
> [ perf record: Captured and wrote 21.991 MB perf.data ]
> 136.87user 0.23system 0:17.81elapsed 769%CPU (0avgtext+0avgdata
> 100616maxresident)k
> 0inputs+45064outputs (0major+35773minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 5 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f477b444010
> Offs of buf1 = 0x7f477b444180
> Addr of buf2 = 0x7f4779443010
> Offs of buf2 = 0x7f47794431c0
> Addr of buf3 = 0x7f4777442010
> Offs of buf3 = 0x7f4777442100
> Addr of buf4 = 0x7f4775441010
> Offs of buf4 = 0x7f4775441140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.406 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 187.705 MB to 23.170 MB, ratio is 8.101 ]
> [ perf record: Captured and wrote 23.200 MB perf.data ]
> 142.72user 0.26system 0:18.53elapsed 771%CPU (0avgtext+0avgdata
> 100520maxresident)k
> 0inputs+47528outputs (0major+36928minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 8 -F 25000 -N -B -T -R -e cycles --
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fb5bf032010
> Offs of buf1 = 0x7fb5bf032180
> Addr of buf2 = 0x7fb5bd031010
> Offs of buf2 = 0x7fb5bd0311c0
> Addr of buf3 = 0x7fb5bb030010
> Offs of buf3 = 0x7fb5bb030100
> Addr of buf4 = 0x7fb5b902f010
> Offs of buf4 = 0x7fb5b902f140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.751 seconds
> [ perf record: Woken up 391 times to write data ]
> [ perf record: Compressed 179.191 MB to 19.441 MB, ratio is 9.217 ]
> [ perf record: Captured and wrote 19.502 MB perf.data ]
> 138.90user 0.29system 0:17.88elapsed 778%CPU (0avgtext+0avgdata
> 100612maxresident)k
> 0inputs+39968outputs (0major+37436minor)pagefaults 0swaps
>
> for i in 0 1 2 3 5 8
> do
> /usr/bin/time tools/perf/perf record --aio=1 -z $i -F 25000 -N -B -T -R
> -e cycles -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 0 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7feee4519010
> Offs of buf1 = 0x7feee4519180
> Addr of buf2 = 0x7feee2518010
> Offs of buf2 = 0x7feee25181c0
> Addr of buf3 = 0x7feee0517010
> Offs of buf3 = 0x7feee0517100
> Addr of buf4 = 0x7feede516010
> Offs of buf4 = 0x7feede516140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.912 seconds
> [ perf record: Woken up 390 times to write data ]
> [ perf record: Captured and wrote 187.527 MB perf.data ]
> 139.70user 0.39system 0:18.04elapsed 776%CPU (0avgtext+0avgdata
> 100624maxresident)k
> 0inputs+384072outputs (0major+35257minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 1 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f72b93ac010
> Offs of buf1 = 0x7f72b93ac180
> Addr of buf2 = 0x7f72b73ab010
> Offs of buf2 = 0x7f72b73ab1c0
> Addr of buf3 = 0x7f72b53aa010
> Offs of buf3 = 0x7f72b53aa100
> Addr of buf4 = 0x7f72b33a9010
> Offs of buf4 = 0x7f72b33a9140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.198 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 188.562 MB to 22.252 MB, ratio is 8.474 ]
> [ perf record: Captured and wrote 22.284 MB perf.data ]
> 141.12user 0.32system 0:18.32elapsed 771%CPU (0avgtext+0avgdata
> 100576maxresident)k
> 0inputs+45664outputs (0major+35040minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 2 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7ffb9caf3010
> Offs of buf1 = 0x7ffb9caf3180
> Addr of buf2 = 0x7ffb9aaf2010
> Offs of buf2 = 0x7ffb9aaf21c0
> Addr of buf3 = 0x7ffb98af1010
> Offs of buf3 = 0x7ffb98af1100
> Addr of buf4 = 0x7ffb96af0010
> Offs of buf4 = 0x7ffb96af0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.360 seconds
> [ perf record: Woken up 442 times to write data ]
> [ perf record: Compressed 191.773 MB to 24.238 MB, ratio is 7.912 ]
> [ perf record: Captured and wrote 24.290 MB perf.data ]
> 143.76user 0.49system 0:18.50elapsed 779%CPU (0avgtext+0avgdata
> 100596maxresident)k
> 0inputs+49760outputs (0major+35276minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 3 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f13f2df2010
> Offs of buf1 = 0x7f13f2df2180
> Addr of buf2 = 0x7f13f0df1010
> Offs of buf2 = 0x7f13f0df11c0
> Addr of buf3 = 0x7f13eedf0010
> Offs of buf3 = 0x7f13eedf0100
> Addr of buf4 = 0x7f13ecdef010
> Offs of buf4 = 0x7f13ecdef140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.383 seconds
> [ perf record: Woken up 499 times to write data ]
> [ perf record: Compressed 191.078 MB to 23.246 MB, ratio is 8.220 ]
> [ perf record: Captured and wrote 23.282 MB perf.data ]
> 143.72user 0.34system 0:18.51elapsed 778%CPU (0avgtext+0avgdata
> 100616maxresident)k
> 0inputs+47704outputs (0major+35783minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 5 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fca0d091010
> Offs of buf1 = 0x7fca0d091180
> Addr of buf2 = 0x7fca0b090010
> Offs of buf2 = 0x7fca0b0901c0
> Addr of buf3 = 0x7fca0908f010
> Offs of buf3 = 0x7fca0908f100
> Addr of buf4 = 0x7fca0708e010
> Offs of buf4 = 0x7fca0708e140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.758 seconds
> [ perf record: Woken up 535 times to write data ]
> [ perf record: Compressed 190.065 MB to 24.430 MB, ratio is 7.780 ]
> [ perf record: Captured and wrote 24.519 MB perf.data ]
> 144.62user 0.66system 0:18.88elapsed 769%CPU (0avgtext+0avgdata
> 100528maxresident)k
> 0inputs+50232outputs (0major+36942minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 8 -F 25000 -N -B -T -R -e
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f7e1f449010
> Offs of buf1 = 0x7f7e1f449180
> Addr of buf2 = 0x7f7e1d448010
> Offs of buf2 = 0x7f7e1d4481c0
> Addr of buf3 = 0x7f7e1b447010
> Offs of buf3 = 0x7f7e1b447100
> Addr of buf4 = 0x7f7e19446010
> Offs of buf4 = 0x7f7e19446140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 20.103 seconds
> [ perf record: Woken up 260 times to write data ]
> [ perf record: Compressed 193.024 MB to 31.588 MB, ratio is 6.111 ]
> [ perf record: Captured and wrote 32.139 MB perf.data ]
> 151.73user 4.21system 0:20.23elapsed 770%CPU (0avgtext+0avgdata
> 100616maxresident)k
> 0inputs+65848outputs (0major+37431minor)pagefaults 0swaps
>
> ---
> TESTING:
>
> $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
> dwarf: [ on ] # HAVE_DWARF_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
> glibc: [ on ] # HAVE_GLIBC_SUPPORT
> gtk2: [ on ] # HAVE_GTK2_SUPPORT
> syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
> libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> aio: [ OFF ] # HAVE_AIO_SUPPORT
> zstd: [ OFF ] # HAVE_ZSTD_SUPPORT
>
> $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
> dwarf: [ on ] # HAVE_DWARF_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_DWARF_GETLOCATIONS_SUPPORT
> glibc: [ on ] # HAVE_GLIBC_SUPPORT
> gtk2: [ on ] # HAVE_GTK2_SUPPORT
> syscall_table: [ on ] # HAVE_SYSCALL_TABLE_SUPPORT
> libbfd: [ on ] # HAVE_LIBBFD_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libcrypto: [ on ] # HAVE_LIBCRYPTO_SUPPORT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_DWARF_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> get_cpuid: [ on ] # HAVE_AUXTRACE_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> aio: [ on ] # HAVE_AIO_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
>
> $ make -C tools/perf clean all
> ...
> $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
> 1: vmlinux symtab matches kallsyms : Skip
> 2: Detect openat syscall event : Ok
> 3: Detect openat syscall event on all cpus : Ok
> 4: Read samples using the mmap interface : Ok
> 5: Test data source output : Ok
> 6: Parse event definition strings : Ok
> 7: Simple expression parser : Ok
> 8: PERF_RECORD_* events & perf_sample fields : Ok
> 9: Parse perf pmu format : Ok
> 10: DSO data read : Ok
> 11: DSO data cache : Ok
> 12: DSO data reopen : Ok
> 13: Roundtrip evsel->name : Ok
> 14: Parse sched tracepoints fields : Ok
> 15: syscalls:sys_enter_openat event fields : Ok
> 16: Setup struct perf_event_attr : Ok
> 17: Match and link multiple hists : Ok
> 18: 'import perf' in python : Ok
> 19: Breakpoint overflow signal handler : Ok
> 20: Breakpoint overflow sampling : Ok
> 21: Breakpoint accounting : Ok
> 22: Watchpoint :
> 22.1: Read Only Watchpoint : Skip
> 22.2: Write Only Watchpoint : Ok
> 22.3: Read / Write Watchpoint : Ok
> 22.4: Modify Watchpoint : Ok
> 23: Number of exit events of a simple workload : Ok
> 24: Software clock events period values : Ok
> 25: Object code reading : Ok
> 26: Sample parsing : Ok
> 27: Use a dummy software event to keep tracking : Ok
> 28: Parse with no sample_id_all bit set : Ok
> 29: Filter hist entries : Ok
> 30: Lookup mmap thread : Ok
> 31: Share thread mg : Ok
> 32: Sort output of hist entries : Ok
> 33: Cumulate child hist entries : Ok
> 34: Track with sched_switch : Ok
> 35: Filter fds with revents mask in a fdarray : Ok
> 36: Add fd to a fdarray, making it autogrow : Ok
> 37: kmod_path__parse : Ok
> 38: Thread map : Ok
> 39: LLVM search and compile :
> 39.1: Basic BPF llvm compile : Skip
> 39.2: kbuild searching : Skip
> 39.3: Compile source for BPF prologue generation : Skip
> 39.4: Compile source for BPF relocation : Skip
> 40: Session topology : Ok
> 41: BPF filter :
> 41.1: Basic BPF filtering : Skip
> 41.2: BPF pinning : Skip
> 41.3: BPF prologue generation : Skip
> 41.4: BPF relocation checker : Skip
> 42: Synthesize thread map : Ok
> 43: Remove thread map : Ok
> 44: Synthesize cpu map : Ok
> 45: Synthesize stat config : Ok
> 46: Synthesize stat : Ok
> 47: Synthesize stat round : Ok
> 48: Synthesize attr update : Ok
> 49: Event times : Ok
> 50: Read backward ring buffer : Ok
> 51: Print cpu map : Ok
> 52: Probe SDT events : Ok
> 53: is_printable_array : Ok
> 54: Print bitmap : Ok
> 55: perf hooks : Ok
> 56: builtin clang support : Skip (not
> compiled in)
> 57: unit_number__scnprintf : Ok
> 58: mem2node : Ok
> 59: x86 rdpmc : Ok
> 60: Convert perf time to TSC : Ok
> 61: DWARF unwind : Ok
> 62: x86 instruction decoder - new instructions : Ok
> 63: x86 bp modify : Ok
> 64: Check open filename arg using perf trace + vfs_getname: Skip
> 65: Add vfs_getname probe to get syscall args filenames : Skip
> 66: probe libc's inet_pton & backtrace it with ping : Ok
> 67: Use vfs_getname probe to get syscall args filenames : Skip
> 68: record trace Zstd compression/decompression : Ok
> ~/abudanko/kernel/acme
>
> $ make -C tools/perf NO_LIBZSTD=1 clean all
> ...
> $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
> ...
> 68: record trace Zstd compression/decompression : Skip
> ~/abudanko/kernel/acme
>