Re: [PATCH v10 00/12] perf: enable compression of record mode trace to save storage space

Alexey Budankov Thu, 28 Mar 2019 02:29:14 -0700

Hi,

This is a gentle reminder regarding the patch set below.


Thanks,
Alexey

On 18.03.2019 20:36, Alexey Budankov wrote:
> 
> The patch set implements runtime trace compression (-z option) in 
> record mode and trace auto decompression in report and inject modes. 
> Streaming Zstd API [1] is used for compression and decompression of
> data that come from kernel mmaped data buffers.
> 
> The new -z,--compression_level=n option reduces trace file size by
> ~3-5x on average across a variety of tested workloads, which saves
> storage space on larger server systems where the trace file size can
> easily reach several tens or even hundreds of GiBs, especially when
> profiling with dwarf-based stacks and tracing of context switches.
> The default option value is 1 (fastest compression).
> 
> The new --mmap-flush option specifies the minimal size of a data
> chunk that is extracted from an mmaped kernel buffer and stored into
> a trace. The option is independent of the -z setting and doesn't
> vary with the compression level. The default value is 1 byte, which
> means that every time the trace writing thread finds some new data in
> an mmaped buffer, the data is extracted, possibly compressed, and
> written to the trace. The option serves two purposes: the first is to
> increase the compression ratio of trace data, and the second is to
> avoid live-lock when the tool monitors its own process in system wide
> (-a) profiling mode. Profiling in system wide mode with compression
> (-a -z) can additionally induce data into the kernel buffers along
> with the data from monitored processes. If the performance data rate
> and volume from the monitored processes are high, then trace
> streaming and compression activity in the tool is also high. This can
> lead to a subtle live-lock effect of endless activity, when
> compression of a single new byte from one of the mmaped kernel
> buffers induces the next single byte in some mmaped buffer, so the
> perf tool thread never blocks on polling the event file descriptors.
> Varying the size of the data chunk extracted from the mmap buffers
> avoids this self-monitoring live-lock in system wide mode and makes
> the mmap buffers polling loop manageable. Possible usage examples:
> 
>   $ tools/perf/perf record -z -e cycles -- matrix.gcc
>   $ tools/perf/perf record --aio -z -e cycles -- matrix.gcc
>   $ tools/perf/perf record -z --mmap-flush 1024 -e cycles -- matrix.gcc
>   $ tools/perf/perf record --aio -z --mmap-flush 1K -e cycles -- matrix.gcc
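
A minimal sketch of the threshold idea described above, for illustration only. The helper names `parse_mmap_flush` and `should_drain` are hypothetical and are not taken from the patches, and the 1024-based meaning of the B/K/M/G suffixes is an assumption. It shows how a value such as `1K` could be turned into a byte count and how a reader loop could skip a buffer until enough data is pending.

```python
# Hypothetical helper names; a sketch of the idea, not perf's C implementation.
# Assumption: suffixes are 1024-based (B = bytes, K = KiB, M = MiB, G = GiB).
_UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_mmap_flush(text):
    """Parse a value such as '1024' or '1K' into a byte count."""
    text = text.strip()
    if not text:
        raise ValueError("empty value")
    suffix = text[-1].upper()
    if suffix in _UNITS:
        number, unit = text[:-1], _UNITS[suffix]
    else:
        number, unit = text, 1
    value = int(number) * unit  # int() raises ValueError on malformed input
    if value < 1:
        raise ValueError("flush threshold must be at least 1 byte")
    return value


def should_drain(available_bytes, flush_threshold):
    """Extract data only once at least flush_threshold bytes are pending."""
    return available_bytes >= flush_threshold
```

With the default threshold of 1 byte every new byte triggers extraction, while a larger threshold lets a single stray byte wait until more data accumulates, which is what breaks the self-feeding loop.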
> 
> Runtime compression overhead has been measured for serial and AIO 
> trace writing modes when profiling matrix multiplication workload:
> 
>       -----------------------------------------------------------------
>       | SERIAL                        | AIO-1                         |
>   ----|-------------------------------|-------------------------------|
>   |-z | OVH(x) | ratio(x) | size(MiB) | OVH(x) | ratio(x) | size(MiB) |
>   |---|--------|----------|-----------|--------|----------|-----------|
>   | 0 | 1.00   | 1.000    | 179.424   | 1.00   | 1.000    | 187.527   |
>   | 1 | 1.04   | 8.427    | 181.148   | 1.01   | 8.474    | 188.562   |
>   | 2 | 1.07   | 8.055    | 186.953   | 1.03   | 7.912    | 191.773   |
>   | 3 | 1.04   | 8.283    | 181.908   | 1.03   | 8.220    | 191.078   |
>   | 5 | 1.09   | 8.101    | 187.705   | 1.05   | 7.780    | 190.065   |
>   | 8 | 1.05   | 9.217    | 179.191   | 1.12   | 6.111    | 193.024   |
>   ---------------------------------------------------------------------
> 
>   OVH = (Execution time with -z N) / (Execution time with -z 0)
> 
>   ratio - compression ratio
>   size  - amount of data that was compressed, in MiB
> 
>   size ~= trace file size x ratio
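
As a worked example of the two formulas above, the sketch below plugs in the figures logged for the serial `-z 1` run in the measurement section (execution times 17.608 s and 16.949 s, a 21.527 MB perf.data file, ratio 8.427). The function names are illustrative only.

```python
def overhead(time_with_z, time_baseline):
    """OVH = (execution time with -z N) / (execution time with -z 0)."""
    return time_with_z / time_baseline


def approx_uncompressed_size(trace_file_mb, ratio):
    """size ~= trace file size x ratio (approximate: the trace file also
    holds some data that is not part of the compressed payload)."""
    return trace_file_mb * ratio


# Figures from the serial -z 1 and -z 0 runs in the measurement log.
ovh = overhead(17.608, 16.949)
size = approx_uncompressed_size(21.527, 8.427)

print(f"OVH  = {ovh:.2f}x")
print(f"size ~= {size:.1f} MB (perf reported 181.148 MB compressed)")
```

The estimate lands within about 0.2% of the reported value, which is why the relation is written with `~=` rather than `=`.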
> 
> See complete description of measurement conditions with details below.
> 
> Introduced compression functionality can be disabled or configured from 
> the command line using NO_LIBZSTD and LIBZSTD_DIR defines:
> 
>   $ make -C tools/perf NO_LIBZSTD=1 clean all
>   $ make -C tools/perf LIBZSTD_DIR=/path/to/zstd/sources/ clean all
> 
> If your system has some version of the zstd package preinstalled, then
> the build system finds and uses it during the build. The auto detection
> feature status is reported just before compilation starts, as usual.
> If you prefer to compile against some other version of zstd, you can
> point the build at that version using the LIBZSTD_DIR define.
> 
> See 'perf test' results below for enabled and disabled (NO_LIBZSTD=1)
> feature configurations.
> 
> ---
> Alexey Budankov (12):
>   feature: implement libzstd check, LIBZSTD_DIR and NO_LIBZSTD defines
>   perf record: implement --mmap-flush=<number> option
>   perf session: define bytes_transferred and bytes_compressed metrics
>   perf record: implement COMPRESSED event record and its attributes
>   perf mmap: implement dedicated memory buffer for data compression
>   perf util: introduce Zstd streaming based compression API
>   perf record: implement compression for serial trace streaming
>   perf record: implement compression for AIO trace streaming
>   perf record: implement -z,--compression_level[=<n>] option
>   perf report: implement record trace decompression
>   perf inject: enable COMPRESSED records decompression
>   perf tests: implement Zstd comp/decomp integration test
> 
>  tools/build/Makefile.feature                  |   6 +-
>  tools/build/feature/Makefile                  |   6 +-
>  tools/build/feature/test-all.c                |   5 +
>  tools/build/feature/test-libzstd.c            |  12 +
>  tools/perf/Documentation/perf-record.txt      |  17 ++
>  .../Documentation/perf.data-file-format.txt   |  24 ++
>  tools/perf/Makefile.config                    |  20 ++
>  tools/perf/Makefile.perf                      |   3 +
>  tools/perf/builtin-inject.c                   |   4 +
>  tools/perf/builtin-record.c                   | 285 +++++++++++++++---
>  tools/perf/builtin-report.c                   |   5 +-
>  tools/perf/builtin-version.c                  |   2 +
>  tools/perf/perf.h                             |   2 +
>  .../tests/shell/record+zstd_comp_decomp.sh    |  35 +++
>  tools/perf/util/Build                         |   2 +
>  tools/perf/util/compress.h                    |  54 ++++
>  tools/perf/util/env.h                         |  11 +
>  tools/perf/util/event.c                       |   1 +
>  tools/perf/util/event.h                       |   7 +
>  tools/perf/util/evlist.c                      |   8 +-
>  tools/perf/util/evlist.h                      |   3 +-
>  tools/perf/util/header.c                      |  55 +++-
>  tools/perf/util/header.h                      |   1 +
>  tools/perf/util/mmap.c                        | 106 ++-----
>  tools/perf/util/mmap.h                        |  17 +-
>  tools/perf/util/session.c                     | 129 +++++++-
>  tools/perf/util/session.h                     |  14 +
>  tools/perf/util/tool.h                        |   2 +
>  tools/perf/util/zstd.c                        | 111 +++++++
>  29 files changed, 813 insertions(+), 134 deletions(-)
>  create mode 100644 tools/build/feature/test-libzstd.c
>  create mode 100755 tools/perf/tests/shell/record+zstd_comp_decomp.sh
>  create mode 100644 tools/perf/util/zstd.c
> 
> ---
> Changes in v10:
> - separated decomp list deallocation into perf_session__release_decomp_events
> - extended the test with suggested decompression validation
> 
> Changes in v9:
> - fixed issue with improper max COMPRESSED record size calculation
> - moved up calculation of ratio variable in 03/12
> - made minor corrections in changelogs
> - corrected several checkpatch.pl warnings and errors
> 
> Changes in v8:
> - avoid using -f for --mmap-flush option
> - move stubs to compress.h and avoid unconditional compiling of zstd.c
> - fixed silent interruption for perf record collection
> - implemented -z 1 as default
> 
> Changes in v7:
> - rebased to Arnaldo's perf/core tip
> - implemented B/K/M/G suffixes for -f option
> - reworked record__mmap_read_evlist() to replace perf_mmap__aio_push()
>   by perf_mmap__push() in AIO case
> - extended "[ perf record: Captured ... ]" message with compression statistics
> - extended changelog for v5 06/10
> - used PERF_SAMPLE_MAX_SIZE for compressed record size calculations
> - renamed record__zstd_compress to zstd_compress and
>   record__process_comp_header to process_comp_header
> - separated nr_cblocks_max applying
> 
> Changes in v6:
> - extended docs with description of PERF_RECORD_COMPRESSED record and 
>   HEADER_COMPRESSED feature layouts
> 
> Changes in v5:
> - implemented perf version --build-options extension for aio and zstd - see
>   TESTING below
> - adjusted commit message and perf-record.txt content for -f option
> - fixed build errors in case of NO_AIO=1 and NO_LIBZSTD=1
> 
> Changes in v4:
> - implemented integration tests
> - adjusted zstd_ stub functions
> - rebased on tip of Arnaldo's perf/core
> 
> Changes in v3:
> - moved -f,--mmap-flush option implementation into a separate patch
> - moved definition and printing of bytes_transferred and bytes_compressed
>   into a separate patch
> - moved COMPRESSED feature into a separate patch
> - added versioning and stored COMPRESSED feature attributes as u32
> - implemented dedicated memory buffer for compression in case of serial
>   streaming
> - moved low level Zstd based compression functions into
>   util/{compress.h,zstd.c}
> - made compress function to be a param of __push(), __aio_push() functions
> - enabled perf inject to decompress COMPRESSED records
> - measured compression overhead for serial and AIO streaming using 
>   basic matrix multiplication workload on 8 core skylake
> 
> Changes in v2:
> - moved compression/decompression code to session layer
> - enabled allocation aio data buffers for compression
> - enabled trace compression for serial trace streaming
> 
> ---
> [1] https://github.com/facebook/zstd
> 
> ---
> OVERHEAD MEASUREMENTS:
> 
> uname -a
> Linux localhost 4.20.7-200.fc29.x86_64 #1 SMP Wed Feb 6 19:16:42 UTC 2019 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> cat /proc/cpuinfo
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 94
> model name      : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> stepping        : 3
> microcode       : 0xc6
> cpu MHz         : 4021.884
> cache size      : 8192 KB
> physical id     : 0
> siblings        : 8
> core id         : 3
> cpu cores       : 4
> apicid          : 7
> initial apicid  : 7
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 22
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
> pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl 
> xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 
> monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 
> x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 
> 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow 
> vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 
> erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec 
> xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp 
> flush_l1d
> bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
> bogomips        : 8016.00
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 39 bits physical, 48 bits virtual
> power management:
> 
> -----------------------------------------------------------------
> #!/bin/bash -xv
> 
> echo 0 > /proc/sys/kernel/perf_event_paranoid
> + echo 0
> cat /proc/sys/kernel/perf_event_paranoid
> + cat /proc/sys/kernel/perf_event_paranoid
> 0
> 
> echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> + echo performance
> + tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu4/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu5/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu6/cpufreq/scaling_governor 
> /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor
> performance
> 
> for i in 0 1 2 3 5 8
> do
>     /usr/bin/time tools/perf/perf record -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 0 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fe36de5c010
> Offs of buf1 = 0x7fe36de5c180
> Addr of buf2 = 0x7fe36be5b010
> Offs of buf2 = 0x7fe36be5b1c0
> Addr of buf3 = 0x7fe369e5a010
> Offs of buf3 = 0x7fe369e5a100
> Addr of buf4 = 0x7fe367e59010
> Offs of buf4 = 0x7fe367e59140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 16.949 seconds
> [ perf record: Woken up 309 times to write data ]
> [ perf record: Captured and wrote 179.424 MB perf.data ]
> 133.67user 0.35system 0:17.08elapsed 784%CPU (0avgtext+0avgdata 
> 100580maxresident)k
> 0inputs+367480outputs (0major+34737minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 1 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fcaec334010
> Offs of buf1 = 0x7fcaec334180
> Addr of buf2 = 0x7fcaea333010
> Offs of buf2 = 0x7fcaea3331c0
> Addr of buf3 = 0x7fcae8332010
> Offs of buf3 = 0x7fcae8332100
> Addr of buf4 = 0x7fcae6331010
> Offs of buf4 = 0x7fcae6331140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.608 seconds
> [ perf record: Woken up 595 times to write data ]
> [ perf record: Compressed 181.148 MB to 21.497 MB, ratio is 8.427 ]
> [ perf record: Captured and wrote 21.527 MB perf.data ]
> 135.69user 0.24system 0:17.73elapsed 766%CPU (0avgtext+0avgdata 
> 100500maxresident)k
> 0inputs+44112outputs (0major+35033minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 2 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f1336f8d010
> Offs of buf1 = 0x7f1336f8d180
> Addr of buf2 = 0x7f1334f8c010
> Offs of buf2 = 0x7f1334f8c1c0
> Addr of buf3 = 0x7f1332f8b010
> Offs of buf3 = 0x7f1332f8b100
> Addr of buf4 = 0x7f1330f8a010
> Offs of buf4 = 0x7f1330f8a140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.175 seconds
> [ perf record: Woken up 521 times to write data ]
> [ perf record: Compressed 186.953 MB to 23.210 MB, ratio is 8.055 ]
> [ perf record: Captured and wrote 23.239 MB perf.data ]
> 140.21user 0.25system 0:18.32elapsed 766%CPU (0avgtext+0avgdata 
> 100560maxresident)k
> 0inputs+47608outputs (0major+35263minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 3 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f97060e3010
> Offs of buf1 = 0x7f97060e3180
> Addr of buf2 = 0x7f97040e2010
> Offs of buf2 = 0x7f97040e21c0
> Addr of buf3 = 0x7f97020e1010
> Offs of buf3 = 0x7f97020e1100
> Addr of buf4 = 0x7f97000e0010
> Offs of buf4 = 0x7f97000e0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.688 seconds
> [ perf record: Woken up 485 times to write data ]
> [ perf record: Compressed 181.908 MB to 21.962 MB, ratio is 8.283 ]
> [ perf record: Captured and wrote 21.991 MB perf.data ]
> 136.87user 0.23system 0:17.81elapsed 769%CPU (0avgtext+0avgdata 
> 100616maxresident)k
> 0inputs+45064outputs (0major+35773minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 5 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f477b444010
> Offs of buf1 = 0x7f477b444180
> Addr of buf2 = 0x7f4779443010
> Offs of buf2 = 0x7f47794431c0
> Addr of buf3 = 0x7f4777442010
> Offs of buf3 = 0x7f4777442100
> Addr of buf4 = 0x7f4775441010
> Offs of buf4 = 0x7f4775441140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.406 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 187.705 MB to 23.170 MB, ratio is 8.101 ]
> [ perf record: Captured and wrote 23.200 MB perf.data ]
> 142.72user 0.26system 0:18.53elapsed 771%CPU (0avgtext+0avgdata 
> 100520maxresident)k
> 0inputs+47528outputs (0major+36928minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record -z 8 -F 25000 -N -B -T -R -e cycles -- 
> ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fb5bf032010
> Offs of buf1 = 0x7fb5bf032180
> Addr of buf2 = 0x7fb5bd031010
> Offs of buf2 = 0x7fb5bd0311c0
> Addr of buf3 = 0x7fb5bb030010
> Offs of buf3 = 0x7fb5bb030100
> Addr of buf4 = 0x7fb5b902f010
> Offs of buf4 = 0x7fb5b902f140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.751 seconds
> [ perf record: Woken up 391 times to write data ]
> [ perf record: Compressed 179.191 MB to 19.441 MB, ratio is 9.217 ]
> [ perf record: Captured and wrote 19.502 MB perf.data ]
> 138.90user 0.29system 0:17.88elapsed 778%CPU (0avgtext+0avgdata 
> 100612maxresident)k
> 0inputs+39968outputs (0major+37436minor)pagefaults 0swaps
> 
> for i in 0 1 2 3 5 8
> do
>     /usr/bin/time tools/perf/perf record --aio=1 -z $i -F 25000 -N -B -T -R -e cycles -- ../../matrix/linux/matrix.gcc
> done
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 0 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7feee4519010
> Offs of buf1 = 0x7feee4519180
> Addr of buf2 = 0x7feee2518010
> Offs of buf2 = 0x7feee25181c0
> Addr of buf3 = 0x7feee0517010
> Offs of buf3 = 0x7feee0517100
> Addr of buf4 = 0x7feede516010
> Offs of buf4 = 0x7feede516140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 17.912 seconds
> [ perf record: Woken up 390 times to write data ]
> [ perf record: Captured and wrote 187.527 MB perf.data ]
> 139.70user 0.39system 0:18.04elapsed 776%CPU (0avgtext+0avgdata 
> 100624maxresident)k
> 0inputs+384072outputs (0major+35257minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 1 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f72b93ac010
> Offs of buf1 = 0x7f72b93ac180
> Addr of buf2 = 0x7f72b73ab010
> Offs of buf2 = 0x7f72b73ab1c0
> Addr of buf3 = 0x7f72b53aa010
> Offs of buf3 = 0x7f72b53aa100
> Addr of buf4 = 0x7f72b33a9010
> Offs of buf4 = 0x7f72b33a9140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.198 seconds
> [ perf record: Woken up 416 times to write data ]
> [ perf record: Compressed 188.562 MB to 22.252 MB, ratio is 8.474 ]
> [ perf record: Captured and wrote 22.284 MB perf.data ]
> 141.12user 0.32system 0:18.32elapsed 771%CPU (0avgtext+0avgdata 
> 100576maxresident)k
> 0inputs+45664outputs (0major+35040minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 2 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7ffb9caf3010
> Offs of buf1 = 0x7ffb9caf3180
> Addr of buf2 = 0x7ffb9aaf2010
> Offs of buf2 = 0x7ffb9aaf21c0
> Addr of buf3 = 0x7ffb98af1010
> Offs of buf3 = 0x7ffb98af1100
> Addr of buf4 = 0x7ffb96af0010
> Offs of buf4 = 0x7ffb96af0140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.360 seconds
> [ perf record: Woken up 442 times to write data ]
> [ perf record: Compressed 191.773 MB to 24.238 MB, ratio is 7.912 ]
> [ perf record: Captured and wrote 24.290 MB perf.data ]
> 143.76user 0.49system 0:18.50elapsed 779%CPU (0avgtext+0avgdata 
> 100596maxresident)k
> 0inputs+49760outputs (0major+35276minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 3 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f13f2df2010
> Offs of buf1 = 0x7f13f2df2180
> Addr of buf2 = 0x7f13f0df1010
> Offs of buf2 = 0x7f13f0df11c0
> Addr of buf3 = 0x7f13eedf0010
> Offs of buf3 = 0x7f13eedf0100
> Addr of buf4 = 0x7f13ecdef010
> Offs of buf4 = 0x7f13ecdef140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.383 seconds
> [ perf record: Woken up 499 times to write data ]
> [ perf record: Compressed 191.078 MB to 23.246 MB, ratio is 8.220 ]
> [ perf record: Captured and wrote 23.282 MB perf.data ]
> 143.72user 0.34system 0:18.51elapsed 778%CPU (0avgtext+0avgdata 
> 100616maxresident)k
> 0inputs+47704outputs (0major+35783minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 5 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7fca0d091010
> Offs of buf1 = 0x7fca0d091180
> Addr of buf2 = 0x7fca0b090010
> Offs of buf2 = 0x7fca0b0901c0
> Addr of buf3 = 0x7fca0908f010
> Offs of buf3 = 0x7fca0908f100
> Addr of buf4 = 0x7fca0708e010
> Offs of buf4 = 0x7fca0708e140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 18.758 seconds
> [ perf record: Woken up 535 times to write data ]
> [ perf record: Compressed 190.065 MB to 24.430 MB, ratio is 7.780 ]
> [ perf record: Captured and wrote 24.519 MB perf.data ]
> 144.62user 0.66system 0:18.88elapsed 769%CPU (0avgtext+0avgdata 
> 100528maxresident)k
> 0inputs+50232outputs (0major+36942minor)pagefaults 0swaps
> + for i in 0 1 2 3 5 8
> + /usr/bin/time tools/perf/perf record --aio=1 -z 8 -F 25000 -N -B -T -R -e 
> cycles -- ../../matrix/linux/matrix.gcc
> Addr of buf1 = 0x7f7e1f449010
> Offs of buf1 = 0x7f7e1f449180
> Addr of buf2 = 0x7f7e1d448010
> Offs of buf2 = 0x7f7e1d4481c0
> Addr of buf3 = 0x7f7e1b447010
> Offs of buf3 = 0x7f7e1b447100
> Addr of buf4 = 0x7f7e19446010
> Offs of buf4 = 0x7f7e19446140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 20.103 seconds
> [ perf record: Woken up 260 times to write data ]
> [ perf record: Compressed 193.024 MB to 31.588 MB, ratio is 6.111 ]
> [ perf record: Captured and wrote 32.139 MB perf.data ]
> 151.73user 4.21system 0:20.23elapsed 770%CPU (0avgtext+0avgdata 
> 100616maxresident)k
> 0inputs+65848outputs (0major+37431minor)pagefaults 0swaps
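
For readers who want to tabulate runs like the ones above, here is a small sketch that extracts the figures from the `[ perf record: Compressed ... ]` summary line. The line format is taken from the log above. The function name and returned keys are made up for this note, and the format may differ in other perf versions.

```python
import re

# Matches lines such as:
# [ perf record: Compressed 181.148 MB to 21.497 MB, ratio is 8.427 ]
_COMPRESSED = re.compile(
    r"\[ perf record: Compressed (?P<raw>[\d.]+) MB to (?P<packed>[\d.]+) MB, "
    r"ratio is (?P<ratio>[\d.]+) \]"
)


def parse_compression_summary(line):
    """Return raw size, packed size and ratio as floats, or None if absent."""
    match = _COMPRESSED.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}
```

Dividing `raw` by `packed` should reproduce the printed ratio to within rounding, which gives a quick sanity check on collected logs.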
> 
> ---
> TESTING:
> 
>   $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
>                  dwarf: [ on  ]  # HAVE_DWARF_SUPPORT
>     dwarf_getlocations: [ on  ]  # HAVE_DWARF_GETLOCATIONS_SUPPORT
>                  glibc: [ on  ]  # HAVE_GLIBC_SUPPORT
>                   gtk2: [ on  ]  # HAVE_GTK2_SUPPORT
>          syscall_table: [ on  ]  # HAVE_SYSCALL_TABLE_SUPPORT
>                 libbfd: [ on  ]  # HAVE_LIBBFD_SUPPORT
>                 libelf: [ on  ]  # HAVE_LIBELF_SUPPORT
>                libnuma: [ on  ]  # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on  ]  # HAVE_LIBNUMA_SUPPORT
>                libperl: [ on  ]  # HAVE_LIBPERL_SUPPORT
>              libpython: [ on  ]  # HAVE_LIBPYTHON_SUPPORT
>               libslang: [ on  ]  # HAVE_SLANG_SUPPORT
>              libcrypto: [ on  ]  # HAVE_LIBCRYPTO_SUPPORT
>              libunwind: [ on  ]  # HAVE_LIBUNWIND_SUPPORT
>     libdw-dwarf-unwind: [ on  ]  # HAVE_DWARF_SUPPORT
>                   zlib: [ on  ]  # HAVE_ZLIB_SUPPORT
>                   lzma: [ on  ]  # HAVE_LZMA_SUPPORT
>              get_cpuid: [ on  ]  # HAVE_AUXTRACE_SUPPORT
>                    bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
>                    aio: [ OFF ]  # HAVE_AIO_SUPPORT
>                   zstd: [ OFF ]  # HAVE_ZSTD_SUPPORT
> 
>   $ tools/perf/perf version --build-options
> perf version 4.13.rc5.gd8d056b
>                  dwarf: [ on  ]  # HAVE_DWARF_SUPPORT
>     dwarf_getlocations: [ on  ]  # HAVE_DWARF_GETLOCATIONS_SUPPORT
>                  glibc: [ on  ]  # HAVE_GLIBC_SUPPORT
>                   gtk2: [ on  ]  # HAVE_GTK2_SUPPORT
>          syscall_table: [ on  ]  # HAVE_SYSCALL_TABLE_SUPPORT
>                 libbfd: [ on  ]  # HAVE_LIBBFD_SUPPORT
>                 libelf: [ on  ]  # HAVE_LIBELF_SUPPORT
>                libnuma: [ on  ]  # HAVE_LIBNUMA_SUPPORT
> numa_num_possible_cpus: [ on  ]  # HAVE_LIBNUMA_SUPPORT
>                libperl: [ on  ]  # HAVE_LIBPERL_SUPPORT
>              libpython: [ on  ]  # HAVE_LIBPYTHON_SUPPORT
>               libslang: [ on  ]  # HAVE_SLANG_SUPPORT
>              libcrypto: [ on  ]  # HAVE_LIBCRYPTO_SUPPORT
>              libunwind: [ on  ]  # HAVE_LIBUNWIND_SUPPORT
>     libdw-dwarf-unwind: [ on  ]  # HAVE_DWARF_SUPPORT
>                   zlib: [ on  ]  # HAVE_ZLIB_SUPPORT
>                   lzma: [ on  ]  # HAVE_LZMA_SUPPORT
>              get_cpuid: [ on  ]  # HAVE_AUXTRACE_SUPPORT
>                    bpf: [ on  ]  # HAVE_LIBBPF_SUPPORT
>                    aio: [ on  ]  # HAVE_AIO_SUPPORT
>                   zstd: [ on  ]  # HAVE_ZSTD_SUPPORT
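
The two listings above differ only in the aio and zstd lines. A script could check for that along the lines of the sketch below, which parses `perf version --build-options` output as shown in this mail. The function name is hypothetical and the line format is assumed to match the listings above.

```python
import re

# Matches lines such as: "   zstd: [ on  ]  # HAVE_ZSTD_SUPPORT"
_FEATURE = re.compile(r"^\s*(?P<name>[\w-]+):\s*\[\s*(?P<state>on|OFF)\s*\]")


def parse_build_options(text):
    """Map each feature name to True (on) or False (OFF)."""
    features = {}
    for line in text.splitlines():
        match = _FEATURE.match(line)
        if match:
            features[match.group("name")] = match.group("state") == "on"
    return features
```

A caller could then test `parse_build_options(output).get("zstd", False)` before relying on the -z option.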
> 
>   $ make -C tools/perf clean all
> ...
>   $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
>  1: vmlinux symtab matches kallsyms                       : Skip
>  2: Detect openat syscall event                           : Ok
>  3: Detect openat syscall event on all cpus               : Ok
>  4: Read samples using the mmap interface                 : Ok
>  5: Test data source output                               : Ok
>  6: Parse event definition strings                        : Ok
>  7: Simple expression parser                              : Ok
>  8: PERF_RECORD_* events & perf_sample fields             : Ok
>  9: Parse perf pmu format                                 : Ok
> 10: DSO data read                                         : Ok
> 11: DSO data cache                                        : Ok
> 12: DSO data reopen                                       : Ok
> 13: Roundtrip evsel->name                                 : Ok
> 14: Parse sched tracepoints fields                        : Ok
> 15: syscalls:sys_enter_openat event fields                : Ok
> 16: Setup struct perf_event_attr                          : Ok
> 17: Match and link multiple hists                         : Ok
> 18: 'import perf' in python                               : Ok
> 19: Breakpoint overflow signal handler                    : Ok
> 20: Breakpoint overflow sampling                          : Ok
> 21: Breakpoint accounting                                 : Ok
> 22: Watchpoint                                            :
> 22.1: Read Only Watchpoint                                : Skip
> 22.2: Write Only Watchpoint                               : Ok
> 22.3: Read / Write Watchpoint                             : Ok
> 22.4: Modify Watchpoint                                   : Ok
> 23: Number of exit events of a simple workload            : Ok
> 24: Software clock events period values                   : Ok
> 25: Object code reading                                   : Ok
> 26: Sample parsing                                        : Ok
> 27: Use a dummy software event to keep tracking           : Ok
> 28: Parse with no sample_id_all bit set                   : Ok
> 29: Filter hist entries                                   : Ok
> 30: Lookup mmap thread                                    : Ok
> 31: Share thread mg                                       : Ok
> 32: Sort output of hist entries                           : Ok
> 33: Cumulate child hist entries                           : Ok
> 34: Track with sched_switch                               : Ok
> 35: Filter fds with revents mask in a fdarray             : Ok
> 36: Add fd to a fdarray, making it autogrow               : Ok
> 37: kmod_path__parse                                      : Ok
> 38: Thread map                                            : Ok
> 39: LLVM search and compile                               :
> 39.1: Basic BPF llvm compile                              : Skip
> 39.2: kbuild searching                                    : Skip
> 39.3: Compile source for BPF prologue generation          : Skip
> 39.4: Compile source for BPF relocation                   : Skip
> 40: Session topology                                      : Ok
> 41: BPF filter                                            :
> 41.1: Basic BPF filtering                                 : Skip
> 41.2: BPF pinning                                         : Skip
> 41.3: BPF prologue generation                             : Skip
> 41.4: BPF relocation checker                              : Skip
> 42: Synthesize thread map                                 : Ok
> 43: Remove thread map                                     : Ok
> 44: Synthesize cpu map                                    : Ok
> 45: Synthesize stat config                                : Ok
> 46: Synthesize stat                                       : Ok
> 47: Synthesize stat round                                 : Ok
> 48: Synthesize attr update                                : Ok
> 49: Event times                                           : Ok
> 50: Read backward ring buffer                             : Ok
> 51: Print cpu map                                         : Ok
> 52: Probe SDT events                                      : Ok
> 53: is_printable_array                                    : Ok
> 54: Print bitmap                                          : Ok
> 55: perf hooks                                            : Ok
> 56: builtin clang support                                 : Skip (not compiled in)
> 57: unit_number__scnprintf                                : Ok
> 58: mem2node                                              : Ok
> 59: x86 rdpmc                                             : Ok
> 60: Convert perf time to TSC                              : Ok
> 61: DWARF unwind                                          : Ok
> 62: x86 instruction decoder - new instructions            : Ok
> 63: x86 bp modify                                         : Ok
> 64: Check open filename arg using perf trace + vfs_getname: Skip
> 65: Add vfs_getname probe to get syscall args filenames   : Skip
> 66: probe libc's inet_pton & backtrace it with ping       : Ok
> 67: Use vfs_getname probe to get syscall args filenames   : Skip
> 68: record trace Zstd compression/decompression           : Ok
> ~/abudanko/kernel/acme
> 
>   $ make -C tools/perf NO_LIBZSTD=1 clean all
> ...
>   $ pushd tools/perf/ && ./perf test && popd
> ~/abudanko/kernel/acme/tools/perf ~/abudanko/kernel/acme
> ...
> 68: record trace Zstd compression/decompression           : Skip
> ~/abudanko/kernel/acme
>
