Re: perf tools: remove option --tail-synthesize ?
On 2018/11/21 21:11, Arnaldo Carvalho de Melo wrote:
> Em Wed, Nov 21, 2018 at 07:45:28AM +, Song Liu escreveu:
>> Hi,
>>
>> I found perf-record --tail-synthesize without --overwrite breaks symbols
>> for perf-script, perf-report, etc. For example:
>>
>> [root@]# ~/perf record -ag --tail-synthesize -- sleep 1
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.129 MB perf.data (3531 samples) ]
>> [root@]# ~/perf script | head
>> swapper 0 [000] 1250675.051971: 1 cycles:ppp:
>> 81009e15 [unknown] ([unknown])
>> 81196b19 [unknown] ([unknown])
>> 81196579 [unknown] ([unknown])
>> 81110ca7 [unknown] ([unknown])
>> 81a01f4a [unknown] ([unknown])
>> 81a017bf [unknown] ([unknown])
>> 8180e17a [unknown] ([unknown])
>>
>> perf-record with --overwrite does NOT have this issue.
>>
>> After digging into this, I found this issue is introduced by commit
>> a73e24d240bc136619d382b1268f34d75c9d25ce.
>>
>> Reverting this commit does fix this issue. However, on second thought,
>> I feel it is probably better to just drop --tail-synthesize, as it doesn't
>> make much sense without --overwrite. All we need is to do tail_synthesize
>> when --overwrite is set.

In some cases we use --overwrite without --tail-synthesize. How about setting --tail-synthesize by default when --overwrite is selected, printing a warning when --overwrite is not set, and leaving a --no-tail-synthesize option? Thank you.

>> Thoughts?
>
> Wang, wdyt?
>
> - Arnaldo
Re: [tools/perf] perf test LLVM failure on 4.9
On 2018/1/23 20:37, Pintu Kumar wrote:

Hi All,

I am verifying all perf tests on an Ubuntu-16 x86-64 platform using kernel version 4.9.20. I have installed several other packages, including clang and llvm. But when I run 'perf test' I get some failures. Specifically, 'perf test LLVM' is failing. Please check the error logs below:

# perf test LLVM
35: Test LLVM searching and compiling :
35.1: Basic BPF llvm compiling test : FAILED!
35.2: Test kbuild searching : Skip
35.3: Compile source for BPF prologue generation test : Skip
35.4: Compile source for BPF relocation test : Skip

When I run with -v I get this:

# perf test -v LLVM
35: Test LLVM searching and compiling :
35.1: Basic BPF llvm compiling test :
--- start ---
test child forked, pid 3304
Unablt to get kernel version from uname '4.9--amd-x86-64'
WARNING: unable to get correct kernel building directory.
Hint: Set correct kbuild directory using 'kbuild-dir' option in [llvm] section of ~/.perfconfig or set it to "" to suppress kbuild detection.
Unablt to get kernel version from uname '4.9--amd-x86-64'
int _version SEC("version") = LINUX_VERSION_CODE;
' | $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o -
libbpf: 2129190-4a0 doesn't provide kernel version
Failed to parse test case 'Basic BPF llvm compiling test'
test child finished with -1
--- end ---
Test LLVM searching and compiling subtest 0: FAILED!

The Basic BPF LLVM compiling test is failing. It reports that bpf could not read the kernel version from uname. Is that why 'perf test LLVM' is failing? My uname says: 4.9--amd-x86-64

'perf test LLVM' requires a well-formed uname string (%d.%d.%d). Please see tools/perf/fetch_kernel_version.
If your distro uses another method to report the kernel version number (from your uname string, the sublevel can't be determined), please help us improve that function, as we did for Ubuntu (checking /proc/version_signature).

My kernel version is 4.9.20 (from the Makefile), so I don't think there is any problem with the kernel version itself. If anybody has come across this problem, please let me know how to resolve this issue.

Thank You!
Regards,
Pintu
Re: perf test BPF failing on 4.15.0-rc6
On 2018/1/4 4:13, Arnaldo Carvalho de Melo wrote:
Em Wed, Jan 03, 2018 at 03:33:07PM -0300, Arnaldo Carvalho de Melo escreveu:
Em Wed, Jan 03, 2018 at 03:27:01PM -0300, Arnaldo Carvalho de Melo escreveu:

Continuing investigation... After applying the fallback patch to allow new tools to work with older kernels:

[root@felicio ~]# perf test bpf
39: BPF filter :
39.1: Basic BPF filtering : Ok
39.2: BPF pinning : Ok
39.3: BPF prologue generation : Ok
39.4: BPF relocation checker : Ok
[root@felicio ~]# uname -a
Linux felicio.ghostprotocols.net 4.13.0-rc7+ #1 SMP Mon Sep 11 13:56:18 -03 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@felicio ~]# rpm -q glibc
glibc-2.17-157.el7_3.2.x86_64
[root@felicio ~]#

After applying the patch below I get to the following, which is what I am trying to fix now:

[root@jouet ~]# perf test bpf
39: BPF filter :
39.1: Basic BPF filtering : Ok
39.2: BPF pinning : Ok
39.3: BPF prologue generation : FAILED!
39.4: BPF relocation checker : Skip
[root@jouet ~]#

I updated the patch to the one at the end of this message to make it work with older glibcs, so that we ask for epoll_pwait() and hook into that as well. Now checking why 39.3 fails...

Couldn't reproduce it after fixing up some kernel build problems; the patch below is all I need to have this working on both Fedora 27 and RHEL7. Please take a look and see if it continues to work on your systems.

It works for me. Thank you.
Since we test epoll_pwait, we'd better correct the function names:

diff --git a/tools/perf/tests/bpf-script-example.c b/tools/perf/tests/bpf-script-example.c
index 268e5f8..e4123c1 100644
--- a/tools/perf/tests/bpf-script-example.c
+++ b/tools/perf/tests/bpf-script-example.c
@@ -31,8 +31,8 @@ struct bpf_map_def SEC("maps") flip_table = {
 	.max_entries = 1,
 };
 
-SEC("func=SyS_epoll_wait")
-int bpf_func__SyS_epoll_wait(void *ctx)
+SEC("func=SyS_epoll_pwait")
+int bpf_func__SyS_epoll_pwait(void *ctx)
 {
 	int ind =0;
 	int *flag = bpf_map_lookup_elem(&flip_table, &ind);
diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
index 34c22cd..a8f9095 100644
--- a/tools/perf/tests/bpf.c
+++ b/tools/perf/tests/bpf.c
@@ -19,13 +19,13 @@
 
 #ifdef HAVE_LIBBPF_SUPPORT
 
-static int epoll_wait_loop(void)
+static int epoll_pwait_loop(void)
 {
 	int i;
 
 	/* Should fail NR_ITERS times */
 	for (i = 0; i < NR_ITERS; i++)
-		epoll_wait(-(i + 1), NULL, 0, 0);
+		epoll_pwait(-(i + 1), NULL, 0, 0, NULL);
 	return 0;
 }
 
@@ -68,7 +68,7 @@ static struct {
 		"[basic_bpf_test]",
 		"fix 'perf test LLVM' first",
 		"load bpf object failed",
-		&epoll_wait_loop,
+		&epoll_pwait_loop,
 		(NR_ITERS + 1) / 2,
 		false,
 	},
@@ -78,7 +78,7 @@ static struct {
 		"[bpf_pinning]",
 		"fix kbuild first",
 		"check your vmlinux setting?",
-		&epoll_wait_loop,
+		&epoll_pwait_loop,
 		(NR_ITERS + 1) / 2,
 		true,
 	},
Re: perf test BPF failing on 4.15.0-rc6
And please check if the kprobe created by

$ perf probe -v SyS_epoll_wait

works for the test program used by this testcase:

#include <sys/epoll.h>
#include <stddef.h>

#define NR_ITERS 100

static int epoll_wait_loop(void)
{
	int i;

	/* Should fail NR_ITERS times */
	for (i = 0; i < NR_ITERS; i++)
		epoll_wait(-(i + 1), NULL, 0, 0);
	return 0;
}

int main()
{
	epoll_wait_loop();
	return 0;
}

In my configuration it gives 100 samples:

$ sudo ./perf record -e probe:SyS_epoll_wait /tmp/a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (100 samples) ]

I guess in your configuration you would get 3 tracepoints, and get 0 samples from them. Thank you.

On 2018/1/3 12:42, Wangnan (F) wrote:

Unable to reproduce it. In my kernel configuration, SyS_epoll_wait is not inlined at all. From the log you sent, it seems that all 3 instances are attached. This testcase should work if the last one (SyS_epoll_wait) gets probed correctly. Could you please have a look at whether the 3rd kprobe event (p:perf_bpf_probe/func_2 _text+2845824) works? Thank you.

On 2018/1/3 2:59, Arnaldo Carvalho de Melo wrote:

Hi Wang,

I just updated my machine to Fedora 27 and 4.15.0-rc6 and the only test failing for me is:

[root@jouet linux]# perf test bpf
39: BPF filter :
39.1: Basic BPF filtering : FAILED!
39.2: BPF pinning : Skip
39.3: BPF prologue generation : Skip
39.4: BPF relocation checker : Skip
[root@jouet linux]#

I haven't checked, but perhaps the problem is that SyS_epoll_wait now seems to be inlined in three places and the eBPF proggie is being added to just one of them? Seemingly relevant excerpt:

Open Debuginfo file: /lib/modules/4.15.0-rc6/build/vmlinux
Try to find probe point from debuginfo.
Matched function: SyS_epoll_wait [2f40eb7]
found inline addr: 0x812b6ff1
Probe point found: compat_SyS_epoll_pwait+129
found inline addr: 0x812b6de7
Probe point found: SyS_epoll_pwait+135
found inline addr: 0x812b6c80
Probe point found: SyS_epoll_wait+0
Found 3 probe_trace_events.
- Arnaldo

P.S.: Full -v output:

[root@jouet linux]# uname -a
Linux jouet 4.15.0-rc6 #4 SMP Tue Jan 2 14:30:53 -03 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@jouet linux]# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)
[root@jouet linux]# clang -v
clang version 6.0.0 (http://llvm.org/git/clang.git 56cc8f8880db2ebc433eeb6b6a707c101467a186) (http://llvm.org/git/llvm.git 3656d83960a4f3fedf6d8f19043abf52379f78c3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[root@jouet linux]# perf test -v BPF
39: BPF filter :
39.1: Basic BPF filtering :
--- start ---
test child forked, pid 24304
Kernel build dir is set to /lib/modules/4.15.0-rc6/build
set env: KBUILD_DIR=/lib/modules/4.15.0-rc6/build
unset env: KBUILD_OPTS
include option is set to -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40f00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated
Re: perf test BPF failing on 4.15.0-rc6
Unable to reproduce it. In my kernel configuration, SyS_epoll_wait is not inlined at all. From the log you sent, it seems that all 3 instances are attached. This testcase should work if the last one (SyS_epoll_wait) gets probed correctly. Could you please have a look at whether the 3rd kprobe event (p:perf_bpf_probe/func_2 _text+2845824) works? Thank you.

On 2018/1/3 2:59, Arnaldo Carvalho de Melo wrote:

Hi Wang,

I just updated my machine to Fedora 27 and 4.15.0-rc6 and the only test failing for me is:

[root@jouet linux]# perf test bpf
39: BPF filter :
39.1: Basic BPF filtering : FAILED!
39.2: BPF pinning : Skip
39.3: BPF prologue generation : Skip
39.4: BPF relocation checker : Skip
[root@jouet linux]#

I haven't checked, but perhaps the problem is that SyS_epoll_wait now seems to be inlined in three places and the eBPF proggie is being added to just one of them? Seemingly relevant excerpt:

Open Debuginfo file: /lib/modules/4.15.0-rc6/build/vmlinux
Try to find probe point from debuginfo.
Matched function: SyS_epoll_wait [2f40eb7]
found inline addr: 0x812b6ff1
Probe point found: compat_SyS_epoll_pwait+129
found inline addr: 0x812b6de7
Probe point found: SyS_epoll_pwait+135
found inline addr: 0x812b6c80
Probe point found: SyS_epoll_wait+0
Found 3 probe_trace_events.

- Arnaldo

P.S.: Full -v output:

[root@jouet linux]# uname -a
Linux jouet 4.15.0-rc6 #4 SMP Tue Jan 2 14:30:53 -03 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@jouet linux]# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-libmpx --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)
[root@jouet linux]# clang -v
clang version 6.0.0 (http://llvm.org/git/clang.git 56cc8f8880db2ebc433eeb6b6a707c101467a186) (http://llvm.org/git/llvm.git 3656d83960a4f3fedf6d8f19043abf52379f78c3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[root@jouet linux]# perf test -v BPF
39: BPF filter :
39.1: Basic BPF filtering :
--- start ---
test child forked, pid 24304
Kernel build dir is set to /lib/modules/4.15.0-rc6/build
set env: KBUILD_DIR=/lib/modules/4.15.0-rc6/build
unset env: KBUILD_OPTS
include option is set to -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40f00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: WORKING_DIR=/lib/modules/4.15.0-rc6/build
set env: CLANG_SOURCE=-
llvm compiling command template: echo '/*
 * bpf-script-example.c
 * Test basic LLVM building
 */
#ifndef LINUX_VERSION_CODE
# error Need LINUX_VERSION_CODE
# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" into llvm section of ~/.perfconfig'
#endif
#define BPF_ANY 0
#define BPF_MAP_TYPE_ARRAY 2
#define BPF_FUNC_map_lookup_elem 1
#define BPF_FUNC_map_update_elem 2
static void *(*bpf_map_lookup_elem)(void *map, void *key) = (void *) BPF_FUNC_m
Re: Php-fpm will crash when perf runs with call graph option
Have you tried updating your kernel to a recent version?

On 2017/12/25 14:58, ufo19890607 wrote:
From: yuzhoujian

I use perf to analyze the performance overhead of the server. There are several Docker containers on the server. The php-fpm in a container crashes whenever perf collects samples for all the CPUs with the call-graph option (perf record -ag). Below is the stack trace from the coredump:

#0 0x7f044ff447bd in re_compile_fastmap_iter (bufp=0x7f044ff447bd , fastmap=0x46 , init_state=, init_state=) at regcomp.c:407
407 if (__wcrtomb (buf, towlower (cset->mbchars[i]), &state)
(gdb) bt
#0 0x7f044ff447bd in re_compile_fastmap_iter (bufp=0x7f044ff447bd , fastmap=0x46 , init_state=, init_state=) at regcomp.c:407
#1 0x00831160 in virtual_file_ex (state=0x7fff9c1a4f70, path=, verify_path=0x0, use_realpath=1) at /home/xiaoju/phpng/php-7.0.6/Zend/zend_virtual_cwd.c:1335
#2 0x007aacee in expand_filepath_with_mode (filepath=0x7f044d6020d8 "/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app", real_path=0x7fff9c1a4fc0 "\360X\032\234\377\177", relative_to=, relative_to_len=0, realpath_mode=1) at /home/xiaoju/phpng/php-7.0.6/main/fopen_wrappers.c:812
#3 0x007c1536 in _php_stream_fopen (filename=0x7f044d6020d8 "/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app", mode=0xdbb1f1 "rb", opened_path=0x0, options=0) at /home/xiaoju/phpng/php-7.0.6/main/streams/plain_wrapper.c:970
#4 0x007bd084 in _php_stream_open_wrapper_ex (path=0x7f044d6020d8 "/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app", mode=0xdbb1f1 "rb", options=8, opened_path=0x0, context=0x7f044d65f4c0) at /home/xiaoju/phpng/php-7.0.6/main/streams/streams.c:2060
#5 0x0071722b in zif_file_get_contents (execute_data=, return_value=0x7f044d615540) at /home/xiaoju/phpng/php-7.0.6/ext/standard/file.c:544
#6 0x0065387c in phar_file_get_contents (execute_data=0x7f044d615570, return_value=0x7f044d615540) at /home/xiaoju/phpng/php-7.0.6/ext/phar/func_interceptors.c:224
I added some debug output to the php source code and found that the virtual_file_ex function's rbp value is really strange, e.g. 0x1 or 0x31. I guess that when perf collects samples for all the CPUs with the -g option, it may corrupt php-fpm's stack. When perf runs without -g, php-fpm is normal. Has anyone encountered similar problems?

BTW, OS on the server: CentOS 7.3; kernel version: 3.10.0-514.16.1.el7.x86_64; php-fpm version: 7.0.6; processor: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
Re: [PATCH V2 2/8] perf tools: rewrite perf mmap read for overwrite
On 2017/12/19 3:07, Liang, Kan wrote:

-union perf_event *perf_mmap__read_backward(struct perf_mmap *map)
+union perf_event *perf_mmap__read_backward(struct perf_mmap *map,
+					   u64 *start, u64 end)
 {
-	u64 head, end;
-	u64 start = map->prev;
-
-	/*
-	 * Check if event was unmapped due to a POLLHUP/POLLERR.
-	 */
-	if (!refcount_read(&map->refcnt))
-		return NULL;
-

> Is removing this checking safe?

It was a duplicate of the one in perf_mmap__read_catchup. I had planned to remove one of them, but it looks like I removed both accidentally. Will keep one in V3.

-	head = perf_mmap__read_head(map);
-	if (!head)
-		return NULL;
+	union perf_event *event = NULL;
 
-	/*
-	 * 'head' pointer starts from 0. Kernel minus sizeof(record) form
-	 * it each time when kernel writes to it, so in fact 'head' is
-	 * negative. 'end' pointer is made manually by adding the size of
-	 * the ring buffer to 'head' pointer, means the validate data can
-	 * read is the whole ring buffer. If 'end' is positive, the ring
-	 * buffer has not fully filled, so we must adjust 'end' to 0.
-	 *
-	 * However, since both 'head' and 'end' is unsigned, we can't
-	 * simply compare 'end' against 0. Here we compare '-head' and
-	 * the size of the ring buffer, where -head is the number of bytes
-	 * kernel write to the ring buffer.
-	 */
-	if (-head < (u64)(map->mask + 1))
-		end = 0;
-	else
-		end = head + map->mask + 1;
+	event = perf_mmap__read(map, *start, end, &map->prev);
+	*start = map->prev;
 
-	return perf_mmap__read(map, start, end, &map->prev);
+	return event;
 }

perf_mmap__read_backward() and perf_mmap__read_forward() become asymmetrical, but their names are symmetrical.

-void perf_mmap__read_catchup(struct perf_mmap *map)
+static int overwrite_rb_find_range(void *buf, int mask, u64 head,
+				   u64 *start, u64 *end);
+
+int perf_mmap__read_catchup(struct perf_mmap *map,
+			    bool overwrite,
+			    u64 *start, u64 *end,
+			    unsigned long *size)
 {
-	u64 head;
+	u64 head = perf_mmap__read_head(map);
+	u64 old = map->prev;
+	unsigned char *data = map->base + page_size;
 
-	if (!refcount_read(&map->refcnt))
-		return;
+	*start = overwrite ? head : old;
+	*end = overwrite ? old : head;
 
-	head = perf_mmap__read_head(map);
-	map->prev = head;
+	if (*start == *end)
+		return 0;
+
+	*size = *end - *start;
+	if (*size > (unsigned long)(map->mask) + 1) {
+		if (!overwrite) {
+			WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");
+
+			map->prev = head;
+			perf_mmap__consume(map, overwrite);
+			return 0;
+		}
+
+		/*
+		 * Backward ring buffer is full. We still have a chance to read
+		 * most of data from it.
+		 */
+		if (overwrite_rb_find_range(data, map->mask, head, start, end))
+			return -1;
+	}
+
+	return 1;

> Coding suggestion: this function returns two different values (1 and -1) on two failure paths. It should return 0 for success and -1 for failure.

For perf top, -1 is enough for failure. But for perf_mmap__push, it looks like two failure paths are needed, aren't they?

I see. Still, I think this is not a good practice. Please try to avoid returning three states, for example by comparing start and end outside. If that can't be avoided, how about returning an enum to tell the reader about the difference?

You totally redefined perf_mmap__read_catchup(). The original meaning of this function is to adjust md->prev so that the following perf_mmap__read() can read events one by one safely. Only the backward ring buffer needs catchup: the kernel's 'head' pointer keeps moving (backward), while md->prev stays unchanged. For the forward ring buffer, md->prev catches up each time it reads from the ring buffer. Your patch totally changes its meaning.
Now its meaning is finding the available data in a ring buffer (reporting start and end). At the very least we need a better name.

Sure, I will introduce a new function to do it in V3.

In addition, if I understand your code correctly, we don't need to report 'size' to the caller, because size is determined by start and end.

Sure, I will remove 'size' in V3.

In addition, since the concepts of backward and overwrite are now unified, we can avoid updating map->prev during one-by-one reading (because backward reading doesn't need a consume operation). I think we can decouple the updating of map->prev from these two basic read functions. This makes the operations on map->prev clearer. Now the moving
Re: [PATCH V2 6/8] perf top: add overwrite fall back
On 2017/12/7 7:33, kan.li...@intel.com wrote:
From: Kan Liang

Switch to non-overwrite mode if the kernel does not support the overwrite ring buffer. This only takes effect when overwrite mode is supported. No change to current behavior.

Signed-off-by: Kan Liang
---
 tools/perf/builtin-top.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 5e15d27..7c462d6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -931,6 +931,27 @@ static int perf_top_overwrite_check(struct perf_top *top)
 	return 0;
 }
 
+static int perf_top_overwrite_fallback(struct perf_top *top,
+				       struct perf_evsel *evsel)
+{
+	struct record_opts *opts = &top->record_opts;
+	struct perf_evlist *evlist = top->evlist;
+	struct perf_evsel *counter;
+
+	if (!opts->overwrite)
+		return 0;
+
+	/* only fall back when first event fails */
+	if (evsel != perf_evlist__first(evlist))
+		return 0;
+
+	evlist__for_each_entry(evlist, counter)
+		counter->attr.write_backward = false;
+	opts->overwrite = false;
+	ui__warning("fall back to non-overwrite mode\n");
+	return 1;

You can check perf_missing_features.write_backward here. Only do the fallback when we know the problem is caused by the missing write_backward feature.

+}
+
 static int perf_top__start_counters(struct perf_top *top)
 {
 	char msg[BUFSIZ];
@@ -954,6 +975,20 @@ static int perf_top__start_counters(struct perf_top *top)
 try_again:
 	if (perf_evsel__open(counter, top->evlist->cpus,
 			     top->evlist->threads) < 0) {
+
+		/*
+		 * Specially handle overwrite fall back.
+		 * Because perf top is the only tool which has
+		 * overwrite mode by default, support
+		 * both overwrite and non-overwrite mode, and
+		 * require consistent mode for all events.
+		 *
+		 * May move it to generic code with more tools
+		 * have similar attribute.
+		 */
+		if (perf_top_overwrite_fallback(top, counter))
+			goto try_again;
+

As written, it will unconditionally remove the backward bit even if the problem is caused by some other missing feature. Thank you.
Re: [PATCH V2 2/8] perf tools: rewrite perf mmap read for overwrite
On 2017/12/7 7:32, kan.li...@intel.com wrote:
From: Kan Liang

perf_mmap__read_catchup and perf_mmap__read_backward are used to read events one by one from the ring buffer in overwrite mode. They always read the stale buffer which has already been processed, because the previous location in the processed ring buffer is discarded.

Introduce perf_mmap__read_done, which updates map->prev to indicate the position of the processed buffer.

Refine perf_mmap__read_catchup to calculate the start and the end of the ring buffer. It only needs to do the calculation once at the beginning, because the ring buffer is paused in overwrite mode; the calculation is not needed in perf_mmap__read_backward.

For now, the only user is backward-ring-buffer in perf test.

Signed-off-by: Kan Liang
---
 tools/perf/tests/backward-ring-buffer.c | 9 +++-
 tools/perf/util/mmap.c | 88 +++--
 tools/perf/util/mmap.h | 7 ++-
 3 files changed, 63 insertions(+), 41 deletions(-)

diff --git a/tools/perf/tests/backward-ring-buffer.c b/tools/perf/tests/backward-ring-buffer.c
index 4035d43..66d9e54 100644
--- a/tools/perf/tests/backward-ring-buffer.c
+++ b/tools/perf/tests/backward-ring-buffer.c
@@ -31,10 +31,14 @@ static int count_samples(struct perf_evlist *evlist, int *sample_count,
 	int i;
 
 	for (i = 0; i < evlist->nr_mmaps; i++) {
+		struct perf_mmap *map = &evlist->overwrite_mmap[i];
 		union perf_event *event;
+		unsigned long size;
+		u64 start, end;
 
-		perf_mmap__read_catchup(&evlist->overwrite_mmap[i]);
-		while ((event = perf_mmap__read_backward(&evlist->overwrite_mmap[i])) != NULL) {
+		perf_mmap__read_catchup(map, true, &start, &end, &size);
+		while ((event = perf_mmap__read_backward(map, &start, end)) !=
+		       NULL) {
 			const u32 type = event->header.type;
 
 			switch (type) {
@@ -49,6 +53,7 @@ static int count_samples(struct perf_evlist *evlist, int *sample_count,
 				return TEST_FAIL;
 			}
 		}
+		perf_mmap__read_done(map);
 	}
 	return TEST_OK;
 }
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 05076e6..bf67460 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -85,51 +85,65 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap *map)
 	return perf_mmap__read(map, old, head, &map->prev);
 }
 
-union perf_event *perf_mmap__read_backward(struct perf_mmap *map)
+union perf_event *perf_mmap__read_backward(struct perf_mmap *map,
+					   u64 *start, u64 end)
 {
-	u64 head, end;
-	u64 start = map->prev;
-
-	/*
-	 * Check if event was unmapped due to a POLLHUP/POLLERR.
-	 */
-	if (!refcount_read(&map->refcnt))
-		return NULL;
-

Is removing this checking safe?

-	head = perf_mmap__read_head(map);
-	if (!head)
-		return NULL;
+	union perf_event *event = NULL;
 
-	/*
-	 * 'head' pointer starts from 0. Kernel minus sizeof(record) form
-	 * it each time when kernel writes to it, so in fact 'head' is
-	 * negative. 'end' pointer is made manually by adding the size of
-	 * the ring buffer to 'head' pointer, means the validate data can
-	 * read is the whole ring buffer. If 'end' is positive, the ring
-	 * buffer has not fully filled, so we must adjust 'end' to 0.
-	 *
-	 * However, since both 'head' and 'end' is unsigned, we can't
-	 * simply compare 'end' against 0. Here we compare '-head' and
-	 * the size of the ring buffer, where -head is the number of bytes
-	 * kernel write to the ring buffer.
-	 */
-	if (-head < (u64)(map->mask + 1))
-		end = 0;
-	else
-		end = head + map->mask + 1;
+	event = perf_mmap__read(map, *start, end, &map->prev);
+	*start = map->prev;
 
-	return perf_mmap__read(map, start, end, &map->prev);
+	return event;
 }

perf_mmap__read_backward() and perf_mmap__read_forward() become asymmetrical, but their names are symmetrical.

-void perf_mmap__read_catchup(struct perf_mmap *map)
+static int overwrite_rb_find_range(void *buf, int mask, u64 head,
+				   u64 *start, u64 *end);
+
+int perf_mmap__read_catchup(struct perf_mmap *map,
+			    bool overwrite,
+			    u64 *start, u64 *end,
+			    unsigned long *size)
 {
-	u64 head;
+	u64 head = perf_mmap__read_head(map);
+	u64 old = map->prev;
+	unsigned char *data = map->base + page_size;
 
-	if (!refcount_read(&map->refcnt))
-		return;
+	*start = overwrite ? head : old;
+	*end = overwrite ? old : head;
 
-	head = perf_mmap_
Re: [PATCH V2 1/8] perf tools: remove stale perf evlist mmap read for backward
On 2017/12/7 7:32, kan.li...@intel.com wrote:
From: Kan Liang

perf_evlist__mmap_read_catchup and perf_evlist__mmap_read_backward are only for overwrite mode, but they read the evlist->mmap buffer, which is for non-overwrite mode. This has not caused any serious problem yet, because nothing uses them. Remove the unused interfaces.

Signed-off-by: Kan Liang
Acked-by: Wang Nan
---
 tools/perf/util/evlist.c | 17 -
 tools/perf/util/evlist.h | 4
 2 files changed, 21 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 3570355..4b6a06d 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -714,28 +714,11 @@ union perf_event *perf_evlist__mmap_read_forward(struct perf_evlist *evlist, int
 	return perf_mmap__read_forward(md);
 }
 
-union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
-{
-	struct perf_mmap *md = &evlist->mmap[idx];
-
-	/*
-	 * No need to check messup for backward ring buffer:
-	 * We can always read arbitrary long data from a backward
-	 * ring buffer unless we forget to pause it before reading.
-	 */
-	return perf_mmap__read_backward(md);
-}
-
 union perf_event *perf_evlist__mmap_read(struct perf_evlist *evlist, int idx)
 {
 	return perf_evlist__mmap_read_forward(evlist, idx);
 }
 
-void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx)
-{
-	perf_mmap__read_catchup(&evlist->mmap[idx]);
-}
-
 void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx)
 {
 	perf_mmap__consume(&evlist->mmap[idx], false);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 7516066..a80fd47 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -132,10 +132,6 @@ union perf_event *perf_evlist__mmap_read(struct perf_evlist *evlist, int idx);
 union perf_event *perf_evlist__mmap_read_forward(struct perf_evlist *evlist,
 						 int idx);
-union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist,
-						  int idx);
-void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx);
-
 void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx);
 
 int perf_evlist__open(struct perf_evlist *evlist);
Re: [PATCH 4/5] perf top: switch to overwrite mode
On 2017/12/6 6:39, kan.li...@intel.com wrote:
From: Kan Liang

perf_top__mmap_read has a severe performance issue on Knights Landing/Mill when monitoring a heavily loaded system. It can take several minutes to finish, which is unacceptable.

Currently, perf top is in non-overwrite mode. In non-overwrite mode it tries to read everything in the ring buffer and doesn't pause the ring buffer. Once lots of samples are delivered persistently, the processing time can be very long. Also, the latest samples can be lost when the ring buffer is full.

In overwrite mode, it takes a snapshot of the system by pausing the ring buffer, which can significantly reduce the processing time. Overwrite mode also always keeps the latest samples. Considering the real-time requirements of perf top, overwrite mode is more suitable.

Actually, perf top used to be in overwrite mode. It was changed to non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non overwrite mode"). It's better to change it back.

Some records will be lost in overwrite mode because of pausing the ring buffer. That has little impact on the accuracy of the snapshot and can be tolerated, so the lost-events check is removed.

Unconditionally wait 100 ms before each snapshot. This also reduces the overhead caused by pausing the ring buffer, especially on lightly loaded systems.
Signed-off-by: Kan Liang
---
 tools/perf/builtin-top.c | 30 ++
 tools/perf/ui/browsers/hists.c | 12 +---
 2 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 540461f..721d786 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -283,16 +283,6 @@ static void perf_top__print_sym_table(struct perf_top *top)
 
 	printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
 
-	if (hists->stats.nr_lost_warned !=
-	    hists->stats.nr_events[PERF_RECORD_LOST]) {
-		hists->stats.nr_lost_warned =
-			hists->stats.nr_events[PERF_RECORD_LOST];
-		color_fprintf(stdout, PERF_COLOR_RED,
-			      "WARNING: LOST %d chunks, Check IO/CPU overload",
-			      hists->stats.nr_lost_warned);
-		++printed;
-	}
-
 	if (top->sym_filter_entry) {
 		perf_top__show_details(top);
 		return;
@@ -807,14 +797,19 @@ static void perf_event__process_sample(struct perf_tool *tool,
 
 static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 {
+	struct perf_mmap *md = &top->evlist->overwrite_mmap[idx];
 	struct perf_sample sample;
 	struct perf_evsel *evsel;
 	struct perf_session *session = top->session;
 	union perf_event *event;
 	struct machine *machine;
+	unsigned long size;
+	u64 end, start;
 	int ret;
 
-	while ((event = perf_evlist__mmap_read(top->evlist, idx)) != NULL) {
+	perf_mmap__read_catchup(md, true, &start, &end, &size);
+
+	while ((event = perf_mmap__read_backward(md, &start, end)) != NULL) {
 		ret = perf_evlist__parse_sample(top->evlist, event, &sample);
 		if (ret) {
 			pr_err("Can't parse sample, err = %d\n", ret);
@@ -869,16 +864,21 @@ static void perf_top__mmap_read_idx(struct perf_top *top, int idx)
 	} else
 		++session->evlist->stats.nr_unknown_events;
 next_event:
-		perf_evlist__mmap_consume(top->evlist, idx);
+		perf_mmap__consume(md, true);
 	}
+
+	perf_mmap__read_done(md);
 }
 
 static void perf_top__mmap_read(struct perf_top *top)
 {
 	int i;
 
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);
 	for (i = 0; i < top->evlist->nr_mmaps; i++)
 		perf_top__mmap_read_idx(top, i);
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
 }
 
 static int perf_top__start_counters(struct perf_top *top)
@@ -1029,12 +1029,9 @@ static int __cmd_top(struct perf_top *top)
 	}
 
 	while (!done) {
-		u64 hits = top->samples;
-
 		perf_top__mmap_read(top);
 
-		if (hits == top->samples)
-			ret = perf_evlist__poll(top->evlist, 100);
+		ret = perf_evlist__poll(top->evlist, 100);
 
 		if (resize) {
 			perf_top__resize(top);
@@ -1127,6 +1124,7 @@ int cmd_top(int argc, const char **argv)
 			.uses_mmap = true,
 		},
 		.proc_map_timeout = 500,
+		.overwrite	= 1,

This breaks old kernels:

# ./perf top
Error: Reading from overwrite event is not supported by this kernel.
# uname -r
3.0.13-0.27-default

We need a way to fall back to the normal ring buffer when the kernel doesn't support reading from overwrite events.
Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry
On 2017/11/6 19:57, Michal Hocko wrote: On Mon 06-11-17 19:03:34, Wangnan (F) wrote: On 2017/11/6 18:40, Michal Hocko wrote: On Mon 06-11-17 17:59:54, Wangnan (F) wrote: On 2017/11/6 16:52, Michal Hocko wrote: On Mon 06-11-17 15:04:40, Bob Liu wrote: On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan wrote: tlb_gather_mmu(&tlb, mm, 0, -1) means gathering all virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). CC'ed Will Deacon. Which makes leaking of tlb entries. For example, when oom_reaper selects a task and reaps its virtual memory space, another thread in this task group may still running on another core and access these already freed memory through tlb entries. No threads should be running in userspace by the time the reaper gets to unmap their address space. So the only potential case is they are accessing the user memory from the kernel when we should fault and we have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing real? This patch gather each vma instead of gathering full vm space, tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. I do not have any objections to do per vma tlb flushing because it would free gathered pages sooner but I am not sure I see any real problem here. Have you seen any real issues or this is more of a review driven fix? We saw the problem when we try to reuse oom reaper's code in another situation. In our situation, we allow reaping a task before all other tasks in its task group finish their exiting procedure. I'd like to know what ensures "No threads should be running in userspace by the time the reaper"? All tasks are killed by the time. So they should be taken out to the kernel. Sorry. I read oom_kill_process() but still unable to understand why all tasks are killed. 
oom_kill_process() kills the victim by sending SIGKILL. It is broadcast to all tasks in its task group, but asynchronously. In the following case a race can happen (Thread1 is in Task1's task group):

  core 1                                  core 2
  Thread1 running
                                          oom_kill_process() selects Task1 as victim
                                          oom_kill_process() sends SIGKILL to Task1
                                          oom_kill_process() sends SIGKILL to Thread1
                                          oom_kill_process() wakes up oom reaper
                                          switch to oom_reaper
                                          __oom_reap_task_mm
                                            tlb_gather_mmu
                                            unmap_page_range, reap Task1
                                            tlb_finish_mmu
  Write page
  be kicked off from core
  Receives SIGKILL

So what makes Thread1 get kicked off core 1 before core 2 starts unmapping? complete_signal should call signal_wake_up on all threads, because this is a group-fatal signal, and that should send an IPI to all of the CPUs they run on. Even if we do not wait for the IPI to complete, the race window should be only a few instructions, while it takes quite some time to hand over to the oom reaper. If complete_signal is the mechanism we rely on to ensure all threads have exited, then I'm sure it is not enough. As you said, we still have a small race window. On some platforms an IPI from one core to another takes a bit longer than you might expect, and the core that receives the IPI may be running at a very low frequency. In our situation, we put the reaper code in do_exit after receiving SIGKILL, and observed TLB entries leaking. Since this is a SIGKILL, complete_signal should have been executed, so I think oom_reaper has a similar problem. Thank you.
Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry
On 2017/11/6 18:40, Michal Hocko wrote: On Mon 06-11-17 17:59:54, Wangnan (F) wrote: On 2017/11/6 16:52, Michal Hocko wrote: On Mon 06-11-17 15:04:40, Bob Liu wrote: On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan wrote: tlb_gather_mmu(&tlb, mm, 0, -1) means gathering all virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). CC'ed Will Deacon. Which makes leaking of tlb entries. For example, when oom_reaper selects a task and reaps its virtual memory space, another thread in this task group may still running on another core and access these already freed memory through tlb entries. No threads should be running in userspace by the time the reaper gets to unmap their address space. So the only potential case is they are accessing the user memory from the kernel when we should fault and we have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing real? This patch gather each vma instead of gathering full vm space, tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. I do not have any objections to do per vma tlb flushing because it would free gathered pages sooner but I am not sure I see any real problem here. Have you seen any real issues or this is more of a review driven fix? We saw the problem when we try to reuse oom reaper's code in another situation. In our situation, we allow reaping a task before all other tasks in its task group finish their exiting procedure. I'd like to know what ensures "No threads should be running in userspace by the time the reaper"? All tasks are killed by the time. So they should be taken out to the kernel. Sorry. I read oom_kill_process() but still unable to understand why all tasks are killed. oom_kill_process() kill victim by sending SIGKILL. 
It is broadcast to all tasks in its task group, but asynchronously. In the following case a race can happen (Thread1 is in Task1's task group):

  core 1                                  core 2
  Thread1 running
                                          oom_kill_process() selects Task1 as victim
                                          oom_kill_process() sends SIGKILL to Task1
                                          oom_kill_process() sends SIGKILL to Thread1
                                          oom_kill_process() wakes up oom reaper
                                          switch to oom_reaper
                                          __oom_reap_task_mm
                                            tlb_gather_mmu
                                            unmap_page_range, reap Task1
                                            tlb_finish_mmu
  Write page
  be kicked off from core
  Receives SIGKILL

So what makes Thread1 get kicked off core 1 before core 2 starts unmapping? Thank you.
Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry
On 2017/11/6 16:52, Michal Hocko wrote: On Mon 06-11-17 15:04:40, Bob Liu wrote: On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan wrote: tlb_gather_mmu(&tlb, mm, 0, -1) means gathering all virtual memory space. In this case, tlb->fullmm is true. Some archs like arm64 doesn't flush TLB when tlb->fullmm is true: commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1"). CC'ed Will Deacon. Which makes leaking of tlb entries. For example, when oom_reaper selects a task and reaps its virtual memory space, another thread in this task group may still running on another core and access these already freed memory through tlb entries. No threads should be running in userspace by the time the reaper gets to unmap their address space. So the only potential case is they are accessing the user memory from the kernel when we should fault and we have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing real? This patch gather each vma instead of gathering full vm space, tlb->fullmm is not true. The behavior of oom reaper become similar to munmapping before do_exit, which should be safe for all archs. I do not have any objections to do per vma tlb flushing because it would free gathered pages sooner but I am not sure I see any real problem here. Have you seen any real issues or this is more of a review driven fix? We saw the problem when we tried to reuse the oom reaper's code in another situation. In our case, we allow reaping a task before all other tasks in its task group have finished their exit procedure. I'd like to know what ensures that "no threads should be running in userspace by the time the reaper" runs? Thank you.
Signed-off-by: Wang Nan
Cc: Bob Liu
Cc: Michal Hocko
Cc: Andrew Morton
Cc: Michal Hocko
Cc: David Rientjes
Cc: Ingo Molnar
Cc: Roman Gushchin
Cc: Konstantin Khlebnikov
Cc: Andrea Arcangeli
---
 mm/oom_kill.c | 7 +++----
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75..18c5b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	tlb_gather_mmu(&tlb, mm, 0, -1);
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
+			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
+		}
 	}
-	tlb_finish_mmu(&tlb, 0, -1);
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 		task_pid_nr(tsk), tsk->comm,
 		K(get_mm_counter(mm, MM_ANONPAGES)),
Re: [PATCH 1/2] perf mmap: Fix perf backward recording
On 2017/11/1 21:57, Liang, Kan wrote: On 2017/11/1 20:00, Namhyung Kim wrote: On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote: On 2017/11/1 17:49, Namhyung Kim wrote: Hi, On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote: perf record backward recording doesn't work as we expected: it never overwrite when ring buffer full. Test: (Run a busy printing python task background like this: while True: print 123 send SIGUSR2 to perf to capture snapshot.) [SNIP] Signed-off-by: Wang Nan --- tools/perf/util/evlist.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e..4c5daba 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused, } static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, - struct mmap_params *mp, int cpu_idx, + struct mmap_params *_mp, int cpu_idx, int thread, int *_output, int *_output_backward) { struct perf_evsel *evsel; int revent; int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx); + struct mmap_params *mp; evlist__for_each_entry(evlist, evsel) { struct perf_mmap *maps = evlist->mmap; + struct mmap_params rdonly_mp; int *output = _output; int fd; int cpu; + mp = _mp; if (evsel->attr.write_backward) { output = _output_backward; maps = evlist->backward_mmap; + rdonly_mp = *_mp; + rdonly_mp.prot &= ~PROT_WRITE; + mp = &rdonly_mp; if (!maps) { maps = perf_evlist__alloc_mmap(evlist); -- What about this instead (not tested)? 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 
 		if (*output == -1) {
 			*output = fd;
 
+			if (evsel->attr.write_backward)
+				mp->prot = PROT_READ;
+			else
+				mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if write_backward is not set. If you want to delay the setting of mp->prot, you need to consider both evlist->overwrite and evsel->attr.write_backward. I thought evsel->attr.write_backward should be set when evlist->overwrite is set. Do you mean the following case?

	perf record --overwrite -e 'cycles/no-overwrite/'

No. evlist->overwrite is unrelated to '--overwrite'. This is why I said the concepts of 'overwrite' and 'backward' are ambiguous. Yes, I think we should make it clear. As we discussed previously, there are four possible combinations for accessing the ring buffer: 'forward non-overwrite', 'forward overwrite', 'backward non-overwrite' and 'backward overwrite'. Actually, not all of the combinations are necessary.
- 'forward overwrite' mode brings some problems which were mentioned in commit 9ecda41acb97 ("perf/core: Add ::write_backward attribute to perf event").
- 'backward non-overwrite' mode is very similar to 'forward non-overwrite'. There is no extra advantage, and keeping one non-overwrite mode is enough.
So 'forward non-overwrite' and 'backward overwrite' are enough for all perf tools. Furthermore, 'forward' and 'backward' only indicate the direction of the ring buffer. They don't impact the result or performance, and are not as important as the concept of overwrite/non-overwrite. To simplify the concepts, only 'non-overwrite' mode and 'overwrite' mode should be kept: 'non-overwrite' mode implies the forward ring buffer, and 'overwrite' mode implies the backward ring buffer.
perf record never sets 'evlist->overwrite'. What '--overwrite' actually does is set write_backward. Some testcases need an overwrite evlist. There are only four test cases which set overwrite: sw-clock, task-exit, mmap-basic, and backward-ring-buffer. Only backward-ring-buffer is 'backward overwrite'; the other three are all 'forward overwrite'. We just need to set write_backward to convert them to 'backward overwrite'. I think it's not hard to clean up. If we add a new rule that overwrite ring buffers are always backward, then it is not hard to clean up. However, support for the forward overwrite ring buffer has a long history and the code was not written by me. I'd like to check whether there is some reason to keep supporting this configuration. Thank you.
Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming
On 2017/11/1 23:04, Liang, Kan wrote: On 2017/11/1 22:22, Liang, Kan wrote: On 2017/11/1 21:26, Liang, Kan wrote: The meaning of perf record's "overwrite" option and many "overwrite" in source code are not clear. In perf's code, the 'overwrite' has 2 meanings: 1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument). 2. Set evsel's "backward" attribute (in apply_config_terms). perf record doesn't use meaning 1 at all, but have a overwrite option, its real meaning is setting backward. I don't understand here. 'overwrite' has 2 meanings. perf record only support 1. It should be a bug, and need to be fixed. Not a bug, but ambiguous. Perf record doesn't need overwrite main channel (we have two channels: evlist->mmap is main channel and evlist->backward_mmap is backward evlist->channel), but some testcases require it, and your new patchset may require it. 'perf record --overwrite' doesn't set main channel overwrite. What it does is moving all evsels to backward channel, and we can move some evsels back to the main channel by /no-overwrite/ setting. This behavior is hard to understand. As my understanding, the 'main channel' should depends on what user sets. If --overwrite is applied, then evlist->backward_mmap should be the 'main channel'. evlist->overwrite should be set to true as well. Then it introduces a main channel switching mechanism, and we need checking evlist->overwrite and another factor to determine which one is the main channel. Make things more complex. We should check the evlist->overwrite. Now, all perf tools force evlist->overwrite = false. I think it doesn’t make sense. What is another factor? If we support mixed channel as well as forward overwrite ring buffer, evlist->overwrite is not enough. I don't think it will be too complex. In perf_evlist__mmap_ex, we just need to add a check. 
	if (!overwrite)
		evlist->mmap = perf_evlist__alloc_mmap(evlist);
	else
		evlist->backward_mmap = perf_evlist__alloc_mmap(evlist);

In perf_evlist__mmap_per_evsel, we already handle the per-event overwrite case. We just need to add similar code to handle per-event non-overwrite. Then the logic becomes:

	if (check write_backward) {
		maps = evlist->backward_mmap;
		if (!maps) {
			maps = perf_evlist__alloc_mmap(...);
			if (!maps) {
				/* error processing */
			}
			evlist->backward_mmap = maps;
			if (evlist->bkw_mmap_state == BKW_MMAP_NOTREADY)
				perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_RUNNING);
		}
	} else {
		maps = evlist->mmap;
		if (!maps) {
			maps = perf_evlist__alloc_mmap(...);
			if (!maps) {
				/* error processing */
			}
			evlist->mmap = maps;
		}
	}

Other code should already check evlist->mmap and evlist->backward_mmap, so it probably doesn't need to change.

Thanks,
Kan
Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming
On 2017/11/1 22:22, Liang, Kan wrote: On 2017/11/1 21:26, Liang, Kan wrote: The meaning of perf record's "overwrite" option and many "overwrite" in source code are not clear. In perf's code, the 'overwrite' has 2 meanings: 1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument). 2. Set evsel's "backward" attribute (in apply_config_terms). perf record doesn't use meaning 1 at all, but have a overwrite option, its real meaning is setting backward. I don't understand here. 'overwrite' has 2 meanings. perf record only support 1. It should be a bug, and need to be fixed. Not a bug, but ambiguous. Perf record doesn't need overwrite main channel (we have two channels: evlist->mmap is main channel and evlist->backward_mmap is backward evlist->channel), but some testcases require it, and your new patchset may require it. 'perf record --overwrite' doesn't set main channel overwrite. What it does is moving all evsels to backward channel, and we can move some evsels back to the main channel by /no-overwrite/ setting. This behavior is hard to understand. As my understanding, the 'main channel' should depends on what user sets. If --overwrite is applied, then evlist->backward_mmap should be the 'main channel'. evlist->overwrite should be set to true as well. Then it introduces a main channel switching mechanism, and we need checking evlist->overwrite and another factor to determine which one is the main channel. Make things more complex. /no-overwrite/ setting is per-event setting. Only when we finish the global setting, then the per-event setting will be considered. You may refer to apply_config_terms. Yes. Thanks, Kan
Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming
On 2017/11/1 21:26, Liang, Kan wrote: The meaning of perf record's "overwrite" option and many "overwrite" in source code are not clear. In perf's code, the 'overwrite' has 2 meanings: 1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument). 2. Set evsel's "backward" attribute (in apply_config_terms). perf record doesn't use meaning 1 at all, but have a overwrite option, its real meaning is setting backward. I don't understand here. 'overwrite' has 2 meanings. perf record only support 1. It should be a bug, and need to be fixed. Not a bug, but ambiguous. Perf record doesn't need overwrite main channel (we have two channels: evlist->mmap is main channel and evlist->backward_mmap is backward channel), but some testcases require it, and your new patchset may require it. 'perf record --overwrite' doesn't set main channel overwrite. What it does is moving all evsels to backward channel, and we can move some evsels back to the main channel by /no-overwrite/ setting. This behavior is hard to understand. Thank you.
Re: [PATCH 1/2] perf mmap: Fix perf backward recording
On 2017/11/1 20:39, Jiri Olsa wrote: On Wed, Nov 01, 2017 at 08:10:49PM +0800, Wangnan (F) wrote: SNIP diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e154a6..27ebe355e794 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, if (*output == -1) { *output = fd; + if (evsel->attr.write_backward) + mp->prot = PROT_READ; + else + mp->prot = PROT_READ | PROT_WRITE; + If evlist->overwrite is true, PROT_WRITE should be unset even if write_backward is not set. If you want to delay the setting of mp->prot, you need to consider both evlist->overwrite and evsel->attr.write_backward. I thought evsel->attr.write_backward should be set when evlist->overwrite is set. Do you mean following case? perf record --overwrite -e 'cycles/no-overwrite/' No. evlist->overwrite is unrelated to '--overwrite'. This is why I said the concept of 'overwrite' and 'backward' is ambiguous. perf record never sets 'evlist->overwrite'. What '--overwrite' actually does is setting write_backward. Some testcases needs overwrite evlist. did not check deeply, but so why can't we do the below? jirka --- diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c index f4d9fc54b382..4643c487bd29 100644 --- a/tools/perf/builtin-record.c +++ b/tools/perf/builtin-record.c @@ -300,7 +300,7 @@ static int record__mmap_evlist(struct record *rec, struct record_opts *opts = &rec->opts; char msg[512]; - if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, false, + if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, opts->overwrite, opts->auxtrace_mmap_pages, opts->auxtrace_snapshot_mode) < 0) { perf_evlist__mmap_ex's overwrite argument is used to control evlist->mmap. Consider Namhyung's example: perf record --overwrite -e 'cycles/no-overwrite/' In this case both evlist->mmap and evlist->backward_mmap would be set to overwrite. 
'cycles' will be put into evlist->mmap, so it will be set to overwrite incorrectly. Patch 2/2 clarifies these concepts. What we want is flight-recorder mode, not just a read-only ring buffer. Even when flight-recorder mode is selected, we can still have a normal ring buffer which keeps outputting data. Thank you.
Re: [PATCH 1/2] perf mmap: Fix perf backward recording
On 2017/11/1 20:00, Namhyung Kim wrote: On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote: On 2017/11/1 17:49, Namhyung Kim wrote: Hi, On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote: perf record backward recording doesn't work as we expected: it never overwrite when ring buffer full. Test: (Run a busy printing python task background like this: while True: print 123 send SIGUSR2 to perf to capture snapshot.) [SNIP] Signed-off-by: Wang Nan --- tools/perf/util/evlist.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e..4c5daba 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused, } static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, - struct mmap_params *mp, int cpu_idx, + struct mmap_params *_mp, int cpu_idx, int thread, int *_output, int *_output_backward) { struct perf_evsel *evsel; int revent; int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx); + struct mmap_params *mp; evlist__for_each_entry(evlist, evsel) { struct perf_mmap *maps = evlist->mmap; + struct mmap_params rdonly_mp; int *output = _output; int fd; int cpu; + mp = _mp; if (evsel->attr.write_backward) { output = _output_backward; maps = evlist->backward_mmap; + rdonly_mp = *_mp; + rdonly_mp.prot &= ~PROT_WRITE; + mp = &rdonly_mp; if (!maps) { maps = perf_evlist__alloc_mmap(evlist); -- What about this instead (not tested)? 
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e154a6..27ebe355e794 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, if (*output == -1) { *output = fd; + if (evsel->attr.write_backward) + mp->prot = PROT_READ; + else + mp->prot = PROT_READ | PROT_WRITE; + If evlist->overwrite is true, PROT_WRITE should be unset even if write_backward is not set. If you want to delay the setting of mp->prot, you need to consider both evlist->overwrite and evsel->attr.write_backward. I thought evsel->attr.write_backward should be set when evlist->overwrite is set. Do you mean following case? perf record --overwrite -e 'cycles/no-overwrite/' No. evlist->overwrite is unrelated to '--overwrite'. This is why I said the concept of 'overwrite' and 'backward' is ambiguous. perf record never sets 'evlist->overwrite'. What '--overwrite' actually does is setting write_backward. Some testcases needs overwrite evlist. Thank you.
Re: [PATCH 1/2] perf mmap: Fix perf backward recording
On 2017/11/1 17:49, Namhyung Kim wrote: Hi, On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote: perf record backward recording doesn't work as we expected: it never overwrite when ring buffer full. Test: (Run a busy printing python task background like this: while True: print 123 send SIGUSR2 to perf to capture snapshot.) [SNIP] Signed-off-by: Wang Nan --- tools/perf/util/evlist.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e..4c5daba 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused, } static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, - struct mmap_params *mp, int cpu_idx, + struct mmap_params *_mp, int cpu_idx, int thread, int *_output, int *_output_backward) { struct perf_evsel *evsel; int revent; int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx); + struct mmap_params *mp; evlist__for_each_entry(evlist, evsel) { struct perf_mmap *maps = evlist->mmap; + struct mmap_params rdonly_mp; int *output = _output; int fd; int cpu; + mp = _mp; if (evsel->attr.write_backward) { output = _output_backward; maps = evlist->backward_mmap; + rdonly_mp = *_mp; + rdonly_mp.prot &= ~PROT_WRITE; + mp = &rdonly_mp; if (!maps) { maps = perf_evlist__alloc_mmap(evlist); -- What about this instead (not tested)? diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c index c6c891e154a6..27ebe355e794 100644 --- a/tools/perf/util/evlist.c +++ b/tools/perf/util/evlist.c @@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx, if (*output == -1) { *output = fd; + if (evsel->attr.write_backward) + mp->prot = PROT_READ; + else + mp->prot = PROT_READ | PROT_WRITE; + If evlist->overwrite is true, PROT_WRITE should be unset even if write_backward is not set. 
If you want to delay the setting of mp->prot, you need to consider both evlist->overwrite and evsel->attr.write_backward. Then why not remove mp->prot entirely and add a prot argument to perf_mmap__mmap, since the prot setting is always decided independently?

 		if (perf_mmap__mmap(&maps[idx], mp, *output) < 0)
 			return -1;
 	} else {
@@ -1058,9 +1063,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, unsigned int pages,
 	struct perf_evsel *evsel;
 	const struct cpu_map *cpus = evlist->cpus;
 	const struct thread_map *threads = evlist->threads;
-	struct mmap_params mp = {
-		.prot = PROT_READ | (overwrite ? 0 : PROT_WRITE),
-	};
+	struct mmap_params mp;

 	if (!evlist->mmap)
 		evlist->mmap = perf_evlist__alloc_mmap(evlist);

The 'overwrite' argument in perf_evlist__mmap_ex() controls evlist->overwrite. Thank you.
Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming
On 2017/11/1 18:03, Namhyung Kim wrote: On Wed, Nov 01, 2017 at 05:53:27AM +, Wang Nan wrote: The meaning of perf record's "overwrite" option and of the many "overwrite"s in the source code is not clear. In perf's code, 'overwrite' has 2 meanings: 1. Make the ring buffer read-only (perf_evlist__mmap_ex's argument). 2. Set the evsel's "backward" attribute (in apply_config_terms). perf record doesn't use meaning 1 at all, but has an overwrite option whose real meaning is setting backward. This patch separates these two concepts and introduces a 'flightrecorder' mode, which is what we really want. It combines these 2 concepts together and wraps them into a record mode. In flight recorder mode, perf only dumps data before something happens. I'm ok with it, but removing the old name doesn't look good. How about keeping it for a while (as deprecated)? Is there a way to hide '--overwrite' from 'perf record --help' and print something when the user actually uses it? And 'flightrecorder' seems too long. Maybe you can use an acronym like FDR or fdr-mode? fdr-mode is a good name. Thank you.
Re: [PATCH 2/2] perf, tools: Don't force MetricExprs to lower case
On 2017/10/12 23:59, Jiri Olsa wrote: On Thu, Oct 12, 2017 at 11:31:51PM +0800, Wangnan (F) wrote: SNIP Ok. If it works it's fine for me. well it works, but it means that a bpf file cannot contain any directory part.. which im not sure is ok with bpf folks ;-) anyone? Sorry I didn't see this thread these days. Do you think adding a special escape character to suppress BPF name parsing in an event is a good idea? For example:

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
                     \___ Failed to load cpu/uops_executed.c from source: Error when compiling BPF scriptlet

Add a leading '@' to avoid BPF syntax:

% perf stat -e @cpu/uops_executed.core,cmask=1/ true
...

if we go this way, I'd rather mark the bpf syntax instead of changing the generic event format, like Andi suggested in some earlier email. but maybe we can workaround this with the patch I sent in my last email Great. I'm also working on a patch which checks the existence of the BPF file and rejects it if the file does not exist. The only inconvenience is that we need to call stat() many times, because bpf_object matches c[a-zA-Z0-9._]*. I think you have solved the problem in your patch. Thank you.
Re: [PATCH 2/2] perf, tools: Don't force MetricExprs to lower case
On 2017/10/9 22:39, Jiri Olsa wrote: On Mon, Oct 09, 2017 at 11:12:58AM -0300, Arnaldo Carvalho de Melo wrote: Em Mon, Oct 09, 2017 at 07:07:29AM -0700, Andi Kleen escreveu: On Mon, Oct 09, 2017 at 03:41:51PM +0200, Jiri Olsa wrote: On Wed, Oct 04, 2017 at 09:27:11AM -0700, Andi Kleen wrote: On Wed, Oct 04, 2017 at 12:30:52PM +0200, Jiri Olsa wrote: On Tue, Oct 03, 2017 at 01:06:05PM -0300, Arnaldo Carvalho de Melo wrote: Em Tue, Sep 12, 2017 at 12:56:43PM -0700, Andi Kleen escreveu: From: Andi Kleen

There are still problems with BPF misinterpreting some events that include .c. An earlier fix made it work for standalone aliases, but it still fails for more complex constructs.

Hi Wang, Jiri, can you please take a look at this and see if there is something we can do to help Andi? - Arnaldo

REJECT keeps trying a shorter and shorter string until .c is matched and it appears to be a valid BPF path:

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
             \___ Failed to load cpu/uops_executed.c from source: Error when compiling BPF scriptlet

I tried to fix it, but it exceeds my flex knowledge, because REJECT does not interact well with BEGIN states. The BPF syntax in its current form really causes an ambiguous grammar.

right, it looks like we allow a whole path (including the / character) for a BPF file, which messes up our pmu/.../ syntax. do we need that? (Cc-ed some bpf folks) if not, the attached patch seems to fix things.. otherwise we need to come up with another fix

I tried similar patches, but I always ran into more complex situations where it still matched incorrectly. e.g. try it with cpu/uops_executed.core,... vs uops_executed.core

hm, both work for me with the change:

perf stat -e cpu/uops_executed.core/ ls
perf stat -e uops_executed.core ls

Ok. If it works it's fine for me.
well it works, but it means that a bpf file cannot contain any directory part.. which I'm not sure is ok with bpf folks ;-) anyone?

Sorry I didn't see this thread these past few days. Do you think adding a special escape character to suppress BPF name parsing in an event is a good idea? For example:

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
             \___ Failed to load cpu/uops_executed.c from source: Error when compiling BPF scriptlet.

Add a leading '@' to avoid BPF syntax:

% perf stat -e @cpu/uops_executed.core,cmask=1/ true
...
Re: [PATCH] perf tool: Don't discard prev in backward mode
Hi Kan, please see if this patch works for you.

On 2017/10/14 7:16, Wang Nan wrote:

Perf record can switch output. The new output should only store the data after switching. However, in overwrite backward mode, the new output still has the data from the old output. That also brings extra overhead.

At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev if it has not been overwritten. That avoids processing duplicate data. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which probably includes the old processed data.

Avoid calling backward_rb_find_range() when md->prev is still available.

Signed-off-by: Wang Nan
Cc: Liang Kan
---
 tools/perf/util/mmap.c | 33 +++--
 1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..df1de55 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -287,18 +287,6 @@ static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64
 	return -1;
 }

-static int rb_find_range(void *data, int mask, u64 head, u64 old,
-			 u64 *start, u64 *end, bool backward)
-{
-	if (!backward) {
-		*start = old;
-		*end = head;
-		return 0;
-	}
-
-	return backward_rb_find_range(data, mask, head, start, end);
-}
-
 int perf_mmap__push(struct perf_mmap *md, bool overwrite, bool backward,
 		    void *to, int push(void *to, void *buf, size_t size))
 {
@@ -310,19 +298,28 @@ int perf_mmap__push(struct perf_mmap *md, bool overwrite, bool backward,
 	void *buf;
 	int rc = 0;

-	if (rb_find_range(data, md->mask, head, old, &start, &end, backward))
-		return -1;
+	start = backward ? head : old;
+	end = backward ? old : head;

 	if (start == end)
 		return 0;

 	size = end - start;
 	if (size > (unsigned long)(md->mask) + 1) {
-		WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");
+		if (!backward) {
+			WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");

-		md->prev = head;
-		perf_mmap__consume(md, overwrite || backward);
-		return 0;
+			md->prev = head;
+			perf_mmap__consume(md, overwrite || backward);
+			return 0;
+		}
+
+		/*
+		 * Backward ring buffer is full. We still have a chance to read
+		 * most of data from it.
+		 */
+		if (backward_rb_find_range(data, md->mask, head, &start, &end))
+			return -1;
 	}

 	if ((start & md->mask) + size != (end & md->mask)) {
Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly
On 2017/10/12 22:46, Wangnan (F) wrote: On 2017/10/12 22:45, Liang, Kan wrote: On 2017/10/12 20:56, Liang, Kan wrote: On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect. For example: [SNIP] In the above example we get the same records from the backward ring buffer all the time; overwriting is not triggered. This commit maps backward ring buffers read-only, making them overwritable. It is safe because we assume backward ring buffers are always overwritable in other parts of the code. After applying this patch:

$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
    --switch-output=1s --tail-synthesize dd if=/dev/zero of=/dev/null
[SNIP]

Signed-off-by: Wang Nan
Cc: Liang Kan
Cc: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Alexander Shishkin
Cc: Adrian Hunter
Cc: Andi Kleen
Cc: Li Zefan
---
 tools/perf/util/evlist.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }

 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
 	struct perf_evsel *evsel;
 	int revent;
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+	struct mmap_params *mp = _mp;
+	struct mmap_params backward_mp;

 	evlist__for_each_entry(evlist, evsel) {
 		struct perf_mmap *maps = evlist->mmap;
@@ -815,6 +817,9 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		if (evsel->attr.write_backward) {
 			output = _output_backward;
 			maps = evlist->backward_mmap;
+			backward_mp = *mp;
+			backward_mp.prot &= ~PROT_WRITE;
+			mp = &backward_mp;

 			if (!maps) {
 				maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite. How about the global --overwrite option?

Not only the per-event overwrite. See the example above. The overall --overwrite option is also respected. In perf_evsel__config, the per-event evsel 'backward' setting is set based on the overall '--overwrite' and the per-event '/overwrite/' setting.

But how about evlist->overwrite? I think it still keeps the wrong setting. The overwrite is implicitly applied. Some settings are inconsistent. Is there any drawback if you use opts->overwrite for perf_evlist__mmap_ex?

We will always face such inconsistency, because we have a /no-overwrite/ option which can be set per evsel. Setting evlist->overwrite won't make things more consistent, because in an evlist, different evsels can have different overwrite settings. A simple solution is making the evlist non-overwrite by default, and treating all overwrite evsels as special cases. Then we have only 2 cases to consider: 1. an overwrite evsel in a non-overwrite evlist; 2. a non-overwrite evsel in a non-overwrite evlist.

If evlist->overwrite is always non-overwrite, why not remove it?

Some testcases require it. Sorry, I think removing it is reasonable now, but we need to sort out the relationship between overwrite and backward first. I suggest removing the whole 'backward' concept and making evsels backward if they are overwrite. Are there any use cases for: 1. an overwrite but not backward ring buffer: it will be unparsable after the ring buffer is full; 2. a backward but not overwrite ring buffer: I don't see any advantage. Thank you.
Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly
On 2017/10/12 22:45, Liang, Kan wrote: On 2017/10/12 20:56, Liang, Kan wrote: On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect. For example: [SNIP] In the above example we get the same records from the backward ring buffer all the time; overwriting is not triggered. This commit maps backward ring buffers read-only, making them overwritable. It is safe because we assume backward ring buffers are always overwritable in other parts of the code. After applying this patch:

$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
    --switch-output=1s --tail-synthesize dd if=/dev/zero of=/dev/null
[SNIP]

Signed-off-by: Wang Nan
Cc: Liang Kan
Cc: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Alexander Shishkin
Cc: Adrian Hunter
Cc: Andi Kleen
Cc: Li Zefan
---
 tools/perf/util/evlist.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }

 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
 	struct perf_evsel *evsel;
 	int revent;
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+	struct mmap_params *mp = _mp;
+	struct mmap_params backward_mp;

 	evlist__for_each_entry(evlist, evsel) {
 		struct perf_mmap *maps = evlist->mmap;
@@ -815,6 +817,9 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		if (evsel->attr.write_backward) {
 			output = _output_backward;
 			maps = evlist->backward_mmap;
+			backward_mp = *mp;
+			backward_mp.prot &= ~PROT_WRITE;
+			mp = &backward_mp;

 			if (!maps) {
 				maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?

Not only the per-event overwrite. See the example above. The overall --overwrite option is also respected. In perf_evsel__config, the per-event evsel 'backward' setting is set based on the overall '--overwrite' and the per-event '/overwrite/' setting.

But how about evlist->overwrite? I think it still keeps the wrong setting. The overwrite is implicitly applied. Some settings are inconsistent. Is there any drawback if you use opts->overwrite for perf_evlist__mmap_ex?

We will always face such inconsistency, because we have a /no-overwrite/ option which can be set per evsel. Setting evlist->overwrite won't make things more consistent, because in an evlist, different evsels can have different overwrite settings. A simple solution is making the evlist non-overwrite by default, and treating all overwrite evsels as special cases. Then we have only 2 cases to consider: 1. an overwrite evsel in a non-overwrite evlist; 2. a non-overwrite evsel in a non-overwrite evlist.

If evlist->overwrite is always non-overwrite, why not remove it?

Some testcases require it. Thank you.
Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode
On 2017/10/12 20:49, Liang, Kan wrote:

From 8b058ea6977a97e5705aa2f64bdd014fd76d1247 Mon Sep 17 00:00:00 2001
From: Kan Liang
Date: Wed, 11 Oct 2017 07:39:34 -0700
Subject: [PATCH] perf tool: fix: Don't discard prev in backward mode

Perf record can switch output. The new output should only store the data after switching. However, in overwrite backward mode, the new output still has the data from the old output. That also brings extra overhead. At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev if it has not been overwritten. That avoids processing duplicate data. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which probably includes the old processed data.

Introduce a fast path for backward_rb_find_range: stop searching when md->prev is detected.

Signed-off-by: Kan Liang
---
 tools/perf/util/mmap.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..36b459a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -254,7 +254,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 	return 0;
 }

-static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64 *end)
+static int backward_rb_find_range(void *buf, int mask, u64 head,
+				  u64 old, u64 *start, u64 *end)
 {
 	struct perf_event_header *pheader;
 	u64 evt_head = head;
@@ -282,6 +283,12 @@ static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64
 		evt_head += pheader->size;
 		pr_debug3("move evt_head: %"PRIx64"\n", evt_head);
+
+		/* fast path: avoid to process duplicate data */
+		if (old == evt_head) {
+			*end = evt_head;
+			return 0;
+		}

You still need to parse the whole ring buffer.

I don't think so. I'm not using (old & mask). The upper bits of old are good enough to tell if there is overwriting. Here are some debugging logs.
The start and end are the results from backward_rb_find_range. If there is overwriting, it finishes with 'rewind', not 'fast path'.

sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
backward_rb_find_range: buf=0x7f9ea7a93000, head=f550
Finished reading backward ring buffer: fast path start 0xf550 end 0x0
backward_rb_find_range: buf=0x7f9ea7a93000, head=ab30
Finished reading backward ring buffer: rewind start 0xab30 end 0xbb10
backward_rb_find_range: buf=0x7f9ea7a93000, head=28b0
Finished reading backward ring buffer: rewind start 0x28b0 end 0x3898
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3e40
Finished reading backward ring buffer: rewind start 0xfffe3e40 end 0xfffe4e40
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3aa0
Finished reading backward ring buffer: fast path start 0xfffe3aa0 end 0xfffe3e40
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3470
Finished reading backward ring buffer: fast path start 0xfffe3470 end 0xfffe3aa0
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe0610
Finished reading backward ring buffer: rewind start 0xfffe0610 end 0xfffe1610
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe00d0
Finished reading backward ring buffer: fast path start 0xfffe00d0 end 0xfffe0610
backward_rb_find_range: buf=0x7f9ea7a93000, head=fffdfe90
Finished reading backward ring buffer: fast path start 0xfffdfe90 end 0xfffe00d0

I'm not saying your code is incorrect. What I mean is you can avoid calling backward_rb_find_range(), which parses the ring buffer, at all. Please see rb_find_range: in the case of a forward ring buffer, it never reads anything for positioning; it just passes the start and end pointers. Your implementation and mine both require reading the ring buffer to find the last record, which needs a while loop. When 'old - head' is less than the size of the ring buffer, even a backward ring buffer can avoid such reading. backward_rb_find_range() should be called only when 'old - head' is larger than the size of the ring buffer. Thank you.
Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly
On 2017/10/12 20:56, Liang, Kan wrote: On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect. For example: [SNIP] In the above example we get the same records from the backward ring buffer all the time; overwriting is not triggered. This commit maps backward ring buffers read-only, making them overwritable. It is safe because we assume backward ring buffers are always overwritable in other parts of the code. After applying this patch:

$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
    --switch-output=1s --tail-synthesize dd if=/dev/zero of=/dev/null
[SNIP]

Signed-off-by: Wang Nan
Cc: Liang Kan
Cc: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Alexander Shishkin
Cc: Adrian Hunter
Cc: Andi Kleen
Cc: Li Zefan
---
 tools/perf/util/evlist.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }

 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
 	struct perf_evsel *evsel;
 	int revent;
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+	struct mmap_params *mp = _mp;
+	struct mmap_params backward_mp;

 	evlist__for_each_entry(evlist, evsel) {
 		struct perf_mmap *maps = evlist->mmap;
@@ -815,6 +817,9 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		if (evsel->attr.write_backward) {
 			output = _output_backward;
 			maps = evlist->backward_mmap;
+			backward_mp = *mp;
+			backward_mp.prot &= ~PROT_WRITE;
+			mp = &backward_mp;

 			if (!maps) {
 				maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?

Not only the per-event overwrite. See the example above. The overall --overwrite option is also respected. In perf_evsel__config, the per-event evsel 'backward' setting is set based on the overall '--overwrite' and the per-event '/overwrite/' setting.

But how about evlist->overwrite? I think it still keeps the wrong setting. The overwrite is implicitly applied. Some settings are inconsistent. Is there any drawback if you use opts->overwrite for perf_evlist__mmap_ex?

We will always face such inconsistency, because we have a /no-overwrite/ option which can be set per evsel. Setting evlist->overwrite won't make things more consistent, because in an evlist, different evsels can have different overwrite settings. A simple solution is making the evlist non-overwrite by default, and treating all overwrite evsels as special cases. Then we have only 2 cases to consider: 1. an overwrite evsel in a non-overwrite evlist; 2. a non-overwrite evsel in a non-overwrite evlist.

If we reset evlist->overwrite according to --overwrite, we will have 4 cases to consider: 1. an overwrite evsel in an overwrite evlist; 2. a non-overwrite evsel in an overwrite evlist; 3. an overwrite evsel in a non-overwrite evlist; 4. a non-overwrite evsel in a non-overwrite evlist.

The real problem is: there are 'overwrite' and 'backward' concepts in our code, but these two concepts are neither independent nor identical. Thank you.

Thanks, Kan

I think we should use opts->overwrite to replace the hard-coded 'false' for perf_evlist__mmap_ex as well.

Thanks, Kan
Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode
On 2017/10/11 22:57, Liang, Kan wrote:

If you really want to avoid record duplication, you need to change record__mmap_read()'s logic. Now it complains "failed to keep up with mmap data" and avoids dumping data when the size of newly generated data is larger than the size of the ring buffer. That is reasonable for a forward ring buffer, because in that case you have lost the head of the first record and the whole ring buffer is unparseable. However, it is wrong in the backward case. What you should do in this case is dump the whole ring buffer. I think what you want should be something like this: (not tested)

No. That's not what I want. My test code never triggers the WARN_ONCE.

The existing code never triggers that warning, because the size computed by rb_find_range is never larger than the size of the ring buffer. After applying your patch, I believe it will trigger this WARN_ONCE and drop the whole ring buffer. Please set a smaller ring buffer and try again.

I think you will see the problem if you simply run the command below:

sudo ./perf record -e cycles:P -C0 --overwrite --switch-output=1s

The output size keeps increasing, because the new output always includes the old outputs. What I want is the 'start' and 'end' for the increase, not everything.

This is my test result: add '-m 1' to 'perf record' to shrink the ring buffer, and start a while loop on CPU 0 to increase the data rate. The output stops increasing after the ring buffer is full:

$:~/linux/tools/perf$ sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
Warning:
File /home/w00229757/.perfconfig not owned by current user or root, ignoring it.
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165072 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165175 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165278 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165381 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165484 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165586 ]
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Dump perf.data.2017101212165653 ]
[ perf record: Captured and wrote 1.013 MB perf.data. ]

$ ls -l ./perf.data*
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165072
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165175
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165278
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165381
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165484
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165586
-rw--- 1 root root 1067812 Oct 12 12:16 ./perf.data.2017101212165653

You see the result keeps getting larger because the ring buffer is never full in your case.

The increasing file size in my case indicates that the old processed data is dumped into the new output. I don't think it's right, because we should not process the same data multiple times. That definitely increases the overhead of perf record. For the issue I mentioned above, what do you think about the patch below? It tries to avoid the duplicate data.

From 8b058ea6977a97e5705aa2f64bdd014fd76d1247 Mon Sep 17 00:00:00 2001
From: Kan Liang
Date: Wed, 11 Oct 2017 07:39:34 -0700
Subject: [PATCH] perf tool: fix: Don't discard prev in backward mode

Perf record can switch output. The new output should only store the data after switching.
However, in overwrite backward mode, the new output still has the data from the old output. That also brings extra overhead. At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev if it has not been overwritten. That avoids processing duplicate data. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which probably includes the old processed data.

Introduce a fast path for backward_rb_find_range: stop searching when md->prev is detected.

Signed-off-by: Kan Liang
---
 tools/perf/util/mmap.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..36b459a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -254,7 +254,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)
 	return 0;
 }

-static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64 *end)
+static int backward_rb_find_range(void *buf, int mask, u64 head,
+				  u64 old, u64 *start, u64 *end)
 {
 	struct perf_event_header *pheader;
 	u64 evt_head = head;
@@ -282,6 +283,12 @@ static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64
 		evt_head += pheader->size;
Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly
On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect. For example: [SNIP] In the above example we get the same records from the backward ring buffer all the time; overwriting is not triggered. This commit maps backward ring buffers read-only, making them overwritable. It is safe because we assume backward ring buffers are always overwritable in other parts of the code. After applying this patch:

$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
    --switch-output=1s --tail-synthesize dd if=/dev/zero of=/dev/null
[SNIP]

Signed-off-by: Wang Nan
Cc: Liang Kan
Cc: Arnaldo Carvalho de Melo
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Jiri Olsa
Cc: Namhyung Kim
Cc: Alexander Shishkin
Cc: Adrian Hunter
Cc: Andi Kleen
Cc: Li Zefan
---
 tools/perf/util/evlist.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
 }

 static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-				       struct mmap_params *mp, int cpu_idx,
+				       struct mmap_params *_mp, int cpu_idx,
 				       int thread, int *_output, int *_output_backward)
 {
 	struct perf_evsel *evsel;
 	int revent;
 	int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+	struct mmap_params *mp = _mp;
+	struct mmap_params backward_mp;

 	evlist__for_each_entry(evlist, evsel) {
 		struct perf_mmap *maps = evlist->mmap;
@@ -815,6 +817,9 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
 		if (evsel->attr.write_backward) {
 			output = _output_backward;
 			maps = evlist->backward_mmap;
+			backward_mp = *mp;
+			backward_mp.prot &= ~PROT_WRITE;
+			mp = &backward_mp;

 			if (!maps) {
 				maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite. How about the global --overwrite option? Not only the per-event overwrite.
See the example above. The overall --overwrite option is also respected. In perf_evsel__config, the per-event evsel 'backward' setting is set based on the overall '--overwrite' and the per-event '/overwrite/' setting.

I think we should use opts->overwrite to replace the hard-coded 'false' for perf_evlist__mmap_ex as well.

Thanks, Kan
Re: [PATCH 03/10] perf tool: new interfaces to read event from ring buffer
On 2017/10/11 3:55, Liang, Kan wrote: On 2017/10/11 3:17, Arnaldo Carvalho de Melo wrote: Em Wed, Oct 11, 2017 at 03:10:37AM +0800, Wangnan (F) escreveu: On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote: Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu: Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu: Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu: Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com escreveu: From: Kan Liang

perf_evlist__mmap_read only supports forward mode. We need a common function that supports both forward and backward mode. perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell out the bug; please do so.

union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
{
	struct perf_mmap *md = &evlist->mmap[idx];  <--- it should be backward_mmap

If it fixes an existing bug, then it should go separately from this patchkit, right?

No one uses perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If there is a user now, lemme see... yeah, no user right now, so _that_ is yet another bug, i.e. it should be used, no? If this is just a leftover, then we should just throw it away now; it's a cleanup. Wang, can you take a look at these two issues?

So it looks like a leftover that should've been removed by the following cset, right Wang?

commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan
Date: Thu Jul 14 08:34:41 2016 +

    perf evlist: Drop evlist->backward

    Now there's no real user of evlist->backward. Drop it. We are going to
    use evlist->backward_mmap as a container for backward ring buffer.

Yes, it should be removed, but then there will be no counterpart to perf_evlist__mmap_read(), which reads a record from the forward ring buffer.
I think Kan wants to become the first user of this function, because he is trying to make 'perf top' utilize the backward ring buffer. It needs perf_evlist__mmap_read_backward(), and he triggered the bug using his unpublished patch set. I think we can remove it now and let Kan fix it and add it back in his 'perf top' patch set.

Well, if there will be a user, perhaps we should fix it, as it seems interesting to have now as, as you said, a counterpart for the forward ring buffer, and one that we have plans for using soon, right?

Right, if I understand Kan's patch 00/10 correctly. He said: ... But perf top need to switch to overwrite backward mode for good performance. ...

Yes, it will be used for the perf top optimization.

It will be great if you can fix it. You can fix it and post your fix together with your perf top patch set. I can write the code but I can't test it, because there is no user now.

I still think it's a good idea to have common interfaces for both perf record and other tools like perf top. rb_find_range/backward_rb_find_range could be shared. The mmap read code is similar.

Agree. Please keep going.

Thanks, Kan
Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode
On 2017/10/11 3:50, Liang, Kan wrote: On 2017/10/11 2:23, Wangnan (F) wrote: On 2017/10/11 1:20, kan.li...@intel.com wrote: From: Kan Liang

Perf record can switch output. The new output should only store the data after switching. However, in overwrite backward mode, the new output still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which definitely includes the old processed data.

Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang
---
 tools/perf/util/evlist.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int rb_find_range(void *data, int mask, u64 head, u64 old,
 			 u64 *start, u64 *end, bool backward)
 {
+	int ret;
+
 	if (!backward) {
 		*start = old;
 		*end = head;
 		return 0;
 	}

-	return backward_rb_find_range(data, mask, head, start, end);
+	ret = backward_rb_find_range(data, mask, head, start, end);
+
+	/*
+	 * The start and end from backward_rb_find_range is the range for all
+	 * valid data in ring buffer.
+	 * However, part of the data is processed previously.
+	 * Reset the end to drop the processed data
+	 */
+	*end = old;
+
[SNIP]

If you really want to avoid record duplication, you need to change record__mmap_read()'s logic. Now it complains "failed to keep up with mmap data" and avoids dumping data when the size of newly generated data is larger than the size of the ring buffer. That is reasonable for a forward ring buffer, because in that case you have lost the head of the first record and the whole ring buffer is unparseable. However, it is wrong in the backward case. What you should do in this case is dump the whole ring buffer.
I think what you want should be something like this: (not tested)

No. That's not what I want. My test code never triggers the WARN_ONCE.

The existing code never triggers that warning because the size computed by rb_find_range is never larger than the size of the ring buffer. After applying your patch, I believe it will trigger this WARN_ONCE and drop the whole ring buffer. Please set a smaller ring buffer and try again.

I think you will see the problem if you simply run the command below:

sudo ./perf record -e cycles:P -C0 --overwrite --switch-output=1s

The output size keeps increasing, because the new output always includes the old outputs. What I want is the 'start' and 'end' for the increase, not everything.

This is my test result: add '-m 1' to 'perf record' to shrink the ring buffer, and start a while loop on CPU 0 to increase the data rate. The output size stops increasing after the ring buffer is full:

$:~/linux/tools/perf$ sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
Warning: File /home/w00229757/.perfconfig not owned by current user or root, ignoring it.
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165072 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165175 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165278 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165381 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165484 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165586 ]
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Dump perf.data.2017101212165653 ]
[ perf record: Captured and wrote 1.013 MB perf.data.
]
$ ls -l ./perf.data*
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165072
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165175
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165278
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165381
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165484
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165586
-rw--- 1 root root 1067812 Oct 12 12:16 ./perf.data.2017101212165653

You see the result keeps getting larger because the ring buffer is never full in your case. Thank you.
Re: [PATCH 03/10] perf tool: new interfaces to read events from ring buffer
On 2017/10/11 3:17, Arnaldo Carvalho de Melo wrote:
Em Wed, Oct 11, 2017 at 03:10:37AM +0800, Wangnan (F) escreveu:
On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote:
Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu:
Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu:
Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu:
Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com escreveu:
From: Kan Liang

perf_evlist__mmap_read only supports forward mode. A common function is needed to support both forward and backward modes. perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell out the bug, please do so.

union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
{
	struct perf_mmap *md = &evlist->mmap[idx]; <--- it should be backward_mmap

If it fixes an existing bug, then it should go separate from this patchkit, right?

No one uses perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If there is a user now, lemme see... yeah, no user right now, so _that_ is yet another bug, i.e. it should be used, no? If this is just a leftover, then we should just throw it away now; it's a cleanup. Wang, can you take a look at these two issues?

So it looks like a leftover that should have been removed by the following cset, right Wang?

commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan
Date: Thu Jul 14 08:34:41 2016 +

    perf evlist: Drop evlist->backward

    Now there's no real user of evlist->backward. Drop it. We are going to
    use evlist->backward_mmap as a container for backward ring buffer.

Yes, it should be removed, but then there will be no counterpart to perf_evlist__mmap_read(), which reads a record from the forward ring buffer.
I think Kan wants to become the first user of this function, because he is trying to make 'perf top' utilize the backward ring buffer. It needs perf_evlist__mmap_read_backward(), and he triggered the bug using his unpublished patch set. I think we can remove it now and let Kan fix it and add it back in his 'perf top' patch set.

Well, if there will be a user, perhaps we should fix it, as it seems interesting to have now for, as you said, a counterpart to the forward ring buffer, and one that we have plans for using soon, right?

Right, if I understand Kan's patch 00/10 correctly. He said: ... But perf top needs to switch to overwrite backward mode for good performance. ...

Thank you.
Re: [PATCH 03/10] perf tool: new interfaces to read events from ring buffer
On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote:
Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu:
Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu:
Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu:
Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com escreveu:
From: Kan Liang

perf_evlist__mmap_read only supports forward mode. A common function is needed to support both forward and backward modes. perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell out the bug, please do so.

union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
{
	struct perf_mmap *md = &evlist->mmap[idx]; <--- it should be backward_mmap

If it fixes an existing bug, then it should go separate from this patchkit, right?

No one uses perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If there is a user now, lemme see... yeah, no user right now, so _that_ is yet another bug, i.e. it should be used, no? If this is just a leftover, then we should just throw it away now; it's a cleanup. Wang, can you take a look at these two issues?

So it looks like a leftover that should have been removed by the following cset, right Wang?

- Arnaldo

commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan
Date: Thu Jul 14 08:34:41 2016 +

    perf evlist: Drop evlist->backward

    Now there's no real user of evlist->backward. Drop it. We are going to
    use evlist->backward_mmap as a container for backward ring buffer.

Yes, it should be removed, but then there will be no counterpart to perf_evlist__mmap_read(), which reads a record from the forward ring buffer. I think Kan wants to become the first user of this function, because he is trying to make 'perf top' utilize the backward ring buffer.
It needs perf_evlist__mmap_read_backward(), and he triggered the bug using his unpublished patch set. I think we can remove it now and let Kan fix it and add it back in his 'perf top' patch set. Thank you.
Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode
On 2017/10/11 2:23, Wangnan (F) wrote:
On 2017/10/11 1:20, kan.li...@intel.com wrote:
From: Kan Liang

Perf record can switch output. The new output should only store the data after switching. However, in overwrite backward mode, the new output still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which necessarily includes the already-processed data. Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang
---
 tools/perf/util/evlist.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int rb_find_range(void *data, int mask, u64 head, u64 old,
 			 u64 *start, u64 *end, bool backward)
 {
+	int ret;
+
 	if (!backward) {
 		*start = old;
 		*end = head;
 		return 0;
 	}

-	return backward_rb_find_range(data, mask, head, start, end);
+	ret = backward_rb_find_range(data, mask, head, start, end);
+
+	/*
+	 * The start and end from backward_rb_find_range is the range for all
+	 * valid data in ring buffer.
+	 * However, part of the data is processed previously.
+	 * Reset the end to drop the processed data
+	 */
+	*end = old;
+

[SNIP]

If you really want to avoid record duplication, you need to change record__mmap_read()'s logic. Currently it complains "failed to keep up with mmap data" and avoids dumping data when the size of the newly generated data is larger than the size of the ring buffer. That is reasonable for a forward ring buffer, because in that case you have lost the head of the first record and the whole ring buffer is unparseable. However, it is wrong in the backward case. What you should do in that case is dump the whole ring buffer.
I think what you want should be something like this: (not tested)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ee7d0a8..f621a8e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -173,7 +173,9 @@ rb_find_range(void *data, int mask, u64 head, u64 old,
 		return 0;
 	}

-	return backward_rb_find_range(data, mask, head, start, end);
+	*start = head;
+	*end = old;
+	return 0;
 }

 static int
@@ -199,10 +201,15 @@ record__mmap_read(struct record *rec, struct perf_mmap *md,
 	size = end - start;
 	if (size > (unsigned long)(md->mask) + 1) {
-		WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");
+		if (!backward) {
+			WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");

-		md->prev = head;
-		perf_mmap__consume(md, overwrite || backward);
+			md->prev = head;
+			perf_mmap__consume(md, overwrite || backward);
+		} else {
+			backward_rb_find_range(data, md->mask, head, start, end);
+			/* FIXME: error processing */
+		}
 		return 0;
 	}

Use 'head' and 'old' to locate the data position in the ring buffer by default. If an overwrite happens, use backward_rb_find_range() to fetch the last available record and dump the whole ring buffer. Thank you.
Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode
On 2017/10/11 1:20, kan.li...@intel.com wrote:
From: Kan Liang

Perf record can switch output. The new output should only store the data after switching. However, in overwrite backward mode, the new output still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved in md->prev. The next mmap_read should end at md->prev. However, md->prev is discarded, so the next mmap_read has to process the whole valid ring buffer, which necessarily includes the already-processed data. Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang
---
 tools/perf/util/evlist.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int rb_find_range(void *data, int mask, u64 head, u64 old,
 			 u64 *start, u64 *end, bool backward)
 {
+	int ret;
+
 	if (!backward) {
 		*start = old;
 		*end = head;
 		return 0;
 	}

-	return backward_rb_find_range(data, mask, head, start, end);
+	ret = backward_rb_find_range(data, mask, head, start, end);
+
+	/*
+	 * The start and end from backward_rb_find_range is the range for all
+	 * valid data in ring buffer.
+	 * However, part of the data is processed previously.
+	 * Reset the end to drop the processed data
+	 */
+	*end = old;
+

I don't understand this patch. What rb_find_range() wants to do is find the 'start' and 'end' pointers, so that record__mmap_read() knows where to start reading and where to stop. In backward mode, the 'start' pointer is clearly 'head' (because 'head' is the header of the last record the kernel wrote to the ring buffer), but the 'end' pointer is not easy to find ('end' is the last, i.e. the earliest, available record in the ring buffer). We have to parse the whole ring buffer to find that last available record and set its tail into 'end'; this is what backward_rb_find_range() tries hard to do.
However, your patch unconditionally overwrites *end (which was just set by backward_rb_find_range()), making backward_rb_find_range() meaningless.

In my mind, when you decide to use a backward ring buffer you must make it overwrite, because you want the kernel to silently overwrite old records without waking up perf, while you remain able to parse the ring buffer even if overwriting happens. The use case should be: perf runs in the background, the kernel silently overwrites old records in the ring buffer until something bad happens, then perf dumps the 'events before the bad thing'. Similar to a flight recorder. In this use case, each dump is independent. Even if a record already appeared in a previous dump, it is harmless (and useful) for it to appear again in the following dump.

If you really want to avoid record duplication, you need to change record__mmap_read()'s logic. Currently it complains "failed to keep up with mmap data" and avoids dumping data when the size of the newly generated data is larger than the size of the ring buffer. That is reasonable for a forward ring buffer, because in that case you have lost the head of the first record and the whole ring buffer is unparseable. However, it is wrong in the backward case. What you should do in that case is dump the whole ring buffer.

+	return ret;
 }

 /*
Re: [PATCHv2] perf bpf: Fix endianness problem when loading parameters in prologue
On 2017/8/15 14:42, Thomas-Mich Richter wrote:
On 08/14/2017 06:39 PM, Arnaldo Carvalho de Melo wrote:
Em Mon, Aug 14, 2017 at 01:46:44PM +0200, Thomas Richter escreveu:

The perf BPF prologue generator unconditionally fetches 8 bytes for function parameters, which causes a problem on big-endian machines. Thomas gives a detailed analysis of this problem: http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts the data from memory to the expected type. Now the test runs successfully on 4.13.0-rc5:

[root@s8360046 perf]# ./perf test bpf
38: BPF filter :
38.1: Basic BPF filtering : Ok
38.2: BPF pinning : Ok
38.3: BPF prologue generation : Ok
38.4: BPF relocation checker : Ok
[root@s8360046 perf]#

Signed-off-by: Thomas Richter
Signed-off-by: Wang Nan
Acked-by: Wang Nan
Tested-by: Wang Nan

That is strange, who is the author of the patch? Also I think Tested-by is enough, being an even stronger form of Acked-by? But then you also have Signed-off-by: Wang in there...

From Documentation/process/submitting-patches.rst:

12) When to use Acked-by: and Cc:

The Signed-off-by: tag indicates that the signer was involved in the development of the patch, or that he/she was in the patch's delivery path. If a person was not directly involved in the preparation or handling of a patch but wishes to signify and record their approval of it then they can ask to have an Acked-by: line added to the patch's changelog. Acked-by: is often used by the maintainer of the affected code when that maintainer neither contributed to nor forwarded the patch.

If Wang wrote the original patch and you made it better working together with him, probably having both of you in Signed-off-by lines should be enough?

- Arnaldo

Ok, my fault then. Wang wrote the patch in the first place; I just fixed one line. Should I resend the patch and delete the Acked-by/Tested-by lines in the commit message?

Yes, please resend it. Thanks
Re: perf bpf: Reworked fix endianness problem when loading parameters in prologue
OK. On 2017/8/14 18:58, Thomas-Mich Richter wrote: On 08/14/2017 12:30 PM, Wangnan (F) wrote: Hi Thomas, Your patch looks good to me. I've tested in my environment and it works. Please resend it to lkml and let Arnaldo to collect it. Thank you. Will do, I take this as an Acked-by and Tested-by, ok? Thanks
Re: perf bpf: Reworked fix endianness problem when loading parameters in prologue
Hi Thomas,

Your patch looks good to me. I've tested it in my environment and it works. Please resend it to lkml and let Arnaldo collect it. Thank you.

On 2017/8/14 17:47, Thomas Richter wrote:
The perf BPF prologue generator unconditionally fetches 8 bytes for function parameters, which causes a problem on big-endian machines. Thomas gives a detailed analysis of this problem: http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts the data from memory to the expected type. Now the test runs successfully on 4.13.0-rc5:

[root@s8360046 perf]# ./perf test bpf
38: BPF filter :
38.1: Basic BPF filtering : Ok
38.2: BPF pinning : Ok
38.3: BPF prologue generation : Ok
38.4: BPF relocation checker : Ok
[root@s8360046 perf]#

Signed-off-by: Thomas Richter
Signed-off-by: Wang Nan
Cc: Arnaldo Carvalho de Melo
Cc: Thomas Richter
Cc: Alexei Starovoitov
Cc: Hendrik Brueckner
Cc: Li Zefan
---
 tools/perf/util/bpf-prologue.c | 49 --
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
index 1356220..827f914 100644
--- a/tools/perf/util/bpf-prologue.c
+++ b/tools/perf/util/bpf-prologue.c
@@ -58,6 +58,46 @@ check_pos(struct bpf_insn_pos *pos)
 	return 0;
 }

+/*
+ * Convert type string (u8/u16/u32/u64/s8/s16/s32/s64 ..., see
+ * Documentation/trace/kprobetrace.txt) to size field of BPF_LDX_MEM
+ * instruction (BPF_{B,H,W,DW}).
+ */
+static int
+argtype_to_ldx_size(const char *type)
+{
+	int arg_size = type ? atoi(&type[1]) : 64;
+
+	switch (arg_size) {
+	case 8:
+		return BPF_B;
+	case 16:
+		return BPF_H;
+	case 32:
+		return BPF_W;
+	case 64:
+	default:
+		return BPF_DW;
+	}
+}
+
+static const char *
+insn_sz_to_str(int insn_sz)
+{
+	switch (insn_sz) {
+	case BPF_B:
+		return "BPF_B";
+	case BPF_H:
+		return "BPF_H";
+	case BPF_W:
+		return "BPF_W";
+	case BPF_DW:
+		return "BPF_DW";
+	default:
+		return "UNKNOWN";
+	}
+}
+
 /* Give it a shorter name */
 #define ins(i, p) append_insn((i), (p))

@@ -258,9 +298,14 @@ gen_prologue_slowpath(struct bpf_insn_pos *pos,
 	}

 	/* Final pass: read to registers */
-	for (i = 0; i < nargs; i++)
-		ins(BPF_LDX_MEM(BPF_DW, BPF_PROLOGUE_START_ARG_REG + i,
+	for (i = 0; i < nargs; i++) {
+		int insn_sz = (args[i].ref) ? argtype_to_ldx_size(args[i].type) : BPF_DW;
+
+		pr_debug("prologue: load arg %d, insn_sz is %s\n",
+			 i, insn_sz_to_str(insn_sz));
+		ins(BPF_LDX_MEM(insn_sz, BPF_PROLOGUE_START_ARG_REG + i,
 				BPF_REG_FP, -BPF_REG_SIZE * (i + 1)), pos);
+	}

 	ins(BPF_JMP_IMM(BPF_JA, BPF_REG_0, 0, JMP_TO_SUCCESS_CODE), pos);
Re: [PATCH] perf bpf: Fix endianness problem when loading parameters in prologue
Hi Thomas,

Please try this patch on your machine and give me the result. Thank you.

On 2017/8/13 2:49, Wang Nan wrote:
The perf BPF prologue generator unconditionally fetches 8 bytes for function parameters, which causes a problem on big-endian machines. Thomas gives a detailed analysis of this problem: http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts the data from memory to the expected type.

Signed-off-by: Wang Nan
Cc: Arnaldo Carvalho de Melo
Cc: Thomas Richter
Cc: Alexei Starovoitov
Cc: Hendrik Brueckner
Cc: Li Zefan
---
 tools/perf/tests/bpf-script-test-prologue.c | 4 ++-
 tools/perf/util/bpf-prologue.c | 49 +++-
 2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/bpf-script-test-prologue.c b/tools/perf/tests/bpf-script-test-prologue.c
index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
 	(void *) 6;

 SEC("func=null_lseek file->f_mode offset orig")
-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
 			 unsigned long offset, unsigned long orig)
 {
+	fmode_t f_mode = (fmode_t)_f_mode;
+
 	if (err)
 		return 0;
 	if (f_mode & FMODE_WRITE)

diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
index 6cdbee1..ce28993 100644
--- a/tools/perf/util/bpf-prologue.c
+++ b/tools/perf/util/bpf-prologue.c
@@ -57,6 +57,46 @@ check_pos(struct bpf_insn_pos *pos)
 	return 0;
 }

+/*
+ * Convert type string (u8/u16/u32/u64/s8/s16/s32/s64 ..., see
+ * Documentation/trace/kprobetrace.txt) to size field of BPF_LDX_MEM
+ * instruction (BPF_{B,H,W,DW}).
+ */
+static int
+argtype_to_ldx_size(const char *type)
+{
+	int arg_size = type ? atoi(&type[1]) : 64;
+
+	switch (arg_size) {
+	case 8:
+		return BPF_B;
+	case 16:
+		return BPF_H;
+	case 32:
+		return BPF_W;
+	case 64:
+	default:
+		return BPF_DW;
+	}
+}
+
+static const char *
+insn_sz_to_str(int insn_sz)
+{
+	switch (insn_sz) {
+	case BPF_B:
+		return "BPF_B";
+	case BPF_H:
+		return "BPF_H";
+	case BPF_W:
+		return "BPF_W";
+	case BPF_DW:
+		return "BPF_DW";
+	default:
+		return "UNKNOWN";
+	}
+}
+
 /* Give it a shorter name */
 #define ins(i, p) append_insn((i), (p))

@@ -257,9 +297,14 @@ gen_prologue_slowpath(struct bpf_insn_pos *pos,
 	}

 	/* Final pass: read to registers */
-	for (i = 0; i < nargs; i++)
-		ins(BPF_LDX_MEM(BPF_DW, BPF_PROLOGUE_START_ARG_REG + i,
+	for (i = 0; i < nargs; i++) {
+		int insn_sz = argtype_to_ldx_size(args[i].type);
+
+		pr_debug("prologue: load arg %d, insn_sz is %s\n",
+			 i, insn_sz_to_str(insn_sz));
+		ins(BPF_LDX_MEM(insn_sz, BPF_PROLOGUE_START_ARG_REG + i,
 				BPF_REG_FP, -BPF_REG_SIZE * (i + 1)), pos);
+	}

 	ins(BPF_JMP_IMM(BPF_JA, BPF_REG_0, 0, JMP_TO_SUCCESS_CODE), pos);
Re: perf test BPF subtest bpf-prologue test fails on s390x
On 2017/8/11 17:17, Thomas-Mich Richter wrote: On 08/11/2017 07:19 AM, Wangnan (F) wrote: On 2017/8/11 2:13, Arnaldo Carvalho de Melo wrote: Thomas, please try to find who wrote that test and CC him, I'm doing it this time, Wang, can you please take a look at this? I also added lkml to the CC list, here we have more users of perf, lkml is the more developer centric perf list, as perf touches way too many places in the kernel. Best regards, - Arnaldo Em Wed, Aug 09, 2017 at 11:24:19AM +0200, Thomas Richter escreveu: I investigate perf test BPF for s390x and have a question regarding the 38.3 subtest (bpf-prologue test) which fails on s390x. When I turn on trace_printk in tests/bpf-script-test-prologue.c I see this output in /sys/kernel/debug/tracing/trace: [root@s8360047 perf]# cat /sys/kernel/debug/tracing/trace perf-30229 [000] d..2 170161.535791: : f_mode 2001d offset:0 orig:0 perf-30229 [000] d..2 170161.535809: : f_mode 6001f offset:0 orig:0 perf-30229 [000] d..2 170161.535815: : f_mode 6001f offset:1 orig:0 perf-30229 [000] d..2 170161.535819: : f_mode 2001d offset:1 orig:0 perf-30229 [000] d..2 170161.535822: : f_mode 2001d offset:2 orig:1 perf-30229 [000] d..2 170161.535825: : f_mode 6001f offset:2 orig:1 perf-30229 [000] d..2 170161.535828: : f_mode 6001f offset:3 orig:1 perf-30229 [000] d..2 170161.535832: : f_mode 2001d offset:3 orig:1 perf-30229 [000] d..2 170161.535835: : f_mode 2001d offset:4 orig:0 perf-30229 [000] d..2 170161.535841: : f_mode 6001f offset:4 orig:0 [...] There are 3 parameters the eBPF program tests/bpf-script-test-prologue.c accesses: f_mode (member of struct file at offset 140) offset and orig. They are parameters of the lseek() system call triggered in this test case in function llseek_loop(). What is really strange is the value of f_mode. It is an 8 byte value, whereas in the probe event it is defined as a 4 byte value. The lower 4 bytes are all zero and do not belong to member f_mode. 
The correct value should be 2001d for read-only and 6001f for read-write open mode.

Here is the output of the 'perf test -vv bpf' trace:

Try to find probe point from debuginfo.
Matched function: null_lseek [2d9310d]
Probe point found: null_lseek+0
Searching 'file' variable in context.
Converting variable file into trace event.
converting f_mode in file
f_mode type is unsigned int.
Opening /sys/kernel/debug/tracing//README write=0
Searching 'offset' variable in context.
Converting variable offset into trace event.
offset type is long long int.
Searching 'orig' variable in context.
Converting variable orig into trace event.
orig type is int.
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: p:perf_bpf_probe/func _text+8794224 f_mode=+140(%r2):x32 offset=%r3:s64 orig=%r4:s32

Thank you for your information. This is an endianness problem. Here f_mode is x32, so perf probe and the debuginfo in vmlinux are correct. However, the BPF prologue generator doesn't obey the type: it unconditionally fetches 8 bytes (on a 64-bit machine) and passes them as the parameter of bpf_func__null_lseek. This is reasonable, because all args should be unsigned long. However, to recover the original value, we need a cast. Please help me verify whether the following fix works on your platform:

diff --git a/tools/perf/tests/bpf-script-test-prologue.c b/tools/perf/tests/bpf-script-test-prologue.c
index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
 	(void *) 6;

 SEC("func=null_lseek file->f_mode offset orig")
-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
 			 unsigned long offset, unsigned long orig)
 {
+	fmode_t f_mode = (fmode_t)_f_mode;
+
 	if (err)
 		return 0;
 	if (f_mode & FMODE_WRITE)

Thank you.
Thanks for this patch; unfortunately it does not work. The result is the same as before. Here is the trace output from /sys/kernel/debug/tracing/trace:

perf-18119 [001] d..2 66832.765154: : f_mode 0 offset:0 orig:0
perf-18119 [001] d..2 66832.765189: : f_mode 0 offset:0 orig:0
perf-18119 [001] d..2 66832.765203: : f_mode 0 offset:1 orig:0
perf-18119 [001] d..2 66832.765216: : f_mode 0 offset:1 orig:0
perf-18119 [001] d..2 66832.765229: : f_mode 0 offset:2 orig:1
perf-18119 [001] d..2 66832.765242: : f_mode 0 offset:2 orig:1

So a plain cast on a big-endian platform doesn't work as I thought. However, when I alter your patch and right-shift _f_mode by 32 bits like this:
Re: perf test BPF subtest bpf-prologue test fails on s390x
On 2017/8/11 2:13, Arnaldo Carvalho de Melo wrote: Thomas, please try to find who wrote that test and CC him, I'm doing it this time, Wang, can you please take a look at this? I also added lkml to the CC list, here we have more users of perf, lkml is the more developer centric perf list, as perf touches way too many places in the kernel. Best regards, - Arnaldo Em Wed, Aug 09, 2017 at 11:24:19AM +0200, Thomas Richter escreveu: I investigate perf test BPF for s390x and have a question regarding the 38.3 subtest (bpf-prologue test) which fails on s390x. When I turn on trace_printk in tests/bpf-script-test-prologue.c I see this output in /sys/kernel/debug/tracing/trace: [root@s8360047 perf]# cat /sys/kernel/debug/tracing/trace perf-30229 [000] d..2 170161.535791: : f_mode 2001d offset:0 orig:0 perf-30229 [000] d..2 170161.535809: : f_mode 6001f offset:0 orig:0 perf-30229 [000] d..2 170161.535815: : f_mode 6001f offset:1 orig:0 perf-30229 [000] d..2 170161.535819: : f_mode 2001d offset:1 orig:0 perf-30229 [000] d..2 170161.535822: : f_mode 2001d offset:2 orig:1 perf-30229 [000] d..2 170161.535825: : f_mode 6001f offset:2 orig:1 perf-30229 [000] d..2 170161.535828: : f_mode 6001f offset:3 orig:1 perf-30229 [000] d..2 170161.535832: : f_mode 2001d offset:3 orig:1 perf-30229 [000] d..2 170161.535835: : f_mode 2001d offset:4 orig:0 perf-30229 [000] d..2 170161.535841: : f_mode 6001f offset:4 orig:0 [...] There are 3 parameters the eBPF program tests/bpf-script-test-prologue.c accesses: f_mode (member of struct file at offset 140) offset and orig. They are parameters of the lseek() system call triggered in this test case in function llseek_loop(). What is really strange is the value of f_mode. It is an 8 byte value, whereas in the probe event it is defined as a 4 byte value. The lower 4 bytes are all zero and do not belong to member f_mode. The correct value should be 2001d for read-only and 6001f for read-write open mode. 
Here is the output of the 'perf test -vv bpf' trace:

Try to find probe point from debuginfo.
Matched function: null_lseek [2d9310d]
Probe point found: null_lseek+0
Searching 'file' variable in context.
Converting variable file into trace event.
converting f_mode in file
f_mode type is unsigned int.
Opening /sys/kernel/debug/tracing//README write=0
Searching 'offset' variable in context.
Converting variable offset into trace event.
offset type is long long int.
Searching 'orig' variable in context.
Converting variable orig into trace event.
orig type is int.
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: p:perf_bpf_probe/func _text+8794224 f_mode=+140(%r2):x32 offset=%r3:s64 orig=%r4:s32

Thank you for your information. This is an endianness problem. Here f_mode is x32, so perf probe and the debuginfo in vmlinux are correct. However, the BPF prologue generator doesn't obey the type: it unconditionally fetches 8 bytes (on a 64-bit machine) and passes them as the parameter of bpf_func__null_lseek. This is reasonable, because all args should be unsigned long. However, to recover the original value, we need a cast. Please help me verify whether the following fix works on your platform:

diff --git a/tools/perf/tests/bpf-script-test-prologue.c b/tools/perf/tests/bpf-script-test-prologue.c
index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
 	(void *) 6;

 SEC("func=null_lseek file->f_mode offset orig")
-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
 			 unsigned long offset, unsigned long orig)
 {
+	fmode_t f_mode = (fmode_t)_f_mode;
+
 	if (err)
 		return 0;
 	if (f_mode & FMODE_WRITE)

Thank you.
Re: change uprobe_events default ? Was: [PATCH] perf: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS
On 2017/3/16 9:06, Arnaldo Carvalho de Melo wrote:
Added more people to the CC list.

Em Wed, Mar 15, 2017 at 05:58:19PM -0700, Alexei Starovoitov escreveu:
On Thu, Feb 16, 2017 at 05:00:50PM +1100, Anton Blanchard wrote:
We have uses of CONFIG_UPROBE_EVENT and CONFIG_KPROBE_EVENT as well as CONFIG_UPROBE_EVENTS and CONFIG_KPROBE_EVENTS. Consistently use the plurals.

This rename made me notice that UPROBE_EVENTS still defaults to 'n'. This is a key feature that all distros enable, so having it default to 'n' is effectively saying that it's not something that should be turned on or used widely, which is obviously not the case. IMO it's time to change it to 'y'. Thoughts?

Agreed, I also found it strange that it was disabled by default when I recently did a 'make oldconfig' :-\

- Arnaldo

+1. Some internal distros in my company have now started discussing support for user-space probing. Turning uprobes on by default in the upstream kernel saves us a lot of discussion. Thank you.
Re: [PATCH net-next v1] bpf: Remove redundant ifdef
On 2017/2/15 1:07, David Miller wrote:
From: "Wangnan (F)"
Date: Mon, 13 Feb 2017 09:53:49 +0800

On 2017/2/12 3:37, Mickaël Salaün wrote:
Remove a useless ifdef __NR_bpf as requested by Wang Nan. Inline one-line static functions as they were in the bpf_sys.h file.

Signed-off-by: Mickaël Salaün
Cc: Alexei Starovoitov
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Wang Nan
Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
---
 tools/lib/bpf/bpf.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 50e04cc5..2de9c386989a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,21 +42,15 @@
 # endif
 #endif

-static __u64 ptr_to_u64(const void *ptr)
+static inline __u64 ptr_to_u64(const void *ptr)
 {
 	return (__u64) (unsigned long) ptr;
 }

-static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
-		   unsigned int size)
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+			  unsigned int size)
 {
-#ifdef __NR_bpf
 	return syscall(__NR_bpf, cmd, attr, size);
-#else
-	fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
-	errno = ENOSYS;
-	return -1;
-#endif
 }

 int bpf_create_map(enum bpf_map_type map_type, int key_size,

Acked-by: Wang Nan

However, it is better to merge this patch with commit 702498a1426bc95b6f49f9c5fba616110cbd3947.

I don't know what this commit ID is. Since this patch is targeting net-next, I would expect a commit ID given with no context to be in that tree. Please always specify where the commit ID you mention is.

It is "bpf: Remove bpf_sys.h from selftests" in net-next.

Furthermore, commits in net-next are permanent, so it is not possible afterwards to "merge this patch with commit X".

I understand. Maintainers sometimes reset their HEAD to an earlier version and amend commits to keep the history clean, but clearly net-next never does this. Thank you.
Re: [PATCH net-next v1] bpf: Remove redundant ifdef
On 2017/2/12 3:37, Mickaël Salaün wrote: Remove a useless ifdef __NR_bpf as requested by Wang Nan. Inline one-line static functions as it was in the bpf_sys.h file. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: David S. Miller Cc: Wang Nan Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com --- tools/lib/bpf/bpf.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 50e04cc5..2de9c386989a 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -42,21 +42,15 @@ # endif #endif -static __u64 ptr_to_u64(const void *ptr) +static inline __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; } -static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, - unsigned int size) +static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, + unsigned int size) { -#ifdef __NR_bpf return syscall(__NR_bpf, cmd, attr, size); -#else - fprintf(stderr, "No bpf syscall, kernel headers too old?\n"); - errno = ENOSYS; - return -1; -#endif } int bpf_create_map(enum bpf_map_type map_type, int key_size, Acked-by: Wang Nan However, it is better to merge this patch with commit 702498a1426bc95b6f49f9c5fba616110cbd3947. Thank you.
Re: [PATCH v4 0/3] Miscellaneous fixes for BPF (perf tree)
On 2017/2/9 4:27, Mickaël Salaün wrote: This series brings some fixes and small improvements to the BPF samples. This is intended for the perf tree and apply on 7a5980f9c006 ("tools lib bpf: Add missing header to the library"). Changes since v3: * remove applied patch 1/5 * remove patch 2/5 on bpf_load_program() as requested by Wang Nan Changes since v2: * add this cover letter Changes since v1: * exclude patches not intended for the perf tree Regards, Mickaël Salaün (3): samples/bpf: Ignore already processed ELF sections samples/bpf: Reset global variables samples/bpf: Add missing header samples/bpf/bpf_load.c | 7 +++ samples/bpf/tracex5_kern.c | 1 + 2 files changed, 8 insertions(+) Looks good to me. Thank you.
Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library
On 2017/2/10 10:25, Wangnan (F) wrote: On 2017/2/10 7:21, Mickaël Salaün wrote: Replace bpf_prog_load() with bpf_load_program() calls. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Shuah Khan --- tools/lib/bpf/bpf.c | 6 +++--- tools/lib/bpf/bpf.h | 4 ++-- tools/testing/selftests/bpf/Makefile| 4 +++- tools/testing/selftests/bpf/bpf_sys.h | 21 - tools/testing/selftests/bpf/test_tag.c | 6 -- tools/testing/selftests/bpf/test_verifier.c | 8 +--- 6 files changed, 17 insertions(+), 32 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 3ddb58a36d3c..58ce252073fa 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -42,7 +42,7 @@ # endif #endif -static __u64 ptr_to_u64(void *ptr) +static __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; } @@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, - size_t insns_cnt, char *license, +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, + size_t insns_cnt, const char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz) { int fd; diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index a2f9853dd882..bc959a2de023 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, /* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE 65536 -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, - size_t insns_cnt, char *license, +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, + size_t insns_cnt, const char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz); For libbpf changes: And for similar code in patch 5-8: Acked-by Wang Nan Thank you. Acked-by Wang Nan Thank you.
Re: [PATCH net-next v5 10/11] bpf: Remove bpf_sys.h from selftests
On 2017/2/10 7:21, Mickaël Salaün wrote: Add require dependency headers. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Shuah Khan --- tools/lib/bpf/bpf.c | 6 ++ tools/testing/selftests/bpf/bpf_sys.h | 27 --- tools/testing/selftests/bpf/test_lpm_map.c | 1 - tools/testing/selftests/bpf/test_lru_map.c | 1 - tools/testing/selftests/bpf/test_maps.c | 1 - tools/testing/selftests/bpf/test_tag.c | 3 +-- tools/testing/selftests/bpf/test_verifier.c | 4 ++-- 7 files changed, 9 insertions(+), 34 deletions(-) delete mode 100644 tools/testing/selftests/bpf/bpf_sys.h diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index f8a2b7fa7741..50e04cc5 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -50,7 +50,13 @@ static __u64 ptr_to_u64(const void *ptr) static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size) { +#ifdef __NR_bpf return syscall(__NR_bpf, cmd, attr, size); +#else + fprintf(stderr, "No bpf syscall, kernel headers too old?\n"); + errno = ENOSYS; + return -1; +#endif } We don't need check __NR_bpf again. It has already been checked at the header of this file: #ifndef __NR_bpf # if defined(__i386__) # define __NR_bpf 357 # elif defined(__x86_64__) # define __NR_bpf 321 # elif defined(__aarch64__) # define __NR_bpf 280 # else # error __NR_bpf not defined. libbpf does not support your arch. # endif #endif Thank you.
Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library
On 2017/2/10 7:21, Mickaël Salaün wrote: Replace bpf_prog_load() with bpf_load_program() calls. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: Shuah Khan --- tools/lib/bpf/bpf.c | 6 +++--- tools/lib/bpf/bpf.h | 4 ++-- tools/testing/selftests/bpf/Makefile| 4 +++- tools/testing/selftests/bpf/bpf_sys.h | 21 - tools/testing/selftests/bpf/test_tag.c | 6 -- tools/testing/selftests/bpf/test_verifier.c | 8 +--- 6 files changed, 17 insertions(+), 32 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 3ddb58a36d3c..58ce252073fa 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -42,7 +42,7 @@ # endif #endif -static __u64 ptr_to_u64(void *ptr) +static __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; } @@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); } -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, -size_t insns_cnt, char *license, +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, +size_t insns_cnt, const char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz) { int fd; diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index a2f9853dd882..bc959a2de023 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, /* Recommend log buffer size */ #define BPF_LOG_BUF_SIZE 65536 -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, -size_t insns_cnt, char *license, +int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns, +size_t insns_cnt, const char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz); For libbpf changes: Acked-by Wang Nan Thank you.
Re: [PATCH v2 1/5] bpf: Add missing header to the library
Please add me to the cc list of all 5 patches. Thank you.

On 2017/2/7 4:40, Mickaël Salaün wrote: Include stddef.h to define size_t. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Wang Nan --- tools/lib/bpf/bpf.h | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index a2f9853dd882..df6e186da788 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -22,6 +22,7 @@ #define __BPF_BPF_H #include +#include <stddef.h> int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries, __u32 map_flags);
Re: [PATCH v3 1/5] bpf: Add missing header to the library
On 2017/2/8 4:56, Mickaël Salaün wrote: Include stddef.h to define size_t. Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Wang Nan --- tools/lib/bpf/bpf.h | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index a2f9853dd882..df6e186da788 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -22,6 +22,7 @@ #define __BPF_BPF_H #include +#include <stddef.h> int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size, int max_entries, __u32 map_flags);

Looks good to me. Thank you.
Re: [PATCH v3 2/5] bpf: Simplify bpf_load_program() error handling in the library
On 2017/2/8 4:56, Mickaël Salaün wrote: Do not call bpf(2) a second time when a program load fails.

BPF_PROG_LOAD should succeed most of the time. Setting log_level to 0 by default and requiring the log buffer only on failure makes the normal case faster. Thank you.

Signed-off-by: Mickaël Salaün Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Daniel Borkmann Cc: Wang Nan --- tools/lib/bpf/bpf.c | 18 ++ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 3ddb58a36d3c..fda3f494f1cd 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -73,7 +73,6 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, size_t insns_cnt, char *license, __u32 kern_version, char *log_buf, size_t log_buf_sz) { - int fd; union bpf_attr attr; bzero(&attr, sizeof(attr)); @@ -81,20 +80,15 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, attr.insn_cnt = (__u32)insns_cnt; attr.insns = ptr_to_u64(insns); attr.license = ptr_to_u64(license); - attr.log_buf = ptr_to_u64(NULL); - attr.log_size = 0; - attr.log_level = 0; + attr.log_buf = ptr_to_u64(log_buf); + attr.log_size = log_buf_sz; attr.kern_version = kern_version; - fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); - if (fd >= 0 || !log_buf || !log_buf_sz) - return fd; + if (log_buf && log_buf_sz > 0) { + attr.log_level = 1; + log_buf[0] = 0; + } - /* Try again with log */ - attr.log_buf = ptr_to_u64(log_buf); - attr.log_size = log_buf_sz; - attr.log_level = 1; - log_buf[0] = 0; return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); }
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/25 9:16, Joe Stringer wrote: On 24 January 2017 at 17:06, Wangnan (F) wrote: On 2017/1/25 9:04, Wangnan (F) wrote: On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- [SNIP] +int bpf_program__pin(struct bpf_program *prog, const char *path)

In your next patch please let the caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then you can wrap it with another function to pin all instances, and implement the naming schema (%s_%d) there. Then implement the naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d

Is it possible to use a directory tree instead? %s/object/mapname %s/object/prog/instance

I don't think objects have names, so let's assume an object with two program instances named foo, and one map named bar. A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with the following files and directories: /sys/fs/bpf/myobj/foo/1 /sys/fs/bpf/myobj/foo/2 /sys/fs/bpf/myobj/bar Alternatively, if you want to control exactly where you want the progs/maps to be pinned, you can call eg bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that instance will be mounted to /sys/fs/bpf/wherever, or alternatively bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with /sys/fs/bpf/foo/{0,1}. This looks pretty reasonable to me.

It looks good to me. Thank you.
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/25 9:04, Wangnan (F) wrote: On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- [SNIP] +int bpf_program__pin(struct bpf_program *prog, const char *path)

In your next patch please let the caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then you can wrap it with another function to pin all instances, and implement the naming schema (%s_%d) there. Then implement the naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d

Is it possible to use a directory tree instead? %s/object/mapname %s/object/prog/instance

Thank you.
Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()
On 2017/1/23 9:11, Joe Stringer wrote: Add a new API to pin a BPF program to the filesystem. The user can specify the path full path within a BPF filesystem to pin the program. Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2', and so on. Signed-off-by: Joe Stringer --- v2: Don't automount BPF filesystem Split program, map, object pinning into separate APIs and separate patches. --- tools/lib/bpf/libbpf.c | 76 ++ tools/lib/bpf/libbpf.h | 1 + 2 files changed, 77 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index e6cd62b1264b..eea5c74808f7 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -4,6 +4,7 @@ * Copyright (C) 2013-2015 Alexei Starovoitov * Copyright (C) 2015 Wang Nan * Copyright (C) 2015 Huawei Inc. + * Copyright (C) 2017 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -22,6 +23,7 @@ #include #include #include +#include #include #include #include @@ -31,7 +33,10 @@ #include #include #include +#include #include +#include +#include #include #include @@ -1237,6 +1242,77 @@ int bpf_object__load(struct bpf_object *obj) return err; } +static int check_path(const char *path) +{ + struct statfs st_fs; + char *dname, *dir; + int err = 0; + + if (path == NULL) + return -EINVAL; + + dname = strdup(path); + dir = dirname(dname); + if (statfs(dir, &st_fs)) { + pr_warning("failed to statfs %s: %s\n", dir, strerror(errno)); + err = -errno; + } + free(dname); + + if (!err && st_fs.f_type != BPF_FS_MAGIC) { + pr_warning("specified path %s is not on BPF FS\n", path); + err = -EINVAL; + } + + return err; +} + +int bpf_program__pin(struct bpf_program *prog, const char *path) In your next patch please let caller select one instance: int bpf_program__pin(struct bpf_program *prog, int instance, const char *path) (please choose a better name) Then your can wrap it with another function to pin all instances, implement 
naming schema (%s_%d) there. Then implement naming schema in bpf_object__pin like: %s/objectname_mapname %s/objectname_progname_%d and make a bpf_{map,program}__pin the raw pinning functions with no naming schema. Thank you.
Re: [PATCH perf/core 0/6] Libbpf improvements
On 2017/1/19 7:57, Joe Stringer wrote: Patch 1 fixes an issue when using drastically different BPF map definitions inside ELFs from a client using libbpf, vs the map definition libbpf uses. Patch 2 is a trivial typo fix. Patches 3-5 add some simple, useful helper functions for setting prog type and retrieving libbpf errors without depending on kernel headers from userspace programs. Patch 6 adds a new pinning functionality for an entire object. Calling bpf_object__pin(obj, subpath) will mount all programs from the BPF object to $bpf_fs_path/$subpath/progs/$progname, and all maps from the BPF object to $bpf_fs_path/$subpath/maps/$mapname. The first program with a particular name will be mounted with its progname and the suffix "_0"; subsequent programs with the same will have "_1", and so on; duplicate maps with the same name are disallowed. Joe Stringer (6): tools lib bpf: Fix map offsets in relocation tools lib bpf: Fix grammar in map_idx warning tools lib bpf: Define prog_type fns with macro tools lib bpf: Add set/is helpers for all prog types tools lib bpf: Add libbpf_get_error() tools lib bpf: Add bpf_object__pin() tools/lib/bpf/libbpf.c | 197 +--- tools/lib/bpf/libbpf.h | 15 +++- tools/perf/tests/llvm.c | 2 +- 3 files changed, 185 insertions(+), 29 deletions(-) For patch 3,4 and 5: Acked-by: Wang Nan See my comment to commits 1, 2 and 6. Thank you.
Re: [PATCH perf/core 6/6] tools lib bpf: Add bpf_object__pin()
On 2017/1/19 7:57, Joe Stringer wrote: Add a new API to pin a BPF object to the filesystem. The user can specify a subdirectory under the BPF filesystem to pin these programs. For example, with the subdirectory 'foo', programs and maps are pinned: /sys/fs/bpf/foo/progs/PROG_NAME /sys/fs/bpf/foo/maps/MAP_NAME If the user has specified an alternative BPF filesystem mountpoint via /proc/mounts, that will be read and used instead. Signed-off-by: Joe Stringer --- tools/lib/bpf/libbpf.c | 136 + tools/lib/bpf/libbpf.h | 1 + 2 files changed, 137 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 6b651c19870d..181dca0bdacb 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -4,6 +4,7 @@ * Copyright (C) 2013-2015 Alexei Starovoitov * Copyright (C) 2015 Wang Nan * Copyright (C) 2015 Huawei Inc. + * Copyright (C) 2016 Nicira, Inc. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -31,7 +32,12 @@ #include #include #include +#include #include +#include +#include +#include +#include #include #include @@ -1231,6 +1237,136 @@ int bpf_object__load(struct bpf_object *obj) return err; } +#define stringize(x) #x + +static int mount_bpf(char *path) We can use tools/lib/api/fs/fs.c to help us mounting filesystems. Try not reinventing it. +{ + const char *mounts_fmt = "%*s %"stringize(PATH_MAX)"s %#"stringize(NAME_MAX)"s %*s\n"; + struct statfs st_fs; + char type[NAME_MAX]; + int err = 0; + FILE *fp; + + /* Populate 'path'. 
*/ + fp = fopen("/proc/mounts", "r"); + if (fp) { + while (fscanf(fp, mounts_fmt, path, type) == 2) { + if (!strcmp(type, "bpf")) + break; + } + if (fclose(fp)) { + err = -errno; + pr_warning("failed to close /proc/mounts: %s\n", + strerror(errno)); + } + if (strcmp(type, "bpf")) { + err = -errno; + pr_debug("failed to find bpf mount\n"); + } + } else { + err = -errno; + pr_warning("cannot open /proc/mounts: %s\n", strerror(errno)); + } + if (err) { + pr_debug("using /sys/fs/bpf for BPF filesystem mountpoint\n"); + strcpy(path, "/sys/fs/bpf"); + } + + if (!statfs(path, &st_fs) && st_fs.f_type == BPF_FS_MAGIC) + return 0; + + if (mount("bpf", path, "bpf", 0, NULL)) { + pr_warning("failed to mount bpf: %s\n", strerror(errno)); + return -errno; + } + + return 0; +} + +static int make_dirs(const char *path, const char *subdir, char *buf, +size_t len) +{ + snprintf(buf, len, "%s/%s/", path, subdir); + if (mkdir(buf, 0700) && errno != EEXIST) { + pr_warning("failed to mkdir %s: %s\n", subdir, strerror(errno)); + return -errno; + } + + snprintf(buf, len, "%s/%s/maps/", path, subdir); + if (mkdir(buf, 0700) && errno != EEXIST) { + pr_warning("failed to mkdir map: %s\n", strerror(errno)); + return -errno; + } + + snprintf(buf, len, "%s/%s/progs/", path, subdir); + if (mkdir(buf, 0700) && errno != EEXIST) { + pr_warning("failed to mkdir prog: %s\n", strerror(errno)); + return -errno; + } + + return 0; +} + +int bpf_object__pin(struct bpf_object *obj, const char *subdir) +{ We must pin a whole object? I suggest we provide bpf_program__pin and bpf_map__pin, and build bpf_object__pin on top of them. Thank you. 
+ const size_t len = 255; + char path[PATH_MAX]; + char buf[len]; + size_t i, j; + int err = 0; + + if (!obj) + return -ENOENT; + + if (!obj->loaded) { + pr_warning("object not yet loaded; load it first\n"); + return -ENOENT; + } + + err = mount_bpf(path); + if (err) + return err; + + err = make_dirs(path, subdir, buf, len); + if (err) + return err; + + for (i = 0; i < obj->nr_maps; i++) { + struct bpf_map *map = &obj->maps[i]; + + snprintf(buf, len, "%s/%s/maps/%s", path, subdir, +bpf_map__name(map)); + if (bpf_obj_pin(map->fd, buf)) { + err = -errno; + pr_warning("failed to pin map: %s\n", strerror(errno)); + goto out; + } + pr_debug("Loaded map '%s' at '%s'\n", bpf_map__name(map), buf); + } + + if (obj->programs && obj->nr_programs) { +
Re: [PATCH perf/core 2/6] tools lib bpf: Fix grammar in map_idx warning
On 2017/1/19 7:57, Joe Stringer wrote: "Larger" is the comparative adjective form of "large". Signed-off-by: Joe Stringer --- tools/lib/bpf/libbpf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 350ee4c59f85..653b1b368deb 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -834,7 +834,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, map_idx = sym.st_value / prog->obj->map_len; if (map_idx >= nr_maps) { - pr_warning("bpf relocation: map_idx %d large than %d\n", + pr_warning("bpf relocation: map_idx %d larger than %d\n",

With my change to 1/6, this message becomes meaningless. We can change it to: bpf relocation: failed to relocate instruction %u

Thank you.

(int)map_idx, (int)nr_maps - 1); return -LIBBPF_ERRNO__RELOC; }
Re: [PATCH perf/core 1/6] tools lib bpf: Fix map offsets in relocation
On 2017/1/19 7:57, Joe Stringer wrote: Commit 4708bbda5cb2 ("tools lib bpf: Fix maps resolution") attempted to fix map resolution by identifying the number of symbols that point to maps, and using this number to resolve each of the maps. However, during relocation the original definition of the map size was still in use.

Oops... I didn't realize that.

For up to two maps, the calculation was correct if there was a small difference in size between the map definition in libbpf and the one that the client library uses. However, if the difference was large, particularly if more than two maps were used in the BPF program, the relocation would fail. For example, when using a map definition with size 28, with three maps, map relocation would count (sym_offset / sizeof(struct bpf_map_def) => map_idx):

( 0 / 16 => 0), ie map_idx = 0
(28 / 16 => 1), ie map_idx = 1
(56 / 16 => 3), ie map_idx = 3

So, libbpf reports: libbpf: bpf relocation: map_idx 3 large than 2

Understood.

Fix map relocation by tracking the size of the map definition during map counting, then reuse this instead of the size of the libbpf map structure.

That adds a restriction that the map size is fixed within one object, but we don't have to do that. During relocation, we can get exact map positioning information from sym.st_value. I think finding the map with the exact offset in the map array would be better. I'll write a patch to replace this one. Thank you.

With this patch applied, libbpf will accept such ELFs.
Fixes: 4708bbda5cb2 ("tools lib bpf: Fix maps resolution") Signed-off-by: Joe Stringer --- tools/lib/bpf/libbpf.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 84e6b35da4bd..350ee4c59f85 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -201,6 +201,7 @@ struct bpf_object { size_t nr_programs; struct bpf_map *maps; size_t nr_maps; + size_t map_len; bool loaded; @@ -379,6 +380,7 @@ static struct bpf_object *bpf_object__new(const char *path, obj->efile.maps_shndx = -1; obj->loaded = false; + obj->map_len = sizeof(struct bpf_map_def); INIT_LIST_HEAD(&obj->list); list_add(&obj->list, &bpf_objects_list); @@ -602,6 +604,7 @@ bpf_object__init_maps(struct bpf_object *obj) return -ENOMEM; } obj->nr_maps = nr_maps; + obj->map_len = data->d_size / nr_maps; /* * fill all fd with -1 so won't close incorrect @@ -829,7 +832,7 @@ bpf_program__collect_reloc(struct bpf_program *prog, return -LIBBPF_ERRNO__RELOC; } - map_idx = sym.st_value / sizeof(struct bpf_map_def); + map_idx = sym.st_value / prog->obj->map_len; if (map_idx >= nr_maps) { pr_warning("bpf relocation: map_idx %d large than %d\n", (int)map_idx, (int)nr_maps - 1);
Re: [PATCH 0/7] perf tools: Add switch-output size and time threshold options
On 2017/1/3 16:19, Jiri Olsa wrote: hi, adding a way to configure switch data output for size and time, like:

$ sudo perf record -e 'sched:*' --switch-output=10M -avg
callchain: type FP
switch-output with 10M size threshold
mmap size 528384B
[ perf record: dump data: Woken up 37 times ]
[ perf record: Dump perf.data.2017010309135512 ]
[ perf record: dump data: Woken up 39 times ]
[ perf record: Dump perf.data.2017010309135771 ]
[ perf record: dump data: Woken up 38 times ]
[ perf record: Dump perf.data.2017010309140005 ]
^C[ perf record: Woken up 16 times to write data ]
[ perf record: Dump perf.data.2017010309140111 ]
[ perf record: Captured and wrote 4.748 MB perf.data. ]

... the default for the switch-output option stays and does the SIGUSR2 output switch. Also available in: git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git perf/fixes thanks, jirka Cc: Wang Nan ---

Good functionality, and thank you for fixing the documentation. You didn't cc patches 1 and 2 to me.

Tested-by: Wang Nan

Jiri Olsa (7):
tools lib subcmd: Add OPT_STRING_OPTARG_SET option
perf record: Make __record_options static
perf record: Fix --switch-output documentation and comment
perf record: Add struct switch_output
perf record: Change switch-output option to take optional argument
perf record: Add switch-output size option argument
perf record: Add switch-output time option argument

tools/lib/subcmd/parse-options.c | 3 ++ tools/lib/subcmd/parse-options.h | 5 tools/perf/Documentation/perf-record.txt | 13 +++-- tools/perf/builtin-record.c | 134 -- 4 files changed, 139 insertions(+), 16 deletions(-)
coresight: Problem caused by resetting enable_sink
Hi Mathieu,

I am hitting problems caused by your commit d52c9750f150 ('coresight: reset "enable_sink" flag when need be'), not only the one I posted in the previous patch. My use case is a simple 'perf record -e cs_etm// ls'. It works properly before this commit and fails when allocating the aux buffer after it. I can't fully understand your code, but the problem I see appears to be caused by inappropriately resetting the sink.

My device is connected like this (it uses two ETFs):

Core0--+
Core1--+-- funnel0 --> etf0
Core2--|
Core3--+

Core0--+
Core1--+-- funnel1 --> etf1
Core2--|
Core3--+

Before running perf, both ETFs are activated through the sysfs enable_sink interface. During etm_setup_aux, coresight_get_enabled_sink() finds etf0 for core0 and automatically deactivates it. For core1, coresight_get_enabled_sink() then returns etf1. However, etf1 is not on the link path of core1, so the following coresight_build_path() fails.

I guess your commit is based on the assumption that all sinks are in the path of every core, like this:

Core0--+
Core1--+-- funnel0 --> etf0 --+
Core2--|                      |
Core3--+                      +-- replicator --+--- etr
Core0--+                      |                +--- etb
Core1--+-- funnel1 --> etf1 --+
Core2--|
Core3--+

But that is not true, at least for some HiSilicon boards. I have to revert your patch to make CoreSight on my board work. Please reconsider this patch, or please give some suggestion if you think I misunderstood it. Thank you.
Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf
On 2016/12/9 13:04, Wangnan (F) wrote: On 2016/12/9 10:46, Joe Stringer wrote: [SNIP] diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index 62d89d50fcbd..616bd55f3be8 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -149,6 +149,8 @@ CMD_TARGETS = $(LIB_FILE) TARGETS = $(CMD_TARGETS) +libbpf: all +

Why do we need this? I tested this patch without it and it seems to work, and this line causes an extra error:

$ pwd
/home/wn/kernel/tools/lib/bpf
$ make libbpf
...
gcc -g -Wall -DHAVE_LIBELF_MMAP_SUPPORT -DHAVE_ELF_GETPHDRNUM_SUPPORT -Wbad-function-cast -Wdeclaration-after-statement -Wformat-security -Wformat-y2k -Winit-self -Wmissing-declarations -Wmissing-prototypes -Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked -Wredundant-decls -Wshadow -Wstrict-aliasing=3 -Wstrict-prototypes -Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat -Werror -Wall -fPIC -I. -I/home/wn/kernel-hydrogen/tools/include -I/home/wn/kernel-hydrogen/tools/arch/x86/include/uapi -I/home/wn/kernel-hydrogen/tools/include/uapi libbpf.c all -o libbpf
gcc: error: all: No such file or directory
make: *** [libbpf] Error 1

Thank you.

It is not 'caused' by your patch. 'make libbpf' fails without your change because it tries to build an executable from libbpf.c, but main() is missing.

I think libbpf should never be used as a make target. Your new dependency looks strange. Thank you.
Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf
On 2016/12/9 10:46, Joe Stringer wrote: Now that libbpf under tools/lib/bpf/* is synced with the version from samples/bpf, we can get rid most of the libbpf library here. Signed-off-by: Joe Stringer --- v3: First post. --- samples/bpf/Makefile | 60 +- samples/bpf/README.rst | 4 +- samples/bpf/libbpf.c | 111 - samples/bpf/libbpf.h | 19 + tools/lib/bpf/Makefile | 2 + 5 files changed, 38 insertions(+), 158 deletions(-) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 72c58675973e..c8f7ed37b2de 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -29,35 +29,38 @@ hostprogs-y += trace_event hostprogs-y += sampleip hostprogs-y += tc_l2_redirect -test_verifier-objs := test_verifier.o libbpf.o -test_maps-objs := test_maps.o libbpf.o -sock_example-objs := sock_example.o libbpf.o -fds_example-objs := bpf_load.o libbpf.o fds_example.o -sockex1-objs := bpf_load.o libbpf.o sockex1_user.o -sockex2-objs := bpf_load.o libbpf.o sockex2_user.o -sockex3-objs := bpf_load.o libbpf.o sockex3_user.o -tracex1-objs := bpf_load.o libbpf.o tracex1_user.o -tracex2-objs := bpf_load.o libbpf.o tracex2_user.o -tracex3-objs := bpf_load.o libbpf.o tracex3_user.o -tracex4-objs := bpf_load.o libbpf.o tracex4_user.o -tracex5-objs := bpf_load.o libbpf.o tracex5_user.o -tracex6-objs := bpf_load.o libbpf.o tracex6_user.o -test_probe_write_user-objs := bpf_load.o libbpf.o test_probe_write_user_user.o -trace_output-objs := bpf_load.o libbpf.o trace_output_user.o -lathist-objs := bpf_load.o libbpf.o lathist_user.o -offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o -spintest-objs := bpf_load.o libbpf.o spintest_user.o -map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o -test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o -test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o -xdp1-objs := bpf_load.o libbpf.o xdp1_user.o +# Libbpf dependencies +LIBBPF := libbpf.o ../../tools/lib/bpf/bpf.o + +test_verifier-objs := test_verifier.o $(LIBBPF) 
+test_maps-objs := test_maps.o $(LIBBPF) +sock_example-objs := sock_example.o $(LIBBPF) +fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o +sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o +sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o +sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o +tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o +tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o +tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o +tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o +tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o +tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o +test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o +trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o +lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o +offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o +spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o +map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o +test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o +test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o +xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o # reuse xdp1 source intentionally -xdp2-objs := bpf_load.o libbpf.o xdp1_user.o -test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \ +xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o +test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) \ test_current_task_under_cgroup_user.o -trace_event-objs := bpf_load.o libbpf.o trace_event_user.o -sampleip-objs := bpf_load.o libbpf.o sampleip_user.o -tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o +trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o +sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o +tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o # Tell kbuild to always build the programs always := $(hostprogs-y) @@ -89,7 +92,7 @@ always += test_current_task_under_cgroup_kern.o always += trace_event_kern.o always += 
sampleip_kern.o

-HOSTCFLAGS += -I$(objtree)/usr/include
+HOSTCFLAGS += -I$(objtree)/usr/include -I$(objtree)/tools/lib/

Should use srctree like this:

+HOSTCFLAGS += -I$(objtree)/usr/include -I$(srctree)/tools/lib/

Or you will see the following failure when doing an off-tree build:

$ mkdir buildkernel
$ cd buildkernel
$ make -C ../ O=`pwd` menuconfig
$ make -j64
$ make samples/bpf/
  CHK     include/config/kernel.release
  Using .. as source for kernel
  GEN     ./Makefile
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHK     include/generated/bounds.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/asm-offsets.h
  CALL    ../scripts/checksyscalls.sh
  HOSTCC  samples/bpf/test_verifier.o
In file included from ../samples/bpf/test_verifier.c:20:0:
.
Re: [PATCHv3 perf/core 3/7] tools lib bpf: Add flags to bpf_create_map()
On 2016/12/9 10:46, Joe Stringer wrote: The map_flags argument to bpf_create_map() was previously not exposed. By exposing it, users can access flags such as whether or not to preallocate the map. Signed-off-by: Joe Stringer

Please mention commit 6c90598174322b029e40dd84a4eb01f56afe in the commit message:

Commit 6c905981743 ("bpf: pre-allocate hash map elements") introduces map_flags to bpf_attr for the BPF_MAP_CREATE command. Expose this new parameter in libbpf.

Acked-by: Wang Nan

---
v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf".
---
 tools/lib/bpf/bpf.c    | 3 ++-
 tools/lib/bpf/bpf.h    | 2 +-
 tools/lib/bpf/libbpf.c | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 89e8e8e5b60e..d0afb26c2e0f 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -54,7 +54,7 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
 }

 int bpf_create_map(enum bpf_map_type map_type, int key_size,
-		   int value_size, int max_entries)
+		   int value_size, int max_entries, __u32 map_flags)
 {
 	union bpf_attr attr;

@@ -64,6 +64,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
 	attr.key_size = key_size;
 	attr.value_size = value_size;
 	attr.max_entries = max_entries;
+	attr.map_flags = map_flags;

 	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 61130170a6ad..7fcdce16fd62 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -24,7 +24,7 @@
 #include

 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-		   int max_entries);
+		   int max_entries, __u32 map_flags);

 /* Recommend log buffer size */
 #define BPF_LOG_BUF_SIZE 65536

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2e974593f3e8..84e6b35da4bd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -854,7 +854,8 @@ bpf_object__create_maps(struct bpf_object *obj)
 		*pfd = bpf_create_map(def->type,
 				      def->key_size,
 				      def->value_size,
-				      def->max_entries);
+				      def->max_entries,
+				      0);
 		if (*pfd < 0) {
 			size_t j;
 			int err = *pfd;
Re: [PATCHv3 perf/core 2/7] tools lib bpf: use __u32 from linux/types.h
On 2016/12/9 10:46, Joe Stringer wrote: Fixes the following issue when building without access to 'u32' type:

./tools/lib/bpf/bpf.h:27:23: error: unknown type name ‘u32’

Signed-off-by: Joe Stringer
---
v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf"
---
 tools/lib/bpf/bpf.c | 4 ++--
 tools/lib/bpf/bpf.h | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

Acked-by: Wang Nan

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 8143536b462a..89e8e8e5b60e 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -70,7 +70,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,

 int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 		     size_t insns_cnt, char *license,
-		     u32 kern_version, char *log_buf, size_t log_buf_sz)
+		     __u32 kern_version, char *log_buf, size_t log_buf_sz)
 {
 	int fd;
 	union bpf_attr attr;

@@ -98,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 }

 int bpf_map_update_elem(int fd, void *key, void *value,
-			u64 flags)
+			__u64 flags)
 {
 	union bpf_attr attr;

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 253c3dbb06b4..61130170a6ad 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -30,11 +30,11 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
 #define BPF_LOG_BUF_SIZE 65536

 int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 		     size_t insns_cnt, char *license,
-		     u32 kern_version, char *log_buf,
+		     __u32 kern_version, char *log_buf,
 		     size_t log_buf_sz);
 int bpf_map_update_elem(int fd, void *key, void *value,
-			u64 flags);
+			__u64 flags);
 int bpf_map_lookup_elem(int fd, void *key, void *value);
 int bpf_map_delete_elem(int fd, void *key);
Re: [PATCH v4 01/18] perf build: Check LLVM version in feature check
On 2016/12/6 15:13, Wang Nan wrote: Cancel builtin llvm and clang support when the LLVM version is less than 3.9.0: the following commits use newer APIs. Since Clang/LLVM's API is not guaranteed to be stable, add a test-llvm-version.cpp feature checker and issue a warning if the LLVM found in the compiling environment has not been tested yet. Signed-off-by: Wang Nan Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: He Kuang Cc: Jiri Olsa Cc: Joe Stringer Cc: Zefan Li Cc: pi3or...@163.com
---
 tools/build/feature/Makefile              | 8 ++--
 tools/build/feature/test-llvm-version.cpp | 12
 tools/build/feature/test-llvm.cpp         | 5 +
 tools/perf/Makefile.config                | 8 ++--
 4 files changed, 29 insertions(+), 4 deletions(-)
 create mode 100644 tools/build/feature/test-llvm-version.cpp

[SNIP]

diff --git a/tools/build/feature/test-llvm-version.cpp b/tools/build/feature/test-llvm-version.cpp
new file mode 100644
index 000..e86b642
--- /dev/null
+++ b/tools/build/feature/test-llvm-version.cpp
@@ -0,0 +1,12 @@
+#include
+#include "llvm/Config/llvm-config.h"
+
+#define NUM_VERSION (((LLVM_VERSION_MAJOR) << 16) + (LLVM_VERSION_MINOR << 8) + LLVM_VERSION_PATCH)
+#define pass int main() {printf("%x\n", NUM_VERSION); return 0;}
+
+#if NUM_VERSION >= 0x030900
+pass
+#else
+# error This LLVM is not tested yet.
+#endif
+

Sorry for this blank line. Will resend.
Re: [PATCH v3 10/30] perf clang: Add builtin clang support and test case
On 2016/12/6 5:48, Arnaldo Carvalho de Melo wrote: Em Mon, Dec 05, 2016 at 07:02:48PM -0200, Arnaldo Carvalho de Melo escreveu: Em Mon, Dec 05, 2016 at 08:51:01AM -0800, Alexei Starovoitov escreveu: yeah. it's kinda high. I'm guessing rpm llvm libs are in debug mode. Try llvm-config --build-mode --assertion-mode it should be Release OFF Probably this was with 3.9 and built from git, quite a while ago, now I removed it from /usr/local/ and installed what is in f25, but I fear it will be insufficient, does 3.8 cut it for what we're testing? Humm, it looks like it will:

[root@jouet ~]# llc --version
LLVM (http://llvm.org/):
  LLVM version 3.8.0

But I'm now running the container based tests to send a pull req, will check later, after that. Not really, Wang, we need to update that feature detection test to state what is the minimum required LLVM/clang version, one that has those functions, which, unfortunately, isn't the one in the latest fedora, fedora 25: I'll set the minimum required LLVM version to 3.9, and report a warning when LLVM is too old. However, since the LLVM interface keeps changing, we will eventually have problems if we want to support 2 or 3 different clang/LLVM versions. We should keep raising the minimum required LLVM version if we don't want to see '#ifdef's spread in our code. Thank you.
Re: [PATCH v3 10/30] perf clang: Add builtin clang support and test case
On 2016/12/5 10:36, Wangnan (F) wrote: On 2016/12/2 23:44, Arnaldo Carvalho de Melo wrote: Em Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan escreveu: Add basic clang support in clang.cpp and test__clang() testcase. The first testcase checks if builtin clang is able to generate LLVM IR. tests/clang.c is a proxy. The real testcase resides in utils/c++/clang-test.cpp in C++ and exports a C interface to the perf test subsystem. Test result:

$ perf test -v clang
51: Test builtin clang support :
51.1: Test builtin clang compile C source to IR :
--- start ---
test child forked, pid 13215
test child finished with 0
---- end ----
Test builtin clang support subtest 0: Ok

While testing this I noticed that the perf binary got huge, can't this be done in some other way, i.e. using a dynamic library? I intentionally use static linking because it is good for smartphones: we can simply 'adb push' a statically linked perf to Android. The resulting ELF executable would be even larger if LLVM is built with its default settings. In my setting the resulting 'perf' is less than 60MB:

$ ls -s ~/perf -h
58M /home/wn/perf
$ size ~/perf
   text    data     bss     dec     hex filename
569312732950808241086328399071350198b9 /home/wn/perf

It is reasonable for me. I think using dynamic clang and llvm libraries is possible but I have never tried it before. It depends on how LLVM is compiled. I think if the distro provides shared libraries then perf can utilize them automatically. Let me check it today. Good news: it works. To enable llvm and clang dynamic libraries, we need to build llvm and clang with the cmake option -DBUILD_SHARED_LIBS=ON.
Then apply a trivial patch to perf:

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index dfb20dd..e3054f3 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -346,7 +346,7 @@ LIBS = -Wl,--whole-archive $(PERFLIBS) -Wl,--no-whole-archive -Wl,--start-group
 ifeq ($(USE_CLANG), 1)
   CLANGLIBS_LIST = AST Basic CodeGen Driver Frontend Lex Tooling Edit Sema Analysis Parse Serialization
-  LIBCLANG = $(foreach l,$(CLANGLIBS_LIST),$(wildcard $(shell $(LLVM_CONFIG) --libdir)/libclang$(l).a))
+  LIBCLANG = $(foreach l,$(CLANGLIBS_LIST),$(wildcard $(shell $(LLVM_CONFIG) --libdir)/libclang$(l).so))
   LIBS += -Wl,--start-group $(LIBCLANG) -Wl,--end-group
 endif

(replace '.a' with '.so')

Resulting perf executable:

$ ls -s -h ~/perf
26M /home/wn/perf
$ size ~/perf
   text    data     bss     dec     hex filename
4274339 75503223959984289893551ba57ab /home/wn/perf
$ strip ~/perf
$ ls -s -h
4.9M /home/wn/perf
$ ldd ~/perf | grep 'LLVM\|clang'
libclangBasic.so => /tmp/oxygen_root/usr/lib64/libclangBasic.so (0x7f24da49f000)
libclangCodeGen.so => /tmp/oxygen_root/usr/lib64/libclangCodeGen.so (0x7f24d9f72000)
libclangFrontend.so => /tmp/oxygen_root/usr/lib64/libclangFrontend.so (0x7f24d9c64000)
libclangTooling.so => /tmp/oxygen_root/usr/lib64/libclangTooling.so (0x7f24d9a01000)
libLLVMBPFCodeGen.so => /tmp/oxygen_root/usr/lib64/libLLVMBPFCodeGen.so (0x7f24d97e2000)
libLLVMBPFDesc.so => /tmp/oxygen_root/usr/lib64/libLLVMBPFDesc.so (0x7f24d95da000)

As I said, if the distro provides dynamic libraries then everything would be fine. If you are okay with the above conclusion, in the next version I'll use '-lclangBasic' style linking options so it works for both dynamic and static LLVM/clang, then let's wait for the distros' action. Thank you.
Re: [PATCH v3 10/30] perf clang: Add builtin clang support and test case
On 2016/12/2 23:44, Arnaldo Carvalho de Melo wrote: Em Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan escreveu: Add basic clang support in clang.cpp and test__clang() testcase. The first testcase checks if builtin clang is able to generate LLVM IR. tests/clang.c is a proxy. The real testcase resides in utils/c++/clang-test.cpp in C++ and exports a C interface to the perf test subsystem. Test result:

$ perf test -v clang
51: Test builtin clang support :
51.1: Test builtin clang compile C source to IR :
--- start ---
test child forked, pid 13215
test child finished with 0
---- end ----
Test builtin clang support subtest 0: Ok

While testing this I noticed that the perf binary got huge, can't this be done in some other way, i.e. using a dynamic library? I intentionally use static linking because it is good for smartphones: we can simply 'adb push' a statically linked perf to Android. The resulting ELF executable would be even larger if LLVM is built with its default settings. In my setting the resulting 'perf' is less than 60MB:

$ ls -s ~/perf -h
58M /home/wn/perf
$ size ~/perf
   text    data     bss     dec     hex filename
569312732950808241086328399071350198b9 /home/wn/perf

It is reasonable for me. I think using dynamic clang and llvm libraries is possible but I have never tried it before. It depends on how LLVM is compiled. I think if the distro provides shared libraries then perf can utilize them automatically. Let me check it today. About the file size, I discussed with Alexei, and he taught me a lot about it. Maybe he or his ioVisor ex-colleagues can provide some new ideas? Thank you.
Re: [PATCH v3 14/30] perf clang: Support compile IR to BPF object and add testcase
On 2016/11/28 14:32, Wangnan (F) wrote: On 2016/11/27 1:25, Alexei Starovoitov wrote: On Sat, Nov 26, 2016 at 07:03:38AM +, Wang Nan wrote: getBPFObjectFromModule() is introduced to compile LLVM IR(Module) to BPF object. Add new testcase for it. Test result:

$ ./buildperf/perf test -v clang
51: Test builtin clang support :
51.1: Test builtin clang compile C source to IR :
--- start ---
test child forked, pid 21822
test child finished with 0
---- end ----
Test builtin clang support subtest 0: Ok
51.2: Test builtin clang compile C source to ELF object :
--- start ---
test child forked, pid 21823
test child finished with 0
---- end ----
Test builtin clang support subtest 1: Ok

Signed-off-by: Wang Nan ...

+legacy::PassManager PM;
+if (TargetMachine->addPassesToEmitFile(PM, ostream,
+                                       TargetMachine::CGFT_ObjectFile)) {
+	llvm::errs() << "TargetMachine can't emit a file of this type\n";
+	return std::unique_ptr>(nullptr);;
+}
+PM.run(*Module);

I'm pretty sure you want to add FunctionInliningPass as well otherwise I think llvm won't be doing much inlining and only very very simple programs will compile fine. See what we did on the bcc side.

Thank you for your information. I thought inlining should be done during the C-to-IR phase, and we have used -O2 for it. Let me check it. I did a simple test. It seems even without FunctionInliningPass clang/llvm can inline static functions with no problem. For example, in the sample code in the cover letter, extract a static function like this:

static void inc_counter(u64 id)
{
	u64 *counter;

	counter = bpf_map_lookup_elem(&syscall_counter, &id);
	if (!counter) {
		u64 value = 1;

		bpf_map_update_elem(&syscall_counter, &id, &value, 0);
		return;
	}
	__sync_fetch_and_add(counter, 1);
	return;
}

Then enable llvm.dump-obj = true in ~/.perfconfig so we can see the resulting ELF object. The script worked correctly.
readelf reports:

$ readelf -a ./count_syscalls.o | grep inc_counter
$

Inserting output commands into PerfModule::prepareBPF and PerfModule::prepareJIT to print the names of functions, I can't see inc_counter. Then remove -O2 from the cflags in createCompilerInvocation. Result:

# ./perf record -e ./count_syscalls.c -a sleep 1
LLVM ERROR: Cannot select: t38: ch,glue = BPFISD::CALL t37, t31, Register:i64 %R1, Register:i64 %R2, t37:1
  t31: i64,ch = load t51, t58, undef:i64
    t58: i64 = BPFISD::Wrapper TargetGlobalAddress:i64@bpf_map_lookup_elem> 0
      t57: i64 = TargetGlobalAddress@bpf_map_lookup_elem> 0
    t5: i64 = undef
  t34: i64 = Register %R1
  t36: i64 = Register %R2
  t37: ch,glue = CopyToReg t35, Register:i64 %R2, FrameIndex:i64<5>, t35:1
    t36: i64 = Register %R2
    t8: i64 = FrameIndex<5>
    t35: ch,glue = CopyToReg t33, Register:i64 %R1, t56
      t34: i64 = Register %R1
      t56: i64 = BPFISD::Wrapper TargetGlobalAddress:i64<%struct.bpf_map_def* @GVALS> 0
        t55: i64 = TargetGlobalAddress<%struct.bpf_map_def* @GVALS> 0
In function: func

I don't know whether -O2 implies inlining. In bcc, you not only use FunctionInlining, but also add AlwaysInlinerPass and use populateModulePassManager to append other optimizations. I tried to mimic your code, but it seems the perfhook functions are optimized out by some optimization added by populateModulePassManager. Although not quite clear, I'll make the following change. Please help me check it. Thank you.
diff --git a/tools/perf/util/c++/clang.cpp b/tools/perf/util/c++/clang.cpp
index d05ab6f..d6d1959 100644
--- a/tools/perf/util/c++/clang.cpp
+++ b/tools/perf/util/c++/clang.cpp
@@ -22,6 +22,8 @@
 #include "llvm/Support/TargetSelect.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
+#include "llvm-c/Transforms/IPO.h"
+#include "llvm/Transforms/IPO.h"
 #include
 #include "clang.h"
@@ -133,6 +135,13 @@ getBPFObjectFromModule(llvm::Module *Module)
 	raw_svector_ostream ostream(*Buffer);

 	legacy::PassManager PM;
+
+	PM.add(createFunctionInliningPass());
+	/*
+	 * LLVM is changing its interface. Use a stable workaround.
+	 */
+	LLVMAddAlwaysInlinerPass(reinterpret_cast(&PM));
+
 	if (TargetMachine->addPassesToEmitFile(PM, ostream,
 					       TargetMachine::CGFT_ObjectFile)) {
 		llvm::errs() << "TargetMachine can't emit a file of this type\n";
Re: [PATCH v3 10/30] perf clang: Add builtin clang support and test case
On 2016/11/27 1:17, Alexei Starovoitov wrote: On Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan wrote: Add basic clang support in clang.cpp and test__clang() testcase. The first testcase checks if builtin clang is able to generate LLVM IR. tests/clang.c is a proxy. The real testcase resides in utils/c++/clang-test.cpp in C++ and exports a C interface to the perf test subsystem. Test result:

$ perf test -v clang
51: Test builtin clang support :
51.1: Test builtin clang compile C source to IR :
--- start ---
test child forked, pid 13215
test child finished with 0
---- end ----
Test builtin clang support subtest 0: Ok

Signed-off-by: Wang Nan ...

+static CompilerInvocation *
+createCompilerInvocation(StringRef& Path, DiagnosticsEngine& Diags)
+{
+	llvm::opt::ArgStringList CCArgs {
+		"-cc1",
+		"-triple", "bpf-pc-linux",
+		"-fsyntax-only",
+		"-ferror-limit", "19",
+		"-fmessage-length", "127",

why such limits?

+		"-O2",
+		"-nostdsysteminc",
+		"-nobuiltininc",
+		"-vectorize-loops",
+		"-vectorize-slp",

Thank you for pointing these out. These arguments were obtained by analysing the clang example: https://llvm.org/svn/llvm-project/cfe/branches/ggreif/CallInst-operands/examples/clang-interpreter/main.cpp The above example creates a C compiler using clang::driver::Driver (bcc also uses the driver). I formed the argument list according to the arglist the driver created for its CI, and left the arguments I'm not quite sure about unchanged.

why above two flags are needed?

+		"-Wno-unused-value",
+		"-Wno-pointer-sign",

these two -Wno make sense. please add the comment to explain the reasons.

They are inherited from samples/bpf/Makefile to suppress some warnings when including kernel headers. Thank you.
Re: [PATCH v3 20/30] perf clang jit: add PerfModule::doJIT to JIT perfhook functions
On 2016/11/27 1:29, Alexei Starovoitov wrote: On Sat, Nov 26, 2016 at 07:03:44AM +, Wang Nan wrote: PerfModule::doJIT JIT-compiles perfhook functions and saves the result into a map. Add a test case for it. At this stage perfhook functions can do no useful things because they can't invoke external functions and can't return values. Following commits are going to make improvements. Don't hook functions right after they are jitted, because bpf_object is unavailable during jitting but it should be the context of the jitted functions. Signed-off-by: Wang Nan ...

+	for (Function *F : JITFunctions) {
+		JITSymbol sym = CompileLayer.findSymbol(F->getName().str(), true);
+
+		/*
+		 * Type of F->getSection() is moving from
+		 * const char * to StringRef.
+		 * Convert it to std::string so we don't need
+		 * to consider this API change.
+		 */
+		std::string sec(F->getSection());
+		std::string hook(&sec.c_str()[sizeof("perfhook:") - 1]);
+		perf_hook_func_t func = (perf_hook_func_t)(intptr_t)sym.getAddress();
+
+		if (JITResult[hook])
+			llvm::errs() << "Warning: multiple functions on hook "
+				     << hook << ", only one is used\n";
+		JITResult[hook] = func;
+	}
+	return 0;
+}
+
 class ClangOptions {
 	llvm::SmallString FileName;
 	llvm::SmallString<64> KVerDef;
@@ -292,6 +359,10 @@ void perf_clang__init(void)
 	LLVMInitializeBPFTarget();
 	LLVMInitializeBPFTargetMC();
 	LLVMInitializeBPFAsmPrinter();
+
+	llvm::InitializeNativeTarget();
+	llvm::InitializeNativeTargetAsmPrinter();
+	llvm::InitializeNativeTargetAsmParser();

Looks great. I bet a lot of people reading perf code won't be able to understand what you're doing here. Could you please add a design doc on how the perf<->clang/llvm interaction is done. I'll put a perf-clang.txt in tools/perf/Documentation. Thank you.

Acked-by: Alexei Starovoitov
Re: [PATCH v3 14/30] perf clang: Support compile IR to BPF object and add testcase
On 2016/11/27 1:25, Alexei Starovoitov wrote: On Sat, Nov 26, 2016 at 07:03:38AM +, Wang Nan wrote: getBPFObjectFromModule() is introduced to compile LLVM IR(Module) to BPF object. Add new testcase for it. Test result:

$ ./buildperf/perf test -v clang
51: Test builtin clang support :
51.1: Test builtin clang compile C source to IR :
--- start ---
test child forked, pid 21822
test child finished with 0
---- end ----
Test builtin clang support subtest 0: Ok
51.2: Test builtin clang compile C source to ELF object :
--- start ---
test child forked, pid 21823
test child finished with 0
---- end ----
Test builtin clang support subtest 1: Ok

Signed-off-by: Wang Nan ...

+	legacy::PassManager PM;
+	if (TargetMachine->addPassesToEmitFile(PM, ostream,
+					       TargetMachine::CGFT_ObjectFile)) {
+		llvm::errs() << "TargetMachine can't emit a file of this type\n";
+		return std::unique_ptr>(nullptr);;
+	}
+	PM.run(*Module);

I'm pretty sure you want to add FunctionInliningPass as well otherwise I think llvm won't be doing much inlining and only very very simple programs will compile fine. See what we did on the bcc side.

Thank you for your information. I thought inlining should be done during the C-to-IR phase, and we have used -O2 for it. Let me check it.

Also did you consider skipping elf generation and using in memory instead? That will improve compile/run time.

Maybe in future work? The current design reuses libelf and bpf-loader for loading BPF code. Skipping ELF generation makes compiling faster, but requires another loader to coexist with bpf-loader.c, or we need a new libbpf. In my experience so far, outputting ELF is not that slow on either servers or smartphones. Thank you.
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
On 2016/11/17 10:46, Joe Stringer wrote: On 16 November 2016 at 18:10, Wangnan (F) wrote: I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below. Ah, I missed this, my apologies. It looks like it will provide much of what I need, I can reassess this patch with your series in mind. One comment though for your patch (I don't have the original thread to respond to unfortunately): The map_pin and map_get functions in your patch series can be used to pin progs too, so maybe there is a better name? You'll see that this patch uses bpf_obj_{pin,get}() - although I wouldn't want those to be confused with the libbpf.c objects so maybe there's a clearer name that could be used. I also have some patches to rework the samples/bpf/* code to use libbpf instead of the sample code that is there, is it worth me submitting that? It will need to wait for your patch to go in, plus a merge with davem's tree. I'm happy to see you are trying to replace samples/bpf's own libbpf with tools/lib/bpf. I think you can submit your work on libbpf and your patches on samples together if they are ready. Arnaldo can pick up the patches that look good to him, and you can improve the other patches based on his newest branch. Thank you.
Re: [PATCH 05/34] tools lib bpf: Add missing bpf map functions
On 2016/11/15 12:05, Wang Nan wrote: Add more BPF map operations to libbpf. Signed-off-by: Wang Nan Cc: Alexei Starovoitov Cc: Arnaldo Carvalho de Melo Cc: Li Zefan
---
 tools/lib/bpf/bpf.c | 56 +
 tools/lib/bpf/bpf.h |  7 +++
 2 files changed, 63 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed6..e966248 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -110,3 +110,59 @@ int bpf_map_update_elem(int fd, void *key, void *value,

 	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
 }
+
+int bpf_map_lookup_elem(int fd, void *key, void *value)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = ptr_to_u64(key);
+	attr.value = ptr_to_u64(value);
+
+	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map_delete_elem(int fd, void *key)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = ptr_to_u64(key);
+
+	return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map_get_next_key(int fd, void *key, void *next_key)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.map_fd = fd;
+	attr.key = ptr_to_u64(key);
+	attr.next_key = ptr_to_u64(next_key);
+
+	return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
+
+int bpf_map_pin(int fd, const char *pathname)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.pathname = ptr_to_u64((void *)pathname);
+	attr.bpf_fd = fd;
+
+	return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
+}
+
+int bpf_map_get(const char *pathname)
+{
+	union bpf_attr attr;
+
+	bzero(&attr, sizeof(attr));
+	attr.pathname = ptr_to_u64((void *)pathname);
+
+	return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
+}

bpf_map_{pin,get} should be renamed to bpf_obj_{pin,get} since they can be used on BPF programs. Thanks to Joe Stringer.
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba540..5b3e52b 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,4 +35,11 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,

 int bpf_map_update_elem(int fd, void *key, void *value,
 			u64 flags);
+
+int bpf_map_lookup_elem(int fd, void *key, void *value);
+int bpf_map_delete_elem(int fd, void *key);
+int bpf_map_get_next_key(int fd, void *key, void *next_key);
+int bpf_map_pin(int fd, const char *pathname);
+int bpf_map_get(const char *pathname);
+
 #endif
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
On 2016/11/17 10:46, Joe Stringer wrote: On 16 November 2016 at 18:10, Wangnan (F) wrote: I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below. Ah, I missed this, my apologies. It looks like it will provide much of what I need, I can reassess this patch with your series in mind. One comment though for your patch (I don't have the original thread to respond to unfortunately): The map_pin and map_get functions in your patch series can be used to pin progs too, so maybe there is a better name? You'll see that this patch uses bpf_obj_{pin,get}() - although I wouldn't want those to be confused with the libbpf.c objects so maybe there's a clearer name that could be used. The full thread can be found here: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1272045.html (lkml.kernel.org is not working for me, sorry). In that patch set, bpf_map_pin/get is linked into perf hooks, so a BPF script can pin a map to sysfs. I think this feature would be useful, but I don't have an example to show how to use it. I didn't provide a program pinning/get interface because in perf hooks they are not useful. After rethinking your suggestion, I now think it is okay to provide bpf_obj_{pin,get} in bpf.c and export bpf_map_pin to perf hooks. I'll adjust my own patch. I also have some patches to rework the samples/bpf/* code to use libbpf instead of the sample code that is there, is it worth me submitting that? It will need to wait for your patch to go in, plus a merge with davem's tree.
On 2016/11/17 1:43, Joe Stringer wrote: [SNIP]

 /*
@@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
 	return syscall(__NR_bpf, cmd, attr, size);
 }

-int bpf_create_map(enum bpf_map_type map_type, int key_size,
-		   int value_size, int max_entries)
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+		   int max_entries, int map_flags)
 {
-	union bpf_attr attr;
+	union bpf_attr attr = {
+		.map_type = map_type,
+		.key_size = key_size,
+		.value_size = value_size,
+		.max_entries = max_entries,
+		.map_flags = map_flags,
+	};

-	memset(&attr, '\0', sizeof(attr));
+	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}

I lost map_flags in the original bpf.c. Thanks to your patch. map_flags is useful when creating a BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is meaningful in this case. Do you want me to resubmit this piece as a separate patch or will you address this? Please send it. Thank you.
Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf
I'm also working on improving bpf.c. Please have a look at: https://lkml.org/lkml/2016/11/14/1078 Since bpf.c is simple, I think we can add more functions and fixes gradually, instead of a full copy. See my inline comment below.

On 2016/11/17 1:43, Joe Stringer wrote: Extend the tools/ version of libbpf to include all of the functionality provided in the samples/bpf version. Signed-off-by: Joe Stringer
---
v2: Don't shift non-bpf changes across. Various type cleanups, removal of extraneous declarations
---
 tools/lib/bpf/bpf.c    | 107 --
 tools/lib/bpf/bpf.h    | 202 +++--
 tools/lib/bpf/libbpf.c |   3 +-
 3 files changed, 279 insertions(+), 33 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed62235b..5e061851ac00 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -20,10 +20,17 @@
  */
 #include
-#include
+#include
 #include
 #include
+#include
+#include
 #include
+#include
+#include
+#include
+#include
+#include
 #include "bpf.h"

Why do we need these network-related headers?

 /*
@@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
 	return syscall(__NR_bpf, cmd, attr, size);
 }

-int bpf_create_map(enum bpf_map_type map_type, int key_size,
-		   int value_size, int max_entries)
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+		   int max_entries, int map_flags)
 {
-	union bpf_attr attr;
+	union bpf_attr attr = {
+		.map_type = map_type,
+		.key_size = key_size,
+		.value_size = value_size,
+		.max_entries = max_entries,
+		.map_flags = map_flags,
+	};

-	memset(&attr, '\0', sizeof(attr));
+	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}

I lost map_flags in the original bpf.c. Thanks to your patch. map_flags is useful when creating a BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is meaningful in this case. Although it is okay in samples, I still prefer an explicit bzero() or memset(), because the kernel checks that unused fields in this union are zero.
However, I'll check the C standard to see how unused fields are initialized.

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba54087497..4dba36995771 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -23,16 +23,202 @@
 #include

+struct bpf_insn;
+
 int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-		   int max_entries);
+		   int max_entries, int map_flags);
+int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
+int bpf_lookup_elem(int fd, void *key, void *value);
+int bpf_delete_elem(int fd, void *key);
+int bpf_get_next_key(int fd, void *key, void *next_key);
+
+int bpf_load_program(enum bpf_prog_type prog_type,
+		     const struct bpf_insn *insns, int insn_len,
+		     const char *license, int kern_version,
+		     char *log_buf, size_t log_buf_sz);
+
+int bpf_obj_pin(int fd, const char *pathname);
+int bpf_obj_get(const char *pathname);

-/* Recommend log buffer size */
 #define BPF_LOG_BUF_SIZE 65536

-int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
-		     size_t insns_cnt, char *license,
-		     u32 kern_version, char *log_buf,
-		     size_t log_buf_sz);
-int bpf_map_update_elem(int fd, void *key, void *value,
-			u64 flags);
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)\
+	((struct bpf_insn) {\
+		.code = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
+		.dst_reg = DST, \
+		.src_reg = SRC, \
+		.off = 0, \
+		.imm = 0 })
+

Should we define these macros here? They are in include/linux/filter.h and duplicated in tools/include/linux/filter.h. Redefining them here would cause conflicts. Thank you.
Re: [PATCHv2 perf/core 1/2] tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h
On 2016/11/17 1:43, Joe Stringer wrote: The tools version of this header is out of date; update it to the latest version from the kernel headers. Signed-off-by: Joe Stringer --- v2: No change. --- tools/include/uapi/linux/bpf.h | 51 ++ 1 file changed, 51 insertions(+) Acked-by: Wang Nan diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9e5fc168c8a3..f09c70b97eca 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -95,6 +95,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, + BPF_PROG_TYPE_PERF_EVENT, }; #define BPF_PSEUDO_MAP_FD 1 @@ -375,6 +376,56 @@ enum bpf_func_id { */ BPF_FUNC_probe_write_user, + /** +* bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership of current task +* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type +* @index: index of the cgroup in the bpf_map +* Return: +* == 0 current failed the cgroup2 descendant test +* == 1 current succeeded the cgroup2 descendant test +*< 0 error +*/ + BPF_FUNC_current_task_under_cgroup, + + /** +* bpf_skb_change_tail(skb, len, flags) +* The helper will resize the skb to the given new size, +* to be used f.e. with control messages. +* @skb: pointer to skb +* @len: new skb length +* @flags: reserved +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_change_tail, + + /** +* bpf_skb_pull_data(skb, len) +* The helper will pull in non-linear data in case the +* skb is non-linear and not all of len are part of the +* linear section. Only needed for read/write with direct +* packet access. +* @skb: pointer to skb +* @len: len to make read/writeable +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_pull_data, + + /** +* bpf_csum_update(skb, csum) +* Adds csum into skb->csum in case of CHECKSUM_COMPLETE. 
+* @skb: pointer to skb +* @csum: csum to add +* Return: csum on success or negative error +*/ + BPF_FUNC_csum_update, + /** +* bpf_set_hash_invalid(skb) +* Invalidate current skb->hash. +* @skb: pointer to skb +*/ + BPF_FUNC_set_hash_invalid, + __BPF_FUNC_MAX_ID, };
Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support
On 2016/11/15 13:21, Alexei Starovoitov wrote: On Mon, Nov 14, 2016 at 9:03 PM, Wangnan (F) wrote: On 2016/11/15 12:57, Alexei Starovoitov wrote: On Mon, Nov 14, 2016 at 8:05 PM, Wang Nan wrote: This is version 2 of the perf builtin clang patch series. Compared to v1, it adds an exciting feature: JIT-compiling perf hook functions. This feature allows script writers to report results through BPF maps in a customized way. looks great. SEC("perfhook:record_start") void record_start(void *ctx) { int perf_pid = getpid(), key = G_perf_pid; printf("Start count, perfpid=%d\n", perf_pid); jit_helper__map_update_elem(ctx, &GVALS, &key, &perf_pid, 0); the name, I think, is too verbose. Why not to keep them as bpf_map_update_elem even for user space programs? I can make it shorter by giving it a better name or using a wrapper like BPF_MAP(update_elem) the macro isn't pretty, since function calls won't look like calls. but the only thing I can't do is to make perfhook and in-kernel script use a uniform name for these bpf_map functions, because bpf_map_update_elem is already defined: "static long (*bpf_map_update_elem)(void *, void *, void *, unsigned long) = (void *)2;\n" right. i guess you could have #ifdef it, so it's different for bpf backend and for native. Then the '.c' -> LLVM IR compilation would have to be done twice, once for BPF and once for JIT, to make the macro work. In the current implementation we have only one LLVM IR. It is faster and makes sure the data layout (the "maps" section) is identical. Another alternative is to call it map_update_elem or map_update or bpf_map_update. Something shorter is already a win. 'jit_helper__' prefix is an implementation detail. The users don't need to know and don't need to spell it out everywhere. Good. Let's choose better names for them. Thank you.
Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support
On 2016/11/15 12:57, Alexei Starovoitov wrote: On Mon, Nov 14, 2016 at 8:05 PM, Wang Nan wrote: This is version 2 of the perf builtin clang patch series. Compared to v1, it adds an exciting feature: JIT-compiling perf hook functions. This feature allows script writers to report results through BPF maps in a customized way. looks great. SEC("perfhook:record_start") void record_start(void *ctx) { int perf_pid = getpid(), key = G_perf_pid; printf("Start count, perfpid=%d\n", perf_pid); jit_helper__map_update_elem(ctx, &GVALS, &key, &perf_pid, 0); the name, I think, is too verbose. Why not to keep them as bpf_map_update_elem even for user space programs? I can make it shorter by giving it a better name or using a wrapper like BPF_MAP(update_elem) but the only thing I can't do is to make perfhook and in-kernel script use a uniform name for these bpf_map functions, because bpf_map_update_elem is already defined: "static long (*bpf_map_update_elem)(void *, void *, void *, unsigned long) = (void *)2;\n" SEC("perfhook:record_end") void record_end(void *ctx) { u64 key = -1, value; while (!jit_helper__map_get_next_key(ctx, &syscall_counter, &key, &key)) { jit_helper__map_lookup_elem(ctx, &syscall_counter, &key, &value); printf("syscall %ld\tcount: %ld\n", (long)key, (long)value); this loop will be less verbose as well.
Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support
On 2016/11/15 12:05, Wang Nan wrote: $ sudo -s # ulimit -l unlimited # perf record -e ./count_syscalls.c echo "Haha" Start count, perfpid=25209 Haha [ perf record: Woken up 1 times to write data ] syscall 8 count: 6 syscall 11 count: 1 syscall 4 count: 6 syscall 21 count: 1 syscall 5 count: 3 syscall 231 count: 1 syscall 45 count: 3 syscall 0 count: 24 syscall 257 count: 1 syscall 59 count: 4 syscall 23 count: 9 syscall 78 count: 2 syscall 41 count: 4 syscall 72 count: 8 syscall 10 count: 3 syscall 321 count: 1 syscall 298 count: 7 syscall 16 count: 21 syscall 9 count: 16 syscall 1 count: 114 syscall 12 count: 3 syscall 14 count: 35 syscall 158 count: 1 syscall 2 count: 15 syscall 7 count: 18 syscall 3 count: 11 [ perf record: Captured and wrote 0.011 MB perf.data ] Note that this example counts a system-wide syscall histogram, not only the 'echo' process. The in-kernel BPF script doesn't know the pid of 'echo', so it can't filter based on it. I'm planning to add more perf hook points to pass information like this. Thank you.
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
Hi Eric, while testing this patch I found a segfault; please see the inline comment. In addition, since both the BPF map array and the map names should be built after the symbol table is collected, merging bpf_object__init_maps and bpf_object__init_maps_name is good practice and makes the code simpler. So I prepared a new patch. Please have a look at: http://lkml.kernel.org/g/20161108215734.28905-1-wangn...@huawei.com The new version ensures no crash in any case where the user provides a corrupted maps section, including arrays of bpf maps, maps with different definition structures and very short map definitions. Thank you. On 2016/10/16 14:18, Eric Leblond wrote: It is not correct to assimilate the elf data of the maps section to an array of map definitions. In fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF with two maps not to load correctly. Signed-off-by: Eric Leblond --- tools/lib/bpf/libbpf.c | 50 +++--- 1 file changed, 35 insertions(+), 15 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 1fe4532..f72628b 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -186,6 +186,7 @@ struct bpf_program { struct bpf_map { int fd; char *name; + size_t offset; struct bpf_map_def def; void *priv; bpf_map_clear_priv_t clear_priv; @@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, pr_debug("maps in %s: %zd bytes\n", obj->path, size); - obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); - if (!obj->maps) { - pr_warning("alloc maps for object failed\n"); - return -ENOMEM; - } - obj->nr_maps = nr_maps; - for (i = 0; i < nr_maps; i++) { struct bpf_map_def *def = &obj->maps[i].def; @@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, obj->maps[i].fd = -1; /* Save map definition into obj->maps */ - *def = ((struct bpf_map_def *)data)[i]; + *def = *(struct bpf_map_def *)(data + obj->maps[i].offset); } Here, nr_maps is still size / sizeof(struct bpf_map_def), so obj->maps[i] can be invalid. return 0; } static int -bpf_object__init_maps_name(struct bpf_object *obj) +bpf_object__init_maps_symbol(struct bpf_object *obj) { int i; + int nr_maps = 0; Elf_Data *symbols = obj->efile.symbols; + size_t map_idx = 0; if (!symbols || obj->efile.maps_shndx < 0) return -EINVAL; + /* get the number of maps */ + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { + GElf_Sym sym; + + if (!gelf_getsym(symbols, i, &sym)) + continue; + if (sym.st_shndx != obj->efile.maps_shndx) + continue; + nr_maps++; + } + + obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); + if (!obj->maps) { + pr_warning("alloc maps for object failed\n"); + return -ENOMEM; + } + obj->nr_maps = nr_maps; + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { GElf_Sym sym; - size_t map_idx; const char *map_name; if (!gelf_getsym(symbols, i, &sym)) @@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj) map_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, sym.st_name); - map_idx = sym.st_value / sizeof(struct bpf_map_def); if (map_idx >= obj->nr_maps) { pr_warning("index of map \"%s\" is buggy: %zu > %zu\n", map_name, map_idx, obj->nr_maps); continue; } + obj->maps[map_idx].offset = sym.st_value; obj->maps[map_idx].name = strdup(map_name); if (!obj->maps[map_idx].name) { pr_warning("failed to alloc map name\n"); @@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj) } pr_debug("map %zu is \"%s\"\n", map_idx, obj->maps[map_idx].name); + map_idx++; } return 0; } @@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj) data->d_buf, data->d_size); else if (strcmp(name, "maps") == 0) { - err = bpf_object__init_maps(obj, data->d_buf, - data->d_size); obj->efile.maps_shndx = idx; } else if (sh.sh_type == S
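The bug this patch fixes can be shown without any libelf code: once anything else sits in the "maps" section (or the definitions differ in size), the i-th map no longer lives at `i * sizeof(struct bpf_map_def)`, and only the symbol's st_value offset locates it. A toy sketch with a simplified map definition and a hand-built section buffer (not real ELF parsing):

```c
#include <stddef.h>
#include <string.h>

/* Simplified map definition, standing in for the real struct bpf_map_def. */
struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

/* Old behavior: treat the section as a plain array of definitions. */
struct bpf_map_def map_by_index(const unsigned char *sec, size_t i)
{
	struct bpf_map_def def;

	memcpy(&def, sec + i * sizeof(def), sizeof(def));
	return def;
}

/* Fixed behavior: locate the definition by the symbol's st_value offset,
 * as bpf_object__init_maps_symbol() does. */
struct bpf_map_def map_by_offset(const unsigned char *sec, size_t st_value)
{
	struct bpf_map_def def;

	memcpy(&def, sec + st_value, sizeof(def));
	return def;
}

/* Build a "maps" section whose second definition does NOT sit at
 * 1 * sizeof(struct bpf_map_def): 8 bytes of other data lie in between.
 * Returns the offset that would be the st_value of the second map symbol. */
size_t build_section(unsigned char *sec, size_t len,
		     const struct bpf_map_def *a, const struct bpf_map_def *b)
{
	size_t off_b = sizeof(*a) + 8;

	memset(sec, 0xff, len);		/* the in-between bytes are "other data" */
	memcpy(sec, a, sizeof(*a));
	memcpy(sec + off_b, b, sizeof(*b));
	return off_b;
}
```

Indexing reads the 8-byte gap as part of the second definition and returns garbage, while the offset-based lookup recovers the map intact — the same reason two-map ELF objects failed to load.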
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
Hi Eric, are you still working on this patch set? Now I know why the maps section is not a simple array, thanks to a patch set from Joe Stringer: https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html So I think this patch is really useful. Are you going to resend the whole patch set? If not, let me collect this patch 7/8 into my local code base and send it to Arnaldo with my other patches. Thank you. On 2016/10/17 5:18, Eric Leblond wrote: It is not correct to assimilate the elf data of the maps section to an array of map definitions. In fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF with two maps not to load correctly. Signed-off-by: Eric Leblond --- tools/lib/bpf/libbpf.c | 50 +++--- 1 file changed, 35 insertions(+), 15 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 1fe4532..f72628b 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -186,6 +186,7 @@ struct bpf_program { struct bpf_map { int fd; char *name; + size_t offset; struct bpf_map_def def; void *priv; bpf_map_clear_priv_t clear_priv; @@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, pr_debug("maps in %s: %zd bytes\n", obj->path, size); - obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); - if (!obj->maps) { - pr_warning("alloc maps for object failed\n"); - return -ENOMEM; - } - obj->nr_maps = nr_maps; - for (i = 0; i < nr_maps; i++) { struct bpf_map_def *def = &obj->maps[i].def; @@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data, obj->maps[i].fd = -1; /* Save map definition into obj->maps */ - *def = ((struct bpf_map_def *)data)[i]; + *def = *(struct bpf_map_def *)(data + obj->maps[i].offset); } return 0; } static int -bpf_object__init_maps_name(struct bpf_object *obj) +bpf_object__init_maps_symbol(struct bpf_object *obj) { int i; + int nr_maps = 0; Elf_Data *symbols = obj->efile.symbols; + size_t map_idx = 0; if (!symbols
|| obj->efile.maps_shndx < 0) return -EINVAL; + /* get the number of maps */ + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { + GElf_Sym sym; + + if (!gelf_getsym(symbols, i, &sym)) + continue; + if (sym.st_shndx != obj->efile.maps_shndx) + continue; + nr_maps++; + } + + obj->maps = calloc(nr_maps, sizeof(obj->maps[0])); + if (!obj->maps) { + pr_warning("alloc maps for object failed\n"); + return -ENOMEM; + } + obj->nr_maps = nr_maps; + for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) { GElf_Sym sym; - size_t map_idx; const char *map_name; if (!gelf_getsym(symbols, i, &sym)) @@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj) map_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, sym.st_name); - map_idx = sym.st_value / sizeof(struct bpf_map_def); if (map_idx >= obj->nr_maps) { pr_warning("index of map \"%s\" is buggy: %zu > %zu\n", map_name, map_idx, obj->nr_maps); continue; } + obj->maps[map_idx].offset = sym.st_value; obj->maps[map_idx].name = strdup(map_name); if (!obj->maps[map_idx].name) { pr_warning("failed to alloc map name\n"); @@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj) } pr_debug("map %zu is \"%s\"\n", map_idx, obj->maps[map_idx].name); + map_idx++; } return 0; } @@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj) data->d_buf, data->d_size); else if (strcmp(name, "maps") == 0) { - err = bpf_object__init_maps(obj, data->d_buf, - data->d_size); obj->efile.maps_shndx = idx; } else if (sh.sh_type == SHT_SYMTAB) { if (obj->efile.symbols) { @@ -698,8 +710,16 @@ static int bpf_object__elf_collect(struct bpf_object *obj) pr_warning("Corrupted ELF file: index of strtab invalid\n"); return LIBBPF_ERRNO__FORMAT; } - if (obj->efile.
Re: [PATCH 2/2] perf_event_open.2: Document write_backward
On 2016/10/22 18:05, Michael Kerrisk (man-pages) wrote: On 10/21/2016 11:25 PM, Vince Weaver wrote: On Fri, 21 Oct 2016, Wang Nan wrote: context_switch : 1, /* context switch data */ - - __reserved_1 : 37; + write_backward : 1, /* Write ring buffer from end to beginning */ + __reserved_1 : 36; This removes a blank line, not sure if intentional or not. Maybe it would be better to keep it. I don't feel too strongly about this though. +.IR "write_backward" " (since Linux 4.6)" It wasn't committed until Linux 4.7, from what I can tell? Yes, that's my recollection too. +This makes the resulting event use a backward ring-buffer, which +writes samples from the end of the ring-buffer. + +It is not allowed to connect events with backward and forward +ring-buffer settings together using +.B PERF_EVENT_IOC_SET_OUTPUT. + +Backward ring-buffer is useful when the ring-buffer is overwritable +(created by readonly +.BR mmap (2) +). A ring buffer is over-writable when it is mmapped readonly? Is this a hard requirement? I'd like to explain the over-writable ring buffer in patch 1/1 like this: diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2 index fade28c..561331c 100644 --- a/man2/perf_event_open.2 +++ b/man2/perf_event_open.2 @@ -1687,6 +1687,15 @@ the .I data_tail value should be written by user space to reflect the last read data. In this case, the kernel will not overwrite unread data. + +When the mapping is read only (without +.BR PROT_WRITE ), +setting .I data_tail is not allowed. +In this case, the kernel will overwrite data when samples arrive, unless +the ring buffer is paused by a +.BR PERF_EVENT_IOC_PAUSE_OUTPUT +.BR ioctl (2) +system call before reading. .TP .IR data_offset " (since Linux 4.1)" .\" commit e8c6deac69629c0cb97c3d3272f8631ef17f8f0f The ring buffer becomes over-writable because there's no way to tell the kernel the position of the last read data when mmapped read-only. Can you set the read-backwards bit if not mapped readonly?
I don't understand why we need read-backwards. Mapping with PROT_WRITE is the *default* setting. In that case a user program like perf is able to tell the kernel the reading position by writing to 'data_tail', so the kernel won't overwrite unread data and the buffer is read forward. Or do you think the naming is confusing? The name 'write_backward' is kernel-centric: it adjusts kernel behavior. The kernel *writes* data, so I call it 'write_backward'. The name 'read-backwards' is user-centric, because the user *reads* data. Thank you.
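The data_tail protocol described in the proposed man-page text can be sketched in code: acquire-load data_head, consume records up to it, then release-store the new data_tail so the kernel knows the data was read. The struct below is a minimal stand-in for the head/tail fields of the real struct perf_event_mmap_page (here backed by plain memory rather than an actual mmap of a perf event):

```c
#include <stdint.h>

/* Minimal stand-in for the relevant fields of struct perf_event_mmap_page. */
struct mmap_page {
	uint64_t data_head;	/* advanced by the kernel */
	uint64_t data_tail;	/* advanced by user space -- only possible
				 * when the page is mapped with PROT_WRITE */
};

/* Consume everything currently in the ring and return the new tail.
 * With a read-only mapping the final data_tail store is impossible,
 * which is why the kernel must then treat the buffer as overwritable. */
uint64_t consume(struct mmap_page *pg, void (*cb)(uint64_t pos))
{
	uint64_t head = __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
	uint64_t tail = pg->data_tail;

	while (tail < head) {
		if (cb)
			cb(tail);	/* real code parses perf_event_header
					 * records at (tail & (size - 1)) */
		tail++;
	}

	/* Publish the read position; pairs with the kernel's head update. */
	__atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);
	return tail;
}
```

The acquire/release pairing here mirrors the barrier requirements documented for the perf mmap page; a real consumer would also wrap positions by the buffer size and handle variable-length records.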
Re: [PATCH 1/5] perf core: Introduce new ioctl options to pause and resume ring buffer
On 2016/10/21 15:06, Michael Kerrisk (man-pages) wrote: Hello Wangnan, The patch below seems to have landed in Linux 4.7, commit 86e7972f690c1017fd086cdfe53d8524e68c661c Could you draft a man-pages patch for this interface change, please? Or, failing that, a plain-text description that we can integrate into the man-page. I sent man-pages patches in March: https://patchwork.kernel.org/patch/8678861/ http://www.spinics.net/lists/linux-man/msg10177.html Let me resend them. Thank you.
Re: [PATCH 1/8] tools lib bpf: add error functions
On 2016/10/19 6:52, Joe Stringer wrote: On 16 October 2016 at 14:18, Eric Leblond wrote: The include of err.h is not explicitly needed in exported functions and it was causing include conflicts with some existing code due to redefining some macros. To fix this, let's have error handling functions provided by the library. Furthermore this will allow users to have a homogeneous API. Signed-off-by: Eric Leblond Does it need to return the error like this or should we just fix up the bpf_object__open() API to return errors in a simpler form? There's already libbpf_set_print(...) for outputting errors, is it reasonable to just change the library to return NULLs in error cases instead? Returning an error code to the caller lets the caller know what happened. Other subsystems in perf also do this. Perf hides libbpf's error output (making it silent unless -v is given), so it needs a way to receive libbpf's error codes. I think this patch is good; it decouples libbpf.h from kernel headers. Thank you.
Re: [PATCH 7/8] tools lib bpf: fix maps resolution
On 2016/10/17 5:18, Eric Leblond wrote: It is not correct to assimilate the elf data of the maps section to an array of map definitions. In fact the sizes differ. The offset provided in the symbol section has to be used instead. This patch fixes a bug causing an ELF with two maps not to load correctly. Could you please give an example so we can understand why the 'maps' section is not an array? Thank you.
Re: [PATCH 6/8] tools lib bpf: improve warning
On 2016/10/17 5:18, Eric Leblond wrote: Signed-off-by: Eric Leblond Please add some commit messages. Thank you. --- tools/lib/bpf/libbpf.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 7cd341e..1fe4532 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -802,7 +802,8 @@ bpf_object__create_maps(struct bpf_object *obj) size_t j; int err = *pfd; - pr_warning("failed to create map: %s\n", + pr_warning("failed to create map (name: '%s'): %s\n", + obj->maps[i].name, strerror(errno)); for (j = 0; j < i; j++) zclose(obj->maps[j].fd);
Re: [PATCH 5/8] tools lib bpf: add missing functions
On 2016/10/17 5:18, Eric Leblond wrote: Some functions were missing in the library to be able to use it in the case where the userspace is handling the maps in the kernel. The patch also renames functions to have a homogeneous naming convention. Signed-off-by: Eric Leblond --- tools/lib/bpf/bpf.c| 35 ++- tools/lib/bpf/bpf.h| 2 -- tools/lib/bpf/libbpf.h | 5 + 3 files changed, 39 insertions(+), 3 deletions(-) diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 4212ed6..c0e07bd 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -25,6 +25,7 @@ #include #include #include "bpf.h" +#include "libbpf.h" /* * When building perf, unistd.h is overrided. __NR_bpf is @@ -97,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); } -int bpf_map_update_elem(int fd, void *key, void *value, +int bpf_map__update_elem(int fd, void *key, void *value, u64 flags) Please don't use the '__'-style API here. It is easily confused with the bpf_map__*() functions in libbpf.h. They are APIs at different levels. bpf_map__*() are APIs for 'struct bpf_map', an object introduced by libbpf and defined in libbpf.h. bpf_map_*() APIs operate on fds, which are objects defined by the kernel; they are declared in bpf.h. In libbpf, bpf.h directly operates on kernel objects (fds), so its APIs are named bpf_map_*(); libbpf.h operates on 'struct bpf_map' objects, so its APIs are named bpf_map__*(). libbpf.h and bpf.h are independent of each other.
{ union bpf_attr attr; @@ -110,3 +111,35 @@ int bpf_map_update_elem(int fd, void *key, void *value, return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); } + +int bpf_map__lookup_elem(int fd, void *key, void *value) +{ + union bpf_attr attr = { + .map_fd = fd, + .key = ptr_to_u64(key), + .value = ptr_to_u64(value), + }; + + return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); +} + +int bpf_map__delete_elem(int fd, void *key) +{ + union bpf_attr attr = { + .map_fd = fd, + .key = ptr_to_u64(key), + }; + + return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); +} + +int bpf_map__get_next_key(int fd, void *key, void *next_key) +{ + union bpf_attr attr = { + .map_fd = fd, + .key = ptr_to_u64(key), + .next_key = ptr_to_u64(next_key), + }; + + return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); +} diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h index e8ba540..5ca834a 100644 --- a/tools/lib/bpf/bpf.h +++ b/tools/lib/bpf/bpf.h @@ -33,6 +33,4 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns, u32 kern_version, char *log_buf, size_t log_buf_sz); -int bpf_map_update_elem(int fd, void *key, void *value, - u64 flags); #endif diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index a18783b..dfb46d0 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -207,6 +207,11 @@ bpf_map__next(struct bpf_map *map, struct bpf_object *obj); int bpf_map__fd(struct bpf_map *map); const struct bpf_map_def *bpf_map__def(struct bpf_map *map); const char *bpf_map__name(struct bpf_map *map); +int bpf_map__update_elem(int fd, void *key, void *value, + uint64_t flags); +int bpf_map__lookup_elem(int fd, void *key, void *value); +int bpf_map__delete_elem(int fd, void *key); +int bpf_map__get_next_key(int fd, void *key, void *next_key); As what we have discussed, the newly introduced functions should be added in bpf.h. Thank you.
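To make the two-level naming concrete: an fd-level call in bpf.h is just a thin wrapper that packs a zeroed union bpf_attr and invokes the bpf(2) syscall, with no 'struct bpf_map' involved. A sketch in the style of the existing bpf.c (the test only calls it with an invalid fd, so no BPF privileges are needed):

```c
#include <errno.h>
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static __u64 ptr_to_u64(const void *ptr)
{
	return (__u64)(unsigned long)ptr;
}

static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
{
	return syscall(__NR_bpf, cmd, attr, size);
}

/* fd-level lookup: it operates on a kernel object (a map fd), hence the
 * single-underscore bpf_map_*() name rather than the bpf_map__*() form
 * reserved for 'struct bpf_map' in libbpf.h. */
int bpf_map_lookup_elem(int fd, const void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));	/* the kernel rejects non-zero unused fields */
	attr.map_fd = fd;
	attr.key = ptr_to_u64(key);
	attr.value = ptr_to_u64(value);

	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

The same shape serves bpf_map_update_elem(), bpf_map_delete_elem() and bpf_map_get_next_key(): only the command and the attr fields filled in change.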
Re: [PATCH 4/8] tools lib bpf: export function to set type
On 2016/10/17 5:18, Eric Leblond wrote: Current API was not allowing the user to set a type like socket filter. To avoid a setter function for each type, the patch simply exports a set function that takes the type in parameter. Signed-off-by: Eric Leblond --- tools/lib/bpf/libbpf.c | 19 +-- tools/lib/bpf/libbpf.h | 3 +++ 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 90932f1..7cd341e 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1336,26 +1336,25 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n) return fd; } -static void bpf_program__set_type(struct bpf_program *prog, - enum bpf_prog_type type) +int bpf_program__set_type(struct bpf_program *prog, unsigned int type) { + if (!prog) + return -EINVAL; + if (type >= __MAX_BPF_PROG_TYPE) + return -EINVAL; + prog->type = type; + return 0; } int bpf_program__set_tracepoint(struct bpf_program *prog) { - if (!prog) - return -EINVAL; - bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT); - return 0; + return bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT); } int bpf_program__set_kprobe(struct bpf_program *prog) { - if (!prog) - return -EINVAL; - bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE); - return 0; + return bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE); } static bool bpf_program__is_type(struct bpf_program *prog, diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index e40c8d3..a18783b 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -173,6 +173,9 @@ int bpf_program__set_kprobe(struct bpf_program *prog); bool bpf_program__is_tracepoint(struct bpf_program *prog); bool bpf_program__is_kprobe(struct bpf_program *prog); +int bpf_program__set_type(struct bpf_program *prog, + unsigned int type); + Although you don't include uapi/linux/bpf.h in this patch, logically you add this dependency. 
Please also add bpf_program__set_socket_filter() and bpf_program__is_socket_filter(), like what we do for tracepoint. This way libbpf.h stays independent of kernel headers. We can use a macro in both the .h and the .c. Thank you.
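The macro approach suggested above can be sketched as follows: one macro stamps out the set/is pair per program type, so libbpf.h keeps exposing only kernel-independent function names. The struct and macro name here are simplified placeholders, not the real libbpf internals:

```c
#include <errno.h>
#include <linux/bpf.h>
#include <stdbool.h>

/* Toy bpf_program, standing in for the real (opaque) libbpf struct. */
struct bpf_program {
	enum bpf_prog_type type;
};

static int bpf_program__set_type(struct bpf_program *prog,
				 enum bpf_prog_type type)
{
	if (!prog)
		return -EINVAL;
	prog->type = type;
	return 0;
}

/* Hypothetical macro: generates bpf_program__set_NAME() and
 * bpf_program__is_NAME() for one program type.  The same invocation
 * lines can live in the .c, with matching prototypes in the .h. */
#define BPF_PROG_TYPE_FNS(NAME, TYPE)				\
int bpf_program__set_##NAME(struct bpf_program *prog)		\
{								\
	return bpf_program__set_type(prog, TYPE);		\
}								\
bool bpf_program__is_##NAME(struct bpf_program *prog)		\
{								\
	return prog && prog->type == TYPE;			\
}

BPF_PROG_TYPE_FNS(socket_filter, BPF_PROG_TYPE_SOCKET_FILTER)
BPF_PROG_TYPE_FNS(kprobe, BPF_PROG_TYPE_KPROBE)
```

Callers then write bpf_program__set_socket_filter(prog) without ever seeing the enum from uapi/linux/bpf.h.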
Re: [PATCH 3/8] tools: Sync tools/include/uapi/linux/bpf.h with the kernel
On 2016/10/17 5:18, Eric Leblond wrote: Signed-off-by: Eric Leblond Commit message is required. Thank you. --- tools/include/uapi/linux/bpf.h | 52 ++ 1 file changed, 52 insertions(+) diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 9e5fc16..570287f 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -95,6 +95,8 @@ enum bpf_prog_type { BPF_PROG_TYPE_SCHED_ACT, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_XDP, + BPF_PROG_TYPE_PERF_EVENT, + __MAX_BPF_PROG_TYPE, }; #define BPF_PSEUDO_MAP_FD 1 @@ -375,6 +377,56 @@ enum bpf_func_id { */ BPF_FUNC_probe_write_user, + /** +* bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership of current task +* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type +* @index: index of the cgroup in the bpf_map +* Return: +* == 0 current failed the cgroup2 descendant test +* == 1 current succeeded the cgroup2 descendant test +*< 0 error +*/ + BPF_FUNC_current_task_under_cgroup, + + /** +* bpf_skb_change_tail(skb, len, flags) +* The helper will resize the skb to the given new size, +* to be used f.e. with control messages. +* @skb: pointer to skb +* @len: new skb length +* @flags: reserved +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_change_tail, + + /** +* bpf_skb_pull_data(skb, len) +* The helper will pull in non-linear data in case the +* skb is non-linear and not all of len are part of the +* linear section. Only needed for read/write with direct +* packet access. +* @skb: pointer to skb +* @len: len to make read/writeable +* Return: 0 on success or negative error +*/ + BPF_FUNC_skb_pull_data, + + /** +* bpf_csum_update(skb, csum) +* Adds csum into skb->csum in case of CHECKSUM_COMPLETE. +* @skb: pointer to skb +* @csum: csum to add +* Return: csum on success or negative error +*/ + BPF_FUNC_csum_update, + + /** +* bpf_set_hash_invalid(skb) +* Invalidate current skb>hash. 
+* @skb: pointer to skb +*/ + BPF_FUNC_set_hash_invalid, + __BPF_FUNC_MAX_ID, };
Re: [PATCH 1/8] tools lib bpf: add error functions
On 2016/10/17 5:18, Eric Leblond wrote: The include of err.h is not explicitly needed in exported functions and it was causing include conflicts with some existing code due to redefining some macros. To fix this, let's have error handling functions provided by the library. Furthermore this will allow users to have a homogeneous API. Signed-off-by: Eric Leblond --- tools/lib/bpf/libbpf.c | 11 +++ tools/lib/bpf/libbpf.h | 4 +++- 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index b699aea..90932f1 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -31,6 +31,7 @@ #include #include #include +#include #include #include @@ -1447,3 +1448,13 @@ bpf_object__find_map_by_name(struct bpf_object *obj, const char *name) } return NULL; } + +bool bpf__is_error(const void *ptr) Please use libbpf_is_error(), like libbpf_set_print. We use '__' because we want to use the OO concept. This utility is not OO. +{ + return IS_ERR(ptr); +} + +long bpf__get_error(const void *ptr) Same, please call it libbpf_get_error(). Thank you.
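The wrappers under discussion hide the kernel's err.h error-pointer trick from library users: a pointer value in the top MAX_ERRNO bytes of the address space encodes a negative errno, so a single return value can carry either a valid object pointer or an error. A self-contained sketch of the encoding behind the proposed libbpf_is_error()/libbpf_get_error() names (reimplemented here rather than pulled from err.h, since avoiding that include is the whole point):

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_ERRNO 4095

/* Encode -errno as a pointer, as the kernel's ERR_PTR() does. */
void *ERR_PTR(long err)
{
	return (void *)err;
}

/* A pointer in the last MAX_ERRNO bytes of the address space cannot be
 * a real allocation, so it is interpreted as an encoded error. */
bool libbpf_is_error(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Recover the negative errno, or 0 for a valid pointer. */
long libbpf_get_error(const void *ptr)
{
	if (!libbpf_is_error(ptr))
		return 0;
	return (long)ptr;
}
```

With these exported, callers of bpf_object__open() can test the returned pointer without including err.h and risking macro clashes with their own headers.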
Re: [PATCH 1/3] perf, tools: Handle events including .c and .o
On 2016/10/7 4:18, Arnaldo Carvalho de Melo wrote: Em Wed, Oct 05, 2016 at 12:47:10PM -0700, Andi Kleen escreveu: From: Andi Kleen This is a generic bug fix, but it helps with Sukadev's JSON event tree where such events can happen. Any event inclduing a .c/.o/.bpf currently triggers BPF compilation or loading and then an error. This can happen for some Intel JSON events, which cannot be used. Fix the scanner to only match for .o or .c or .bpf at the end. This will prevent loading multiple BPF scripts separated with comma, but I assume this is acceptable. So, I tried it with the example provided in the thread for a previous version of this patch (IIRC) and it still fails: [acme@jouet linux]$ perf stat -e '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}' -a -I 1000 ERROR: problems with path {unc_p_clockticks,unc_p_power_state_occupancy.c: No such file or directory event syntax error: '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}' \___ Failed to load {unc_p_clockticks,unc_p_power_state_occupancy.c from source: Error when compiling BPF scriptlet (add -v to see detail) Run 'perf list' for a list of valid events Usage: perf stat [] [] -e, --eventevent selector. use 'perf list' to list available events [acme@jouet linux]$ And with another event that for sure is available on this machine: [acme@jouet linux]$ perf stat -e '{uops_executed.core_cycles_ge_2}' -I 1000 usleep 10 ERROR: problems with path {uops_executed.c: No such file or directory event syntax error: '{uops_executed.core_cycles_ge_2}' \___ Failed to load {uops_executed.c from source: Error when compiling BPF scriptlet (add -v to see detail) Run 'perf list' for a list of valid events Usage: perf stat [] [] -e, --eventevent selector. use 'perf list' to list available events [acme@jouet linux]$ I thought this was due to the Makefile not noticing the change in the .l files, but I made sure I deleted the build dir and rebuilt from scratch, same problem. 
- Arnaldo Tested again, and thank you for giving us another chance to fix this :) The key problem here is not the ending '$' but the leading '{'. Flex's greedy matching policy causes this problem. According to the design of parse-events.l, when it sees something like '...{...}...', it first matches a 'group' in the '' scope, then rewinds to the INITIAL scope to match the events in the group. In the INITIAL scope, when it sees a '{', flex consumes this char and goes back to the '' scope to match the next event. It worked well before we matched BPF file paths using the unlimited '.*\.c', because '.*' will match the leading '{' in the INITIAL scope without consuming it. The simplest fix for this problem is the '.*' part: like what we define for 'event', don't match ',', '{' and '}'. Unlike 'event', '/' is required because this is a path. I will post a patch for it. Please test it again. Thank you.
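The difference between the greedy pattern and the restricted one can be demonstrated with POSIX regexes (flex matches anchored at the current scan position; the '^' below simulates that anchoring):

```c
#include <regex.h>

/* Return 1 if PATTERN matches starting at the very first character of S
 * (flex-style anchored matching), 0 if not, -1 on a bad pattern. */
int matches_at_start(const char *pattern, const char *s)
{
	regex_t re;
	regmatch_t m;
	int ret;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return -1;
	ret = regexec(&re, s, 1, &m, 0) == 0 && m.rm_so == 0;
	regfree(&re);
	return ret;
}
```

Against the failing event string, the greedy "^.*\.c" happily swallows the leading '{' and "finds" a bogus BPF path ending at ".cores_c0", while the restricted character class "[^,{}]+\.c" cannot consume '{', ',' or '}' and so refuses to fire, letting the scanner's '{' rule run as intended; a genuine path such as ./tests/my_prog.c still matches.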
Re: [PATCH v2 00/18] perf clang: Support compiling BPF script on the fly
On 2016/10/6 7:20, Arnaldo Carvalho de Melo wrote: Em Mon, Sep 26, 2016 at 07:26:54AM +, Wang Nan escreveu: This patch set adds a builtin clang, allowing perf to compile BPF scripts on the fly. This is the first step to implement what I announced at LinuxCon 2016 NA: Ok, so I refreshed this series to apply against my latest perf/core and put it in a tmp.perf/builtin-clang, will continue testing it tomorrow after checking if fedora24 has those llvm-dev and libclang-dev that can be used with this, do you know if it works on that distro? Sorry for the delay. I only tested on Ubuntu. I can see an llvm-static package for Fedora: https://apps.fedoraproject.org/packages/llvm-static/ but for clang I can find only dynamic libraries: https://apps.fedoraproject.org/packages/clang-libs/ I think we can make it work by changing 'libclang$(l).a' to 'libclang$(l).so' in commit 1125e7f6cf29. Thank you.
Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang
On 2016/9/27 7:58, Arnaldo Carvalho de Melo wrote:

On Sep 26, 2016 8:47 PM, "Alexei Starovoitov" <alexei.starovoi...@gmail.com> wrote:
>
> On Mon, Sep 26, 2016 at 09:49:30AM +0800, Wangnan (F) wrote:
> >
> > On 2016/9/24 23:16, Alexei Starovoitov wrote:
> > > On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
> > > > This patch set is the first step to implement features I announced
> > > > in LinuxCon NA 2016. See page 31 of:
> > > >
> > > > http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf
> > > >
> > > > This patch set links LLVM and Clang libraries to perf, so perf
> > > > is able to compile BPF scripts to BPF objects on the fly.
> > >
> > > Nice!
> > > So a single perf binary won't have the llvm external dependency
> > > anymore, or will both ways be maintained?
> > > The command line stays the same?
> >
> > Yes. This patch set doesn't change the interface. It compiles BPF
> > scripts with the builtin clang and, if that fails, falls back to the
> > external clang.
> >
> > > If I understand the patches correctly, this set is establishing
> > > the basic functionality, with more complex features coming?
> >
> > Yes. The following steps are:
> >
> > 1. Ease of use improvement: automatically include BPF function
> >    declarations and macros.

+1

> > 2. Perf's hook: compile part of the BPF script into native code and
> >    run it in perf when something happens. Create a channel to
> >    coordinate BPF and native code using the bpf-output event.

+1

> > 3. Define a new language to support common profiling tasks. I'm not
> >    very clear what the new language should be. It may look like lua;
> >    perf converts it to C, then to LLVM IR with the builtin clang.
>
> Many tracing languages were invented in the past.
> At this point I'm not sure what exactly a new language will solve.
> To make it easier to write bpf programs?
> I think it will be more fruitful to tweak clang/llvm to add
> good warnings/errors for cases where we know that C is not going
> to be compiled into code that the kernel verifier will accept.
> Like we can complain about loops, uninitialized variables,
> non-inlined and unknown helper functions... all from clang/llvm.
> imo that would be the better path forward and will help
> both tracing and networking users that write in this restricted C.

++1

OK. Now let's focus on the first two goals. After that let's consider
how to help with writing BPF programs.

Thank you.
Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang
On 2016/9/24 23:16, Alexei Starovoitov wrote:
> On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
>> This patch set is the first step to implement features I announced
>> in LinuxCon NA 2016. See page 31 of:
>>
>> http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf
>>
>> This patch set links LLVM and Clang libraries to perf, so perf
>> is able to compile BPF scripts to BPF objects on the fly.
>
> Nice!
> So a single perf binary won't have the llvm external dependency
> anymore, or will both ways be maintained?
> The command line stays the same?

Yes. This patch set doesn't change the interface. It compiles BPF
scripts with the builtin clang and, if that fails, falls back to the
external clang.

> If I understand the patches correctly, this set is establishing
> the basic functionality, with more complex features coming?

Yes. The following steps are:

1. Ease of use improvement: automatically include BPF function
   declarations and macros.

2. Perf's hook: compile part of the BPF script into native code and run
   it in perf when something happens. Create a channel to coordinate
   BPF and native code using the bpf-output event.

3. Define a new language to support common profiling tasks. I'm not
   very clear what the new language should be. It may look like lua;
   perf converts it to C, then to LLVM IR with the builtin clang.

Thank you.
Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1
On 2016/9/22 23:41, Arnaldo Carvalho de Melo wrote:
> Em Wed, Sep 21, 2016 at 03:56:20AM +, Wang Nan escreveu:
>> Before this patch perf panics if kptr_restrict is set to 1 and perf is
>> owned by root with suid set:
>>
>> $ whoami
>> wangnan
>> $ ls -l ./perf
>> -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
>> $ cat /proc/sys/kernel/kptr_restrict
>> 1
>> $ cat /proc/sys/kernel/perf_event_paranoid
>> -1
>> $ ./perf record -a
>> Segmentation fault (core dumped)
>
> Trying to reproduce this here, and failing, where am I making a mistake?

I tried again. It is not related to paranoid. If
/proc/sys/kernel/kptr_restrict is set to 1 and perf runs with
(euid == 0 && uid != 0) it will panic at:

int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
...
	kmap = map__kmap(map);
	size = snprintf(event->mmap.filename, sizeof(event->mmap.filename),
			"%s%s", mmap_name,
			kmap->ref_reloc_sym->name) + 1;  <-- *kmap->ref_reloc_sym is NULL*
...
}

because the following call chain:

machine__create_kernel_maps
 -> machine__get_running_kernel_start

returns NULL, so kmap->ref_reloc_sym is never set.

A non-root user never reaches the crash point because in
perf_event__synthesize_kernel_mmap() symbol_conf.kptr_restrict is true,
so perf_event__synthesize_kernel_mmap directly returns failure.

Have you tried the following 'cat' test?

>> The reason is perf assumes it is allowed to read kptrs from
>> /proc/kallsyms when euid is root, but in fact the kernel doesn't allow
>> reading kptrs when euid and uid do not match:
>>
>> $ cp /bin/cat .
>> $ sudo chown root:root ./cat
>> $ sudo chmod u+s ./cat
>> $ cat /proc/kallsyms | grep do_fork
>> T _do_fork   <--- kptr is hidden even though euid is root
>> $ sudo cat /proc/kallsyms | grep do_fork
>> 81080230 T _do_fork
>>
>> See lib/vsprintf.c for the kernel side code.
>>
>> This patch fixes this problem by checking both uid and euid.
>
> Humm, can't we just do:
>
> -		value = (geteuid() != 0) ?
> +		value = (getuid() != 0) ?

No.
Though not very useful, with chown and chmod u+s one can make
((uid == 0) && (euid != 0)):

# chown wangnan:wangnan ./perf
# chmod u+s ./perf

Then perf fails if run by root:

# ./perf record -a
Segmentation fault (core dumped)

But it is okay for a normal user.

Thank you.

> I did it here and it seems to work:
>
> [acme@jouet linux]$ whoami
> acme
> [acme@jouet linux]$ ls -la ~/bin/perf
> -rwsr-xr-x. 2 root root 15539952 Sep 22 12:38 /home/acme/bin/perf
> [acme@jouet linux]$ cat /proc/sys/kernel/kptr_restrict
> 1
> [acme@jouet linux]$ cat /proc/sys/kernel/perf_event_paranoid
> -1
> [acme@jouet linux]$ ~acme/bin/perf record -a
> Warning: File /home/acme/.perfconfig not owned by current user or root, ignoring it.
> WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
> check /proc/sys/kernel/kptr_restrict.
>
> Samples in kernel functions may not be resolved if a suitable vmlinux
> file is not found in the buildid cache or in the vmlinux path.
>
> Samples in kernel modules won't be resolved at all.
>
> If some relocation was applied (e.g. kexec) symbols may be misresolved
> even with a suitable vmlinux or kallsyms file.
>
> Couldn't record kernel reference relocation symbol
> Symbol resolution may be skewed if relocation was used (e.g. kexec).
> Check /proc/kallsyms permission or run as root.
> ^C[ perf record: Woken up 17 times to write data ]
> [ perf record: Captured and wrote 6.474 MB perf.data (90771 samples) ]
> [acme@jouet linux]$ perf report -D 2>&1 | grep PERF_RECORD_SAMPLE | wc -l
> 90770
> [acme@jouet linux]$ perf evlist
> Warning: File /home/acme/.perfconfig not owned by current user or root, ignoring it.
> cycles:ppp
> [acme@jouet linux]$ ls -la ~/.perfconfig
> -rw-rw-r--. 1 acme acme 27 Aug 12 17:57 /home/acme/.perfconfig
> [acme@jouet linux]$
>
> - Arnaldo
>
>> Signed-off-by: Wang Nan
>> Cc: Arnaldo Carvalho de Melo
>> ---
>> Resend with a meaningless blank line removed.
>> ---
>>  tools/perf/util/symbol.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
>> index 19c9c55..c55e781 100644
>> --- a/tools/perf/util/symbol.c
>> +++ b/tools/perf/util/symbol.c
>> @@ -1947,7 +1947,7 @@ static bool symbol__read_kptr_restrict(void)
>>  		char line[8];
>>
>>  		if (fgets(line, sizeof(line), fp) != NULL)
>> -			value = (geteuid() != 0) ?
>> +			value = ((geteuid() != 0) || (getuid() != 0)) ?
>>  				(atoi(line) != 0) :
>>  				(atoi(line) == 2);
>> --
>> 1.8.3.4
Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1
On 2016/9/21 11:48, Wang Nan wrote:
> Before this patch perf panics if kptr_restrict is set to 1 and perf is
> owned by root with suid set:
>
> $ whoami
> wangnan
> $ ls -l ./perf
> -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
> $ cat /proc/sys/kernel/kptr_restrict
> 1
> $ cat /proc/sys/kernel/perf_event_paranoid
> -1
> $ ./perf record -a
> Segmentation fault (core dumped)
>
> The reason is perf assumes it is allowed to read kptrs from
> /proc/kallsyms when euid is root, but in fact the kernel doesn't allow
> reading kptrs when euid and uid do not match:
>
> $ cp /bin/cat .
> $ sudo chown root:root ./cat
> $ sudo chmod u+s ./cat
> $ cat /proc/kallsyms | grep do_fork
> T _do_fork   <--- kptr is hidden even though euid is root
> $ sudo cat /proc/kallsyms | grep do_fork
> 81080230 T _do_fork
>
> See lib/vsprintf.c for the kernel side code.
>
> This patch fixes this problem by checking both uid and euid.
>
> Signed-off-by: Wang Nan
> Cc: Arnaldo Carvalho de Melo
> ---
>  tools/perf/util/symbol.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
> index 19c9c55..9528702 100644
> --- a/tools/perf/util/symbol.c
> +++ b/tools/perf/util/symbol.c
> @@ -1946,8 +1946,9 @@ static bool symbol__read_kptr_restrict(void)
>  	if (fp != NULL) {
>  		char line[8];
> +

Sorry for this blank line. Will fix it.

>  		if (fgets(line, sizeof(line), fp) != NULL)
> -			value = (geteuid() != 0) ?
> +			value = ((geteuid() != 0) || (getuid() != 0)) ?
>  				(atoi(line) != 0) :
>  				(atoi(line) == 2);
Re: [PATCH] perf, tools: Handle events including .c and .o
On 2016/9/18 22:56, Andi Kleen wrote:
> On Sun, Sep 18, 2016 at 06:20:04PM +0800, Wangnan (F) wrote:
>> On 2016/9/18 9:02, Andi Kleen wrote:
>>> From: Andi Kleen
>>>
>>> This is a generic bug fix, but it helps with Sukadev's JSON event tree
>>> where such events can happen. Any event including a .c/.o/.bpf
>>> currently triggers BPF compilation or loading and then an error. This
>>> can happen for some Intel JSON events, which cannot be used.
>>>
>>> Fix the scanner to only match .o or .c or .bpf at the end. This will
>>> prevent loading multiple BPF scripts separated with a comma, but I
>>> assume this is acceptable.
>>>
>>> Cc: wangn...@huawei.com
>>> Cc: suka...@linux.vnet.ibm.com
>>> Signed-off-by: Andi Kleen
>>
>> I tested '.c' in the middle of an event:
>>
>> # perf trace --event 'aaa.ccc'
>> invalid or unsupported event: 'aaa.ccc'
>> Run 'perf list' for a list of valid events
>> ...
>>
>> It is not recognized as a BPF source. So could you please provide an
>> example showing how this potential bug breaks the parsing of new
>> events?
>
> This is with the upcoming JSON uncore events:
>
> $ perf stat -e '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}' -a -I 1000
> ERROR: problems with path {unc_p_clockticks,unc_p_power_state_occupancy.c: No such file or directory
> event syntax error: '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}'
>                      \___ Failed to load {unc_p_clockticks,unc_p_power_state_occupancy.c from source: Error when compiling BPF scriptlet (add -v to see detail)
> Run 'perf list' for a list of valid events
>
> -Andi

I see, and your patch solves problems like this.

Tested-by: Wang Nan