Re: perf tools: remove option --tail-synthesize ?

2018-11-21 Thread Wangnan (F)



On 2018/11/21 21:11, Arnaldo Carvalho de Melo wrote:
> Em Wed, Nov 21, 2018 at 07:45:28AM +, Song Liu escreveu:
>> Hi,
>>
>> I found perf-record --tail-synthesize without --overwrite breaks symbols
>> for perf-script, perf-report, etc. For example:
>>
>> [root@]# ~/perf record -ag --tail-synthesize -- sleep 1
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 1.129 MB perf.data (3531 samples) ]
>> [root@]# ~/perf script | head
>> swapper 0 [000] 1250675.051971:  1 cycles:ppp:
>> 81009e15 [unknown] ([unknown])
>> 81196b19 [unknown] ([unknown])
>> 81196579 [unknown] ([unknown])
>> 81110ca7 [unknown] ([unknown])
>> 81a01f4a [unknown] ([unknown])
>> 81a017bf [unknown] ([unknown])
>> 8180e17a [unknown] ([unknown])
>>
>> perf-record with --overwrite does NOT have this issue.
>>
>> After digging into this, I found this issue is introduced by commit
>> a73e24d240bc136619d382b1268f34d75c9d25ce.
>>
>> Reverting this commit does fix the issue. However, on second thought,
>> I feel it is probably better to just drop --tail-synthesize, as it doesn't
>> make much sense without --overwrite. All we need is to do tail_synthesize
>> when --overwrite is set.
>>

In some cases we use --overwrite without --tail-synthesize. How about
enabling --tail-synthesize by default when --overwrite is selected, throwing
a warning when --tail-synthesize is used without --overwrite, and keeping a
--no-tail-synthesize option?
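The proposal above can be sketched as a tri-state option. This is a minimal
model with illustrative struct and field names, not the actual builtin-record
option code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model of the proposed defaulting; names are illustrative. */
struct opts {
	bool overwrite;
	bool tail_synthesize;
	bool tail_synthesize_set;	/* user passed --[no-]tail-synthesize */
};

static void apply_tail_synthesize_default(struct opts *o)
{
	if (!o->tail_synthesize_set)
		o->tail_synthesize = o->overwrite;	/* follow --overwrite */
	else if (o->tail_synthesize && !o->overwrite)
		fprintf(stderr, "warning: --tail-synthesize without --overwrite may break symbols\n");
}
```

With this shape, --overwrite alone enables tail synthesis, an explicit
--no-tail-synthesize is still honored, and the broken combination only warns.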

Thank you.

>> Thoughts?
> 
> Wang, wdyt?
> 
> - Arnaldo
> 



Re: [tools/perf] perf test LLVM failure on 4.9

2018-01-23 Thread Wangnan (F)



On 2018/1/23 20:37, Pintu Kumar wrote:

Hi All,

I am verifying all the perf tests on an Ubuntu 16 x86-64 platform using
kernel version 4.9.20.
I have installed several other packages, including clang and llvm.

But when I run 'perf test' I get some failures.
Specifically, 'perf test LLVM' is failing.

Please check the below error logs:


# perf test LLVM
35: Test LLVM searching and compiling:
35.1: Basic BPF llvm compiling test  : FAILED!
35.2: Test kbuild searching  : Skip
35.3: Compile source for BPF prologue generation test: Skip
35.4: Compile source for BPF relocation test : Skip


When I run with -v I get this:

-
# perf test -v LLVM
35: Test LLVM searching and compiling:
35.1: Basic BPF llvm compiling test  :
--- start ---
test child forked, pid 3304
Unablt to get kernel version from uname '4.9--amd-x86-64'
WARNING:unable to get correct kernel building directory.
Hint:   Set correct kbuild directory using 'kbuild-dir' option in [llvm]
 section of ~/.perfconfig or set it to "" to suppress kbuild
 detection.

Unablt to get kernel version from uname '4.9--amd-x86-64'


int _version SEC("version") = LINUX_VERSION_CODE;
' | $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS
-DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS
$KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign
-working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o
-
libbpf: 2129190-4a0 doesn't provide kernel version
Failed to parse test case 'Basic BPF llvm compiling test'
test child finished with -1
---- end ----
Test LLVM searching and compiling subtest 0: FAILED!
-


The basic BPF LLVM compiling test is failing.
It reports that perf could not read the kernel version from uname.
Is it because of this that 'perf test LLVM' is failing?

My uname says:
4.9--amd-x86-64


'perf test LLVM' requires a well-formed uname string (%d.%d.%d). Please see
fetch_kernel_version in tools/perf. If your distro uses another method to
report the kernel version number (so the sublevel can't be determined from
your uname string), please help us improve that function like what we have
done for Ubuntu (checking /proc/version_signature).
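The failure mode can be reproduced standalone. The sketch below models the
check that version parsing depends on (an illustration, not the exact perf
implementation): all three %d.%d.%d fields must parse so a version code can
be computed, and "4.9--amd-x86-64" stops after "4.9":

```c
#include <assert.h>
#include <stdio.h>

/* Sketch of kernel-release parsing: returns -1 when the release string
 * does not yield version, patchlevel and sublevel. */
static int release_to_version_code(const char *release, unsigned int *code)
{
	int version, patchlevel, sublevel;

	if (sscanf(release, "%d.%d.%d", &version, &patchlevel, &sublevel) != 3)
		return -1;	/* e.g. "4.9--amd-x86-64": only 2 fields match */
	*code = (version << 16) + (patchlevel << 8) + sublevel;
	return 0;
}
```

So even though the Makefile says 4.9.20, the uname release string is what
gets parsed, and that string is malformed on this system.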



My kernel version is: 4.9.20 (from Makefile)

So, I don't think there is any problem with the kernel version.

If anybody has come across this problem, please let me know how to
resolve this issue.



Thank You!

Regards,
Pintu





Re: perf test BPF failing on 4.15.0-rc6

2018-01-03 Thread Wangnan (F)



On 2018/1/4 4:13, Arnaldo Carvalho de Melo wrote:

Em Wed, Jan 03, 2018 at 03:33:07PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Wed, Jan 03, 2018 at 03:27:01PM -0300, Arnaldo Carvalho de Melo escreveu:

Continuing investigation...

After applying the fallback patch to allow new tools to work with older
kernels:

[root@felicio ~]# perf test bpf
39: BPF filter:
39.1: Basic BPF filtering : Ok
39.2: BPF pinning : Ok
39.3: BPF prologue generation : Ok
39.4: BPF relocation checker  : Ok
[root@felicio ~]# uname -a
Linux felicio.ghostprotocols.net 4.13.0-rc7+ #1 SMP Mon Sep 11 13:56:18 -03 
2017 x86_64 x86_64 x86_64 GNU/Linux
[root@felicio ~]# rpm -q glibc
glibc-2.17-157.el7_3.2.x86_64
[root@felicio ~]#

After applying the patch below I get to, which is what I am trying to
fix now:

[root@jouet ~]# perf test bpf
39: BPF filter:
39.1: Basic BPF filtering : Ok
39.2: BPF pinning : Ok
39.3: BPF prologue generation : FAILED!
39.4: BPF relocation checker  : Skip
[root@jouet ~]#

Update the patch to the one at the end of this message to make it work
with older glibcs, so that we ask for epoll_pwait() and hook into that
as well.

Now checking why 39.3 fails...

Couldn't reproduce after fixing up some kernel build problems, the patch
below is all I need to have this working with both Fedora 27 and RHEL7,
please take a look and see if it continues to work on your systems,


It works for me. Thank you.

Since we test epoll_pwait, we'd better correct function names:


diff --git a/tools/perf/tests/bpf-script-example.c 
b/tools/perf/tests/bpf-script-example.c

index 268e5f8..e4123c1 100644
--- a/tools/perf/tests/bpf-script-example.c
+++ b/tools/perf/tests/bpf-script-example.c
@@ -31,8 +31,8 @@ struct bpf_map_def SEC("maps") flip_table = {
 .max_entries = 1,
 };

-SEC("func=SyS_epoll_wait")
-int bpf_func__SyS_epoll_wait(void *ctx)
+SEC("func=SyS_epoll_pwait")
+int bpf_func__SyS_epoll_pwait(void *ctx)
 {
 int ind =0;
 int *flag = bpf_map_lookup_elem(&flip_table, &ind);
diff --git a/tools/perf/tests/bpf.c b/tools/perf/tests/bpf.c
index 34c22cd..a8f9095 100644
--- a/tools/perf/tests/bpf.c
+++ b/tools/perf/tests/bpf.c
@@ -19,13 +19,13 @@

 #ifdef HAVE_LIBBPF_SUPPORT

-static int epoll_wait_loop(void)
+static int epoll_pwait_loop(void)
 {
 int i;

 /* Should fail NR_ITERS times */
 for (i = 0; i < NR_ITERS; i++)
-epoll_wait(-(i + 1), NULL, 0, 0);
+epoll_pwait(-(i + 1), NULL, 0, 0, NULL);
 return 0;
 }

@@ -68,7 +68,7 @@ static struct {
 "[basic_bpf_test]",
 "fix 'perf test LLVM' first",
 "load bpf object failed",
-&epoll_wait_loop,
+&epoll_pwait_loop,
 (NR_ITERS + 1) / 2,
 false,
 },
@@ -78,7 +78,7 @@ static struct {
 "[bpf_pinning]",
 "fix kbuild first",
 "check your vmlinux setting?",
-&epoll_wait_loop,
+&epoll_pwait_loop,
 (NR_ITERS + 1) / 2,
 true,
 },





Re: perf test BPF failing on 4.15.0-rc6

2018-01-02 Thread Wangnan (F)

And please check if the kprobe created by

 $ perf probe -v SyS_epoll_wait

works for the test program used by this testcase:

#include <sys/epoll.h>
#include <stddef.h>

#define NR_ITERS 100

static int epoll_wait_loop(void)
{
int i;

/* Should fail NR_ITERS times */
for (i = 0; i < NR_ITERS; i++)
epoll_wait(-(i + 1), NULL, 0, 0);
return 0;
}

int main()
{
epoll_wait_loop();
return 0;
}

In my configuration it gives 100 samples:

$ sudo ./perf record -e probe:SyS_epoll_wait /tmp/a.out
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.016 MB perf.data (100 samples) ]

I guess in your configuration you would get 3 tracepoints, and get 0
samples from them.


Thank you.

On 2018/1/3 12:42, Wangnan (F) wrote:
Unable to reproduce. In my kernel configuration, SyS_epoll_wait is not 
inlined at all.


From the log you sent, it seems that all 3 instances are attached.
This testcase should work if the last one (SyS_epoll_wait) gets probed
correctly.


Could you please have a look if the 3rd kprobe event 
(p:perf_bpf_probe/func_2 _text+2845824) works?


Thank you.

On 2018/1/3 2:59, Arnaldo Carvalho de Melo wrote:

Hi Wang,

I just updated my machine to Fedora 27 and 4.15.0-rc6 and the
only test failing for me is:

[root@jouet linux]# perf test bpf
39: BPF filter:
39.1: Basic BPF filtering : FAILED!
39.2: BPF pinning : Skip
39.3: BPF prologue generation : Skip
39.4: BPF relocation checker  : Skip
[root@jouet linux]#

I haven't checked but perhaps the problem is that SyS_epoll_wait seems
to now be inlined in three places and perhaps the eBPF proggie is being
added to just one of them?

Seemingly relevant excerpt:

Open Debuginfo file: /lib/modules/4.15.0-rc6/build/vmlinux
Try to find probe point from debuginfo.
Matched function: SyS_epoll_wait [2f40eb7]
found inline addr: 0x812b6ff1
Probe point found: compat_SyS_epoll_pwait+129
found inline addr: 0x812b6de7
Probe point found: SyS_epoll_pwait+135
found inline addr: 0x812b6c80
Probe point found: SyS_epoll_wait+0
Found 3 probe_trace_events.

- Arnaldo

P.S.: Full -v output:

[root@jouet linux]# uname -a
Linux jouet 4.15.0-rc6 #4 SMP Tue Jan 2 14:30:53 -03 2018 x86_64 
x86_64 x86_64 GNU/Linux

[root@jouet linux]# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap 
--enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto 
--prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared 
--enable-threads=posix --enable-checking=release --enable-multilib 
--with-system-zlib --enable-__cxa_atexit 
--disable-libunwind-exceptions --enable-gnu-unique-object 
--enable-linker-build-id --with-gcc-major-version-only 
--with-linker-hash-style=gnu --enable-plugin --enable-initfini-array 
--with-isl --enable-libmpx --enable-offload-targets=nvptx-none 
--without-cuda-driver --enable-gnu-indirect-function 
--with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux

Thread model: posix
gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)
[root@jouet linux]# clang -v
clang version 6.0.0 (http://llvm.org/git/clang.git 
56cc8f8880db2ebc433eeb6b6a707c101467a186) 
(http://llvm.org/git/llvm.git 3656d83960a4f3fedf6d8f19043abf52379f78c3)

Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[root@jouet linux]# perf test -v BPF
39: BPF filter:
39.1: Basic BPF filtering :
--- start ---
test child forked, pid 24304
Kernel build dir is set to /lib/modules/4.15.0-rc6/build
set env: KBUILD_DIR=/lib/modules/4.15.0-rc6/build
unset env: KBUILD_OPTS
include option is set to  -nostdinc -isystem 
/usr/lib/gcc/x86_64-redhat-linux/7/include 
-I/home/acme/git/linux/arch/x86/include 
-I./arch/x86/include/generated  -I/home/acme/git/linux/include 
-I./include -I/home/acme/git/linux/arch/x86/include/uapi 
-I./arch/x86/include/generated/uapi 
-I/home/acme/git/linux/include/uapi -I./include/generated/uapi 
-include /home/acme/git/linux/include/linux/kconfig.h

set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40f00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem 
/usr/lib/gcc/x86_64-redhat-linux/7/include 
-I/home/acme/git/linux/arch/x86/include 
-I./arch/x86/include/generated 

Re: perf test BPF failing on 4.15.0-rc6

2018-01-02 Thread Wangnan (F)
Unable to reproduce. In my kernel configuration, SyS_epoll_wait is not 
inlined at all.


From the log you sent, it seems that all 3 instances are attached. This
testcase should work if the last one (SyS_epoll_wait) gets probed correctly.


Could you please have a look if the 3rd kprobe event 
(p:perf_bpf_probe/func_2 _text+2845824) works?


Thank you.

On 2018/1/3 2:59, Arnaldo Carvalho de Melo wrote:

Hi Wang,

I just updated my machine to Fedora 27 and 4.15.0-rc6 and the
only test failing for me is:

[root@jouet linux]# perf test bpf
39: BPF filter:
39.1: Basic BPF filtering : FAILED!
39.2: BPF pinning : Skip
39.3: BPF prologue generation : Skip
39.4: BPF relocation checker  : Skip
[root@jouet linux]#

I haven't checked but perhaps the problem is that SyS_epoll_wait seems
to now be inlined in three places and perhaps the eBPF proggie is being
added to just one of them?

Seemingly relevant excerpt:

Open Debuginfo file: /lib/modules/4.15.0-rc6/build/vmlinux
Try to find probe point from debuginfo.
Matched function: SyS_epoll_wait [2f40eb7]
found inline addr: 0x812b6ff1
Probe point found: compat_SyS_epoll_pwait+129
found inline addr: 0x812b6de7
Probe point found: SyS_epoll_pwait+135
found inline addr: 0x812b6c80
Probe point found: SyS_epoll_wait+0
Found 3 probe_trace_events.

- Arnaldo

P.S.: Full -v output:

[root@jouet linux]# uname -a
Linux jouet 4.15.0-rc6 #4 SMP Tue Jan 2 14:30:53 -03 2018 x86_64 x86_64 x86_64 
GNU/Linux
[root@jouet linux]# gcc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap 
--enable-languages=c,c++,objc,obj-c++,fortran,ada,go,lto --prefix=/usr 
--mandir=/usr/share/man --infodir=/usr/share/info 
--with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared 
--enable-threads=posix --enable-checking=release --enable-multilib 
--with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions 
--enable-gnu-unique-object --enable-linker-build-id 
--with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin 
--enable-initfini-array --with-isl --enable-libmpx 
--enable-offload-targets=nvptx-none --without-cuda-driver 
--enable-gnu-indirect-function --with-tune=generic --with-arch_32=i686 
--build=x86_64-redhat-linux
Thread model: posix
gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)
[root@jouet linux]# clang -v
clang version 6.0.0 (http://llvm.org/git/clang.git 
56cc8f8880db2ebc433eeb6b6a707c101467a186) (http://llvm.org/git/llvm.git 
3656d83960a4f3fedf6d8f19043abf52379f78c3)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/bin
Found candidate GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Selected GCC installation: /usr/lib/gcc/x86_64-redhat-linux/7
Candidate multilib: .;@m64
Candidate multilib: 32;@m32
Selected multilib: .;@m64
[root@jouet linux]# perf test -v BPF
39: BPF filter:
39.1: Basic BPF filtering :
--- start ---
test child forked, pid 24304
Kernel build dir is set to /lib/modules/4.15.0-rc6/build
set env: KBUILD_DIR=/lib/modules/4.15.0-rc6/build
unset env: KBUILD_OPTS
include option is set to  -nostdinc -isystem 
/usr/lib/gcc/x86_64-redhat-linux/7/include 
-I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  
-I/home/acme/git/linux/include -I./include 
-I/home/acme/git/linux/arch/x86/include/uapi 
-I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi 
-I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40f00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem 
/usr/lib/gcc/x86_64-redhat-linux/7/include 
-I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated  
-I/home/acme/git/linux/include -I./include 
-I/home/acme/git/linux/arch/x86/include/uapi 
-I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi 
-I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: WORKING_DIR=/lib/modules/4.15.0-rc6/build
set env: CLANG_SOURCE=-
llvm compiling command template: echo '/*
  * bpf-script-example.c
  * Test basic LLVM building
  */
#ifndef LINUX_VERSION_CODE
# error Need LINUX_VERSION_CODE
# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" 
into llvm section of ~/.perfconfig'
#endif
#define BPF_ANY 0
#define BPF_MAP_TYPE_ARRAY 2
#define BPF_FUNC_map_lookup_elem 1
#define BPF_FUNC_map_update_elem 2

static void *(*bpf_map_lookup_elem)(void *map, void *key) =
(void *) BPF_FUNC_m

Re: Php-fpm will crash when perf runs with call graph option

2017-12-24 Thread Wangnan (F)

Have you tried updating your kernel to a recent version?


On 2017/12/25 14:58, ufo19890607 wrote:

From: yuzhoujian 

I use perf to analyze the performance overhead of the server.
There are several Docker containers on the server. The php-fpm in the
containers will crash as long as perf collects samples for all the CPUs
with the call-graph option (perf record -ag). Below is the stack trace
in the coredump.

#0  0x7f044ff447bd in re_compile_fastmap_iter (bufp=0x7f044ff447bd 
,
fastmap=0x46 , init_state=, init_state=) at regcomp.c:407
407   if (__wcrtomb (buf, towlower (cset->mbchars[i]), 
&state)
(gdb) bt
#0  0x7f044ff447bd in re_compile_fastmap_iter (bufp=0x7f044ff447bd 
,
fastmap=0x46 , init_state=, 
init_state=) at regcomp.c:407
#1  0x00831160 in virtual_file_ex (state=0x7fff9c1a4f70, path=, verify_path=0x0, use_realpath=1)
at /home/xiaoju/phpng/php-7.0.6/Zend/zend_virtual_cwd.c:1335
#2  0x007aacee in expand_filepath_with_mode (
filepath=0x7f044d6020d8 
"/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app",
real_path=0x7fff9c1a4fc0 "\360X\032\234\377\177", 
relative_to=, relative_to_len=0, realpath_mode=1)
at 
/home/xiaoju/phpng/php-7.0.6/main/fopen_wrappers.c:812
#3  0x007c1536 in _php_stream_fopen (
filename=0x7f044d6020d8 
"/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app", mode=0xdbb1f1 
"rb",
opened_path=0x0, options=0) at 
/home/xiaoju/phpng/php-7.0.6/main/streams/plain_wrapper.c:970
#4  0x007bd084 in _php_stream_open_wrapper_ex (
path=0x7f044d6020d8 
"/home/xiaoju/ep/as/store//toggles/beatles_api_discovery_is_open_by_app", mode=0xdbb1f1 
"rb", options=8,
opened_path=0x0, context=0x7f044d65f4c0) at 
/home/xiaoju/phpng/php-7.0.6/main/streams/streams.c:2060
#5  0x0071722b in zif_file_get_contents (execute_data=, 
return_value=0x7f044d615540)
at 
/home/xiaoju/phpng/php-7.0.6/ext/standard/file.c:544
#6  0x0065387c in phar_file_get_contents (execute_data=0x7f044d615570, 
return_value=0x7f044d615540)
at 
/home/xiaoju/phpng/php-7.0.6/ext/phar/func_interceptors.c:224))

I added some output to the php source code, and found that the
virtual_file_ex function's rbp value is really strange, e.g. 0x1, 0x31. I
guess that when perf collects samples for all the CPUs with the -g option,
it may corrupt php-fpm's stack. When perf runs without the -g option,
php-fpm is normal. Has anyone encountered similar problems?

BTW, OS in the server: Centos7.3  , Kernel version: 3.10.0-514.16.1.el7.x86_64. 
php-fpm version: 7.0.6
Processor info: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz





Re: [PATCH V2 2/8] perf tools: rewrite perf mmap read for overwrite

2017-12-18 Thread Wangnan (F)



On 2017/12/19 3:07, Liang, Kan wrote:

-union perf_event *perf_mmap__read_backward(struct perf_mmap *map)
+union perf_event *perf_mmap__read_backward(struct perf_mmap *map,
+  u64 *start, u64 end)
   {
-   u64 head, end;
-   u64 start = map->prev;
-
-   /*
-* Check if event was unmapped due to a POLLHUP/POLLERR.
-*/
-   if (!refcount_read(&map->refcnt))
-   return NULL;
-

Is removing this checking safe?

It was duplicate as the one in perf_mmap__read_catchup.
Once planned to remove one. But it looks I removed both accidently.
Will keep one in V3.


-   head = perf_mmap__read_head(map);
-   if (!head)
-   return NULL;
+   union perf_event *event = NULL;

-   /*
-* 'head' pointer starts from 0. Kernel minus sizeof(record) form
-* it each time when kernel writes to it, so in fact 'head' is
-* negative. 'end' pointer is made manually by adding the size of
-* the ring buffer to 'head' pointer, means the validate data can
-* read is the whole ring buffer. If 'end' is positive, the ring
-* buffer has not fully filled, so we must adjust 'end' to 0.
-*
-* However, since both 'head' and 'end' is unsigned, we can't
-* simply compare 'end' against 0. Here we compare '-head' and
-* the size of the ring buffer, where -head is the number of bytes
-* kernel write to the ring buffer.
-*/
-   if (-head < (u64)(map->mask + 1))
-   end = 0;
-   else
-   end = head + map->mask + 1;
+   event = perf_mmap__read(map, *start, end, &map->prev);
+   *start = map->prev;

-   return perf_mmap__read(map, start, end, &map->prev);
+   return event;
   }

perf_mmap__read_backward() and perf_mmap__read_forward() become
asymmetrical,
but their names are symmetrical.



-void perf_mmap__read_catchup(struct perf_mmap *map)
+static int overwrite_rb_find_range(void *buf, int mask, u64 head,
+  u64 *start, u64 *end);
+
+int perf_mmap__read_catchup(struct perf_mmap *map,
+   bool overwrite,
+   u64 *start, u64 *end,
+   unsigned long *size)
   {
-   u64 head;
+   u64 head = perf_mmap__read_head(map);
+   u64 old = map->prev;
+   unsigned char *data = map->base + page_size;

-   if (!refcount_read(&map->refcnt))
-   return;
+   *start = overwrite ? head : old;
+   *end = overwrite ? old : head;

-   head = perf_mmap__read_head(map);
-   map->prev = head;
+   if (*start == *end)
+   return 0;
+
+   *size = *end - *start;
+   if (*size > (unsigned long)(map->mask) + 1) {
+   if (!overwrite) {
+   WARN_ONCE(1, "failed to keep up with mmap data.

(warn only once)\n");

+
+   map->prev = head;
+   perf_mmap__consume(map, overwrite);
+   return 0;
+   }
+
+   /*
+* Backward ring buffer is full. We still have a chance to read
+* most of data from it.
+*/
+   if (overwrite_rb_find_range(data, map->mask, head, start,

end))

+   return -1;
+   }
+
+   return 1;

Coding suggestion: this function returns 2 different values (1 and -1) on
two failure paths. It should return 0 for success and -1 for failure.


For perf top, -1 is enough for failure.
But for perf_mmap__push, it looks like two failure paths are needed, doesn't it?
  


I see. I think this is not a good practice. Please try to avoid returning
3 states; for example, by comparing start and end outside? If it can't be
avoided, how about returning an enum to tell the reader about the
difference?


You totally redefined perf_mmap__read_catchup(). The original meaning of
this function is to adjust md->prev so that a following perf_mmap__read()
can read events one by one safely. Only the backward ring buffer needs
catchup: the kernel's 'head' pointer keeps moving (backward), while
md->prev stays unchanged. For the forward ring buffer, md->prev catches up
each time we read from the ring buffer.
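For reference, the 'end' arithmetic in the comment the patch removes can be
modeled standalone. This is a sketch of that arithmetic only, not the perf
code itself: in a backward ring buffer the kernel decrements 'head' from 0,
so as an unsigned value (0 - head) is the number of bytes written.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the removed comment's logic: if the buffer has not yet
 * wrapped, valid data ends at offset 0; once it is full, exactly one
 * whole buffer (size bytes ending at head + size) is valid. */
static uint64_t backward_end(uint64_t head, uint64_t size)
{
	if (0 - head < size)	/* fewer than 'size' bytes written so far */
		return 0;
	return head + size;	/* full: one whole buffer of valid data */
}
```

The unsigned-wraparound comparison is why the original code compares -head
against the buffer size instead of comparing 'end' against 0.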

Your patch totally changes its meaning. Now its meaning is finding the
available data in a ring buffer (reporting start and end). At least we
need a better name.

Sure, I will introduce a new function to do it in V3.


In addition, if I understand your code correctly, we don't need to report
'size' to the caller, because the size is determined by start and end.

Sure, I will remove the 'size' in V3.


In addition, since the concepts of backward and overwrite are now
unified, we can avoid updating map->prev during one-by-one reading
(because backward reading doesn't need the consume operation). I think we
can decouple the updating of map->prev from these two basic read
functions. This makes the operations on map->prev clearer. Now the moving
Re: [PATCH V2 6/8] perf top: add overwrite fall back

2017-12-18 Thread Wangnan (F)



On 2017/12/7 7:33, kan.li...@intel.com wrote:

From: Kan Liang 

Switch to non-overwrite mode if the kernel does not support the overwrite
ring buffer.

It only takes effect when overwrite mode is supported.
No change to current behavior.

Signed-off-by: Kan Liang 
---
  tools/perf/builtin-top.c | 35 +++
  1 file changed, 35 insertions(+)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 5e15d27..7c462d6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -931,6 +931,27 @@ static int perf_top_overwrite_check(struct perf_top *top)
return 0;
  }
  
+static int perf_top_overwrite_fallback(struct perf_top *top,

+  struct perf_evsel *evsel)
+{
+   struct record_opts *opts = &top->record_opts;
+   struct perf_evlist *evlist = top->evlist;
+   struct perf_evsel *counter;
+
+   if (!opts->overwrite)
+   return 0;
+
+   /* only fall back when first event fails */
+   if (evsel != perf_evlist__first(evlist))
+   return 0;
+
+   evlist__for_each_entry(evlist, counter)
+   counter->attr.write_backward = false;
+   opts->overwrite = false;
+   ui__warning("fall back to non-overwrite mode\n");
+   return 1;


You can check perf_missing_features.write_backward here. Only do the
fallback when we know the problem is caused by the missing write_backward
feature.


+}
+
  static int perf_top__start_counters(struct perf_top *top)
  {
char msg[BUFSIZ];
@@ -954,6 +975,20 @@ static int perf_top__start_counters(struct perf_top *top)
  try_again:
if (perf_evsel__open(counter, top->evlist->cpus,
 top->evlist->threads) < 0) {
+
+   /*
+* Specially handle overwrite fall back.
+* Because perf top is the only tool which has
+* overwrite mode by default, support
+* both overwrite and non-overwrite mode, and
+* require consistent mode for all events.
+*
+* May move it to generic code with more tools
+* have similar attribute.
+*/
+   if (perf_top_overwrite_fallback(top, counter))
+   goto try_again;
+


It will unconditionally remove the backward bit even if the problem
is caused by some other missing feature.

Thank you.
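The guarded fallback suggested in this review can be sketched as follows.
The names mirror the patch but this is an illustrative model, not the real
perf-top code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: tracks which features the kernel rejected at open time. */
struct missing_features { bool write_backward; };

/* Fall back to non-overwrite mode only when the first event failed
 * specifically because write_backward is unsupported. */
static bool perf_top_should_fallback(bool overwrite, bool is_first_evsel,
				     const struct missing_features *missing)
{
	if (!overwrite || !is_first_evsel)
		return false;
	return missing->write_backward;	/* the extra condition */
}
```

With the extra condition, a failure caused by some other missing feature no
longer clears the backward bit as a side effect.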



Re: [PATCH V2 2/8] perf tools: rewrite perf mmap read for overwrite

2017-12-18 Thread Wangnan (F)



On 2017/12/7 7:32, kan.li...@intel.com wrote:

From: Kan Liang 

perf_mmap__read_catchup and perf_mmap__read_backward are used to read
events one by one from the ring buffer under overwrite mode.
They always read the stale buffer which has already been processed,
because the previous location in the processed ring buffer is discarded.
Introduce perf_mmap__read_done, which updates map->prev to
indicate the position of the processed buffer.
Refine perf_mmap__read_catchup to calculate the start and the end of the
ring buffer. It only needs to do the calculation once at the beginning,
because the ring buffer is paused in overwrite mode, so the calculation
isn't needed in perf_mmap__read_backward.
For now, the only user is backward-ring-buffer in perf test.

Signed-off-by: Kan Liang 
---
  tools/perf/tests/backward-ring-buffer.c |  9 +++-
  tools/perf/util/mmap.c  | 88 +++--
  tools/perf/util/mmap.h  |  7 ++-
  3 files changed, 63 insertions(+), 41 deletions(-)

diff --git a/tools/perf/tests/backward-ring-buffer.c 
b/tools/perf/tests/backward-ring-buffer.c
index 4035d43..66d9e54 100644
--- a/tools/perf/tests/backward-ring-buffer.c
+++ b/tools/perf/tests/backward-ring-buffer.c
@@ -31,10 +31,14 @@ static int count_samples(struct perf_evlist *evlist, int 
*sample_count,
int i;
  
  	for (i = 0; i < evlist->nr_mmaps; i++) {

+   struct perf_mmap *map = &evlist->overwrite_mmap[i];
union perf_event *event;
+   unsigned long size;
+   u64 start, end;
  
-		perf_mmap__read_catchup(&evlist->overwrite_mmap[i]);

-   while ((event = 
perf_mmap__read_backward(&evlist->overwrite_mmap[i])) != NULL) {
+   perf_mmap__read_catchup(map, true, &start, &end, &size);
+   while ((event = perf_mmap__read_backward(map, &start, end)) !=
+  NULL) {
const u32 type = event->header.type;
  
  			switch (type) {

@@ -49,6 +53,7 @@ static int count_samples(struct perf_evlist *evlist, int 
*sample_count,
return TEST_FAIL;
}
}
+   perf_mmap__read_done(map);
}
return TEST_OK;
  }
diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 05076e6..bf67460 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -85,51 +85,65 @@ union perf_event *perf_mmap__read_forward(struct perf_mmap 
*map)
return perf_mmap__read(map, old, head, &map->prev);
  }
  
-union perf_event *perf_mmap__read_backward(struct perf_mmap *map)

+union perf_event *perf_mmap__read_backward(struct perf_mmap *map,
+  u64 *start, u64 end)
  {
-   u64 head, end;
-   u64 start = map->prev;
-
-   /*
-* Check if event was unmapped due to a POLLHUP/POLLERR.
-*/
-   if (!refcount_read(&map->refcnt))
-   return NULL;
-


Is removing this checking safe?


-   head = perf_mmap__read_head(map);
-   if (!head)
-   return NULL;
+   union perf_event *event = NULL;
  
-	/*

-* 'head' pointer starts from 0. Kernel minus sizeof(record) form
-* it each time when kernel writes to it, so in fact 'head' is
-* negative. 'end' pointer is made manually by adding the size of
-* the ring buffer to 'head' pointer, means the validate data can
-* read is the whole ring buffer. If 'end' is positive, the ring
-* buffer has not fully filled, so we must adjust 'end' to 0.
-*
-* However, since both 'head' and 'end' is unsigned, we can't
-* simply compare 'end' against 0. Here we compare '-head' and
-* the size of the ring buffer, where -head is the number of bytes
-* kernel write to the ring buffer.
-*/
-   if (-head < (u64)(map->mask + 1))
-   end = 0;
-   else
-   end = head + map->mask + 1;
+   event = perf_mmap__read(map, *start, end, &map->prev);
+   *start = map->prev;
  
-	return perf_mmap__read(map, start, end, &map->prev);

+   return event;
  }


perf_mmap__read_backward() and perf_mmap__read_forward() become 
asymmetrical,

but their names are symmetrical.


  
-void perf_mmap__read_catchup(struct perf_mmap *map)

+static int overwrite_rb_find_range(void *buf, int mask, u64 head,
+  u64 *start, u64 *end);
+
+int perf_mmap__read_catchup(struct perf_mmap *map,
+   bool overwrite,
+   u64 *start, u64 *end,
+   unsigned long *size)
  {
-   u64 head;
+   u64 head = perf_mmap__read_head(map);
+   u64 old = map->prev;
+   unsigned char *data = map->base + page_size;
  
-	if (!refcount_read(&map->refcnt))

-   return;
+   *start = overwrite ? head : old;
+   *end = overwrite ? old : head;
  
-	head = perf_mmap_

Re: [PATCH V2 1/8] perf tools: remove stale perf evlist mmap read for backward

2017-12-17 Thread Wangnan (F)



On 2017/12/7 7:32, kan.li...@intel.com wrote:

From: Kan Liang 

perf_evlist__mmap_read_catchup and perf_evlist__mmap_read_backward are
only for overwrite mode.
But they read the evlist->mmap buffer which is for non-overwrite mode.

It has not caused any serious problem yet, because no one uses
it.

Remove the unused interfaces.

Signed-off-by: Kan Liang 


Acked-by: Wang Nan 


---
  tools/perf/util/evlist.c | 17 -
  tools/perf/util/evlist.h |  4 
  2 files changed, 21 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 3570355..4b6a06d 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -714,28 +714,11 @@ union perf_event *perf_evlist__mmap_read_forward(struct 
perf_evlist *evlist, int
return perf_mmap__read_forward(md);
  }
  
-union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)

-{
-   struct perf_mmap *md = &evlist->mmap[idx];
-
-   /*
-* No need to check messup for backward ring buffer:
-* We can always read arbitrary long data from a backward
-* ring buffer unless we forget to pause it before reading.
-*/
-   return perf_mmap__read_backward(md);
-}
-
  union perf_event *perf_evlist__mmap_read(struct perf_evlist *evlist, int idx)
  {
return perf_evlist__mmap_read_forward(evlist, idx);
  }
  
-void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx)

-{
-   perf_mmap__read_catchup(&evlist->mmap[idx]);
-}
-
  void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx)
  {
perf_mmap__consume(&evlist->mmap[idx], false);
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 7516066..a80fd47 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -132,10 +132,6 @@ union perf_event *perf_evlist__mmap_read(struct 
perf_evlist *evlist, int idx);
  
  union perf_event *perf_evlist__mmap_read_forward(struct perf_evlist *evlist,

 int idx);
-union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist,
- int idx);
-void perf_evlist__mmap_read_catchup(struct perf_evlist *evlist, int idx);
-
  void perf_evlist__mmap_consume(struct perf_evlist *evlist, int idx);
  
  int perf_evlist__open(struct perf_evlist *evlist);





Re: [PATCH 4/5] perf top: switch to overwrite mode

2017-12-05 Thread Wangnan (F)



On 2017/12/6 6:39, kan.li...@intel.com wrote:

From: Kan Liang 

perf_top__mmap_read has a severe performance issue on
Knights Landing/Mill when monitoring a heavily loaded system. It can take
several minutes to finish, which is unacceptable.

Currently, perf top uses non-overwrite mode. In non-overwrite mode, it
tries to read everything in the ring buffer and doesn't pause the
ring buffer. Once lots of samples are delivered persistently,
the processing time can become very long. Also, the latest samples can
be lost when the ring buffer is full.

For overwrite mode, it takes a snapshot of the system by pausing the
ring buffer, which can significantly reduce the processing time.
Also, overwrite mode always keeps the latest samples.
Considering the real-time requirement of perf top, overwrite mode
is more suitable for it.

Actually, perf top used to be in overwrite mode. It was changed to
non-overwrite mode by commit 93fc64f14472 ("perf top: Switch to non
overwrite mode"). It's better to change it back to overwrite mode.

Some records would be lost in overwrite mode because of pausing the
ring buffer. That has little impact on the accuracy of the snapshot
and can be tolerated.
The lost-events check is removed.

Unconditionally wait 100 ms before each snapshot. This also reduces the
overhead caused by pausing the ring buffer, especially on lightly loaded systems.

Signed-off-by: Kan Liang 
---
  tools/perf/builtin-top.c   | 30 ++
  tools/perf/ui/browsers/hists.c | 12 +---
  2 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 540461f..721d786 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -283,16 +283,6 @@ static void perf_top__print_sym_table(struct perf_top *top)
  
  	printf("%-*.*s\n", win_width, win_width, graph_dotted_line);
  
-	if (hists->stats.nr_lost_warned !=

-   hists->stats.nr_events[PERF_RECORD_LOST]) {
-   hists->stats.nr_lost_warned =
- hists->stats.nr_events[PERF_RECORD_LOST];
-   color_fprintf(stdout, PERF_COLOR_RED,
- "WARNING: LOST %d chunks, Check IO/CPU overload",
- hists->stats.nr_lost_warned);
-   ++printed;
-   }
-
if (top->sym_filter_entry) {
perf_top__show_details(top);
return;
@@ -807,14 +797,19 @@ static void perf_event__process_sample(struct perf_tool 
*tool,
  
  static void perf_top__mmap_read_idx(struct perf_top *top, int idx)

  {
+   struct perf_mmap *md = &top->evlist->overwrite_mmap[idx];
struct perf_sample sample;
struct perf_evsel *evsel;
struct perf_session *session = top->session;
union perf_event *event;
struct machine *machine;
+   unsigned long size;
+   u64 end, start;
int ret;
  
-	while ((event = perf_evlist__mmap_read(top->evlist, idx)) != NULL) {

+   perf_mmap__read_catchup(md, true, &start, &end, &size);
+
+   while ((event = perf_mmap__read_backward(md, &start, end)) != NULL) {
ret = perf_evlist__parse_sample(top->evlist, event, &sample);
if (ret) {
pr_err("Can't parse sample, err = %d\n", ret);
@@ -869,16 +864,21 @@ static void perf_top__mmap_read_idx(struct perf_top *top, 
int idx)
} else
++session->evlist->stats.nr_unknown_events;
  next_event:
-   perf_evlist__mmap_consume(top->evlist, idx);
+   perf_mmap__consume(md, true);
}
+
+   perf_mmap__read_done(md);
  }
  
  static void perf_top__mmap_read(struct perf_top *top)

  {
int i;
  
+	perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_DATA_PENDING);

for (i = 0; i < top->evlist->nr_mmaps; i++)
perf_top__mmap_read_idx(top, i);
+   perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_EMPTY);
+   perf_evlist__toggle_bkw_mmap(top->evlist, BKW_MMAP_RUNNING);
  }
  
  static int perf_top__start_counters(struct perf_top *top)

@@ -1029,12 +1029,9 @@ static int __cmd_top(struct perf_top *top)
}
  
  	while (!done) {

-   u64 hits = top->samples;
-
perf_top__mmap_read(top);
  
-		if (hits == top->samples)

-   ret = perf_evlist__poll(top->evlist, 100);
+   ret = perf_evlist__poll(top->evlist, 100);
  
  		if (resize) {

perf_top__resize(top);
@@ -1127,6 +1124,7 @@ int cmd_top(int argc, const char **argv)
.uses_mmap   = true,
},
.proc_map_timeout= 500,
+   .overwrite  = 1,


This breaks old kernels:

# ./perf top
Error:
Reading from overwrite event is not supported by this kernel.
# uname -r
3.0.13-0.27-default

We need a way to fall back to normal ring buffer when kernel
doesn'

Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-06 Thread Wangnan (F)



On 2017/11/6 19:57, Michal Hocko wrote:

On Mon 06-11-17 19:03:34, Wangnan (F) wrote:


On 2017/11/6 18:40, Michal Hocko wrote:

On Mon 06-11-17 17:59:54, Wangnan (F) wrote:

On 2017/11/6 16:52, Michal Hocko wrote:

On Mon 06-11-17 15:04:40, Bob Liu wrote:

On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan  wrote:

tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

 commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").


CC'ed Will Deacon.


This leaks TLB entries. For example, when the oom_reaper
selects a task and reaps its virtual memory space, another thread
in the task group may still be running on another core and can access
the already-freed memory through stale TLB entries.

No threads should be running in userspace by the time the reaper gets to
unmap their address space. So the only potential case is they are
accessing the user memory from the kernel when we should fault and we
have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing
real?


This patch gathers each vma instead of the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

I do not have any objections to do per vma tlb flushing because it would
free gathered pages sooner but I am not sure I see any real problem
here. Have you seen any real issues or this is more of a review driven
fix?

We saw the problem when we tried to reuse the oom reaper's code in
another situation. In our situation, we allow reaping a task
before all other tasks in its task group finish their exit
procedure.

I'd like to know what ensures "No threads should be running in
userspace by the time the reaper"?

All tasks are killed by the time. So they should be taken out to the
kernel.

Sorry, I read oom_kill_process() but am still unable to understand
why all tasks are killed.

oom_kill_process() kills the victim by sending SIGKILL. It is
broadcast to all tasks in its task group, but asynchronously.
In the following case, a race can happen (Thread1 is in Task1's task group):

core 1                      core 2
Thread1 running             oom_kill_process() selects Task1 as victim
                            oom_kill_process() sends SIGKILL to Task1
                            oom_kill_process() sends SIGKILL to Thread1
                            oom_kill_process() wakes up oom reaper
                            switch to oom_reaper
                            __oom_reap_task_mm
                            tlb_gather_mmu
                            unmap_page_range, reap Task1
                            tlb_finish_mmu
Write page
be kicked off from core
Receives SIGKILL

So what makes Thread1 get kicked off core 1 before core 2
starts unmapping?

complete_signal should call signal_wake_up on all threads because this
is a group-fatal signal, and that should send an IPI to all of the cpus
they run on, too. Even if we do not wait for the IPI to complete, the race
window should be only a few instructions, while it takes quite some time to
hand over to the oom reaper.


If complete_signal is the mechanism we rely on to ensure
all threads have exited, then I'm sure it is not enough. As
you said, we still have a small race window. On some platforms,
an IPI from one core to another takes a little bit longer
than you may expect, and the core that receives the IPI may be running
at a very low frequency.

In our situation, we put the reaper code in do_exit after receiving
SIGKILL, and observed TLB entries leaking. Since this is a SIGKILL,
complete_signal should have been executed. So I think the oom_reaper
has a similar problem.

Thank you.



Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-06 Thread Wangnan (F)



On 2017/11/6 18:40, Michal Hocko wrote:

On Mon 06-11-17 17:59:54, Wangnan (F) wrote:


On 2017/11/6 16:52, Michal Hocko wrote:

On Mon 06-11-17 15:04:40, Bob Liu wrote:

On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan  wrote:

tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").


CC'ed Will Deacon.


This leaks TLB entries. For example, when the oom_reaper
selects a task and reaps its virtual memory space, another thread
in the task group may still be running on another core and can access
the already-freed memory through stale TLB entries.

No threads should be running in userspace by the time the reaper gets to
unmap their address space. So the only potential case is they are
accessing the user memory from the kernel when we should fault and we
have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing
real?


This patch gathers each vma instead of the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

I do not have any objections to do per vma tlb flushing because it would
free gathered pages sooner but I am not sure I see any real problem
here. Have you seen any real issues or this is more of a review driven
fix?

We saw the problem when we tried to reuse the oom reaper's code in
another situation. In our situation, we allow reaping a task
before all other tasks in its task group finish their exit
procedure.

I'd like to know what ensures "No threads should be running in
userspace by the time the reaper"?

All tasks are killed by the time. So they should be taken out to the
kernel.


Sorry, I read oom_kill_process() but am still unable to understand
why all tasks are killed.

oom_kill_process() kills the victim by sending SIGKILL. It is
broadcast to all tasks in its task group, but asynchronously.
In the following case, a race can happen (Thread1 is in Task1's task group):

core 1                      core 2
Thread1 running             oom_kill_process() selects Task1 as victim
                            oom_kill_process() sends SIGKILL to Task1
                            oom_kill_process() sends SIGKILL to Thread1
                            oom_kill_process() wakes up oom reaper
                            switch to oom_reaper
                            __oom_reap_task_mm
                            tlb_gather_mmu
                            unmap_page_range, reap Task1
                            tlb_finish_mmu
Write page
be kicked off from core
Receives SIGKILL

So what makes Thread1 get kicked off core 1 before core 2
starts unmapping?

Thank you.



Re: [RFC PATCH] mm, oom_reaper: gather each vma to prevent leaking TLB entry

2017-11-06 Thread Wangnan (F)



On 2017/11/6 16:52, Michal Hocko wrote:

On Mon 06-11-17 15:04:40, Bob Liu wrote:

On Mon, Nov 6, 2017 at 11:36 AM, Wang Nan  wrote:

tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
space. In this case, tlb->fullmm is true. Some archs like arm64 don't
flush the TLB when tlb->fullmm is true:

   commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").


CC'ed Will Deacon.


This leaks TLB entries. For example, when the oom_reaper
selects a task and reaps its virtual memory space, another thread
in the task group may still be running on another core and can access
the already-freed memory through stale TLB entries.

No threads should be running in userspace by the time the reaper gets to
unmap their address space. So the only potential case is they are
accessing the user memory from the kernel when we should fault and we
have MMF_UNSTABLE to cause a SIGBUS. So is the race you are describing
real?


This patch gathers each vma instead of the full vm space, so
tlb->fullmm is not true. The behavior of the oom reaper becomes similar
to munmapping before do_exit, which should be safe for all archs.

I do not have any objections to do per vma tlb flushing because it would
free gathered pages sooner but I am not sure I see any real problem
here. Have you seen any real issues or this is more of a review driven
fix?


We saw the problem when we tried to reuse the oom reaper's code in
another situation. In our situation, we allow reaping a task
before all other tasks in its task group finish their exit
procedure.

I'd like to know what ensures "No threads should be running in
userspace by the time the reaper"?

Thank you.


Signed-off-by: Wang Nan 
Cc: Bob Liu 
Cc: Michal Hocko 
Cc: Andrew Morton 
Cc: Michal Hocko 
Cc: David Rientjes 
Cc: Ingo Molnar 
Cc: Roman Gushchin 
Cc: Konstantin Khlebnikov 
Cc: Andrea Arcangeli 
---
  mm/oom_kill.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dee0f75..18c5b35 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -532,7 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
  */
 set_bit(MMF_UNSTABLE, &mm->flags);

-   tlb_gather_mmu(&tlb, mm, 0, -1);
 for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 if (!can_madv_dontneed_vma(vma))
 continue;
@@ -547,11 +546,13 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, 
struct mm_struct *mm)
  * we do not want to block exit_mmap by keeping mm ref
  * count elevated without a good reason.
  */
-   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+   if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+   tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
 unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
  NULL);
+   tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
+   }
 }
-   tlb_finish_mmu(&tlb, 0, -1);
 pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, 
file-rss:%lukB, shmem-rss:%lukB\n",
 task_pid_nr(tsk), tsk->comm,
 K(get_mm_counter(mm, MM_ANONPAGES)),

--





Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 21:57, Liang, Kan wrote:

On 2017/11/1 20:00, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote:

On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

while True:
print 123

send SIGUSR2 to perf to capture snapshot.)

[SNIP]


Signed-off-by: Wang Nan 
---
tools/perf/util/evlist.c | 8 +++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist

*evlist __maybe_unused,

}
static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int

idx,

-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int

*_output_backward)

{
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp;
evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap;
+   struct mmap_params rdonly_mp;
int *output = _output;
int fd;
int cpu;
+   mp = _mp;
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   rdonly_mp = *_mp;
+   rdonly_mp.prot &= ~PROT_WRITE;
+   mp = &rdonly_mp;
if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct

perf_evlist *evlist, int idx,

   if (*output == -1) {
   *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean following case?

perf record --overwrite -e 'cycles/no-overwrite/'


No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concept of 'overwrite' and 'backward' is ambiguous.


Yes, I think we should make it clear.

As we discussed previously, there are four possible combinations
to access ring buffer , 'forward non-overwrite', 'forward overwrite',
'backward non-overwrite' and 'backward overwrite'.

Actually, not all of the combinations are necessary.
- 'forward overwrite' mode brings some problems which were mentioned
   in commit ID 9ecda41acb97 ("perf/core: Add ::write_backward attribute
   to perf event").
- 'backward non-overwrite' mode is very similar to 'forward non-overwrite';
there is no extra advantage, and keeping one non-overwrite mode is enough.
So 'forward non-overwrite' and 'backward overwrite' are enough for all perf
tools.

Furthermore, 'forward' and 'backward' only indicate the direction of the
ring buffer. They don't impact the result or performance, so they are not
as important as the overwrite/non-overwrite distinction.

To simplify the concept, only 'non-overwrite' mode and 'overwrite' mode should
be kept. 'non-overwrite' mode indicates the forward ring buffer. 'overwrite' 
mode
indicates the backward ring buffer.


perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is set write_backward. Some testcases need an overwrite evlist.


There are only four test cases which set overwrite, sw-clock,task-exit,
mmap-basic, backward-ring-buffer.
Only backward-ring-buffer is 'backward overwrite'.
The other three are all 'forward overwrite'. We just need to set
write_backward to convert them to 'backward overwrite'.
I think it's not hard to clean up.


If we add a new rule that overwrite ring buffers are always backward,
then it is not hard to clean up. However, support for forward
overwrite ring buffers has a long history and the code was not written
by me. I'd like to check whether there is some reason to keep supporting
this configuration.

Thank you.



Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 23:04, Liang, Kan wrote:

On 2017/11/1 22:22, Liang, Kan wrote:

On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite"
in source code are not clear. In perf's code, the 'overwrite' has 2

meanings:

1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but has an overwrite
option; its real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only supports 1.
It should be a bug, and needs to be fixed.

Not a bug, but ambiguous.

Perf record doesn't need an overwrite main channel (we have two channels:
evlist->mmap is the main channel and evlist->backward_mmap is the backward
channel),
but some testcases require it, and your new patchset may require it.
'perf record --overwrite' doesn't set the main channel overwrite. What it
does is move all evsels to the backward channel, and we can move some
evsels back to the main channel with the /no-overwrite/ setting. This
behavior is hard to understand.


In my understanding, the 'main channel' should depend on what the user sets.
If --overwrite is applied, then evlist->backward_mmap should be the
'main channel'. evlist->overwrite should be set to true as well.

Then it introduces a main-channel switching mechanism, and we need
to check evlist->overwrite and another factor to determine which
one is the main channel. That makes things more complex.

We should check the evlist->overwrite.
Now, all perf tools force evlist->overwrite = false. I think it doesn’t make 
sense.

What is another factor?


If we support mixed channel as well as forward overwrite ring buffer,
evlist->overwrite is not enough.


I don't think it will be too complex.

In perf_evlist__mmap_ex, we just need to add a check.
If (!overwrite)
evlist->mmap = perf_evlist__alloc_mmap(evlist);
else
evlist->backward_mmap = perf_evlist__alloc_mmap(evlist);

In perf_evlist__mmap_per_evsel, we already handle per-event overwrite.
It just needs some similar code to handle per-event non-overwrite.


Then the logic becomes:

 if (check write_backward) {
maps = evlist->backward_mmap;
if (!maps) {
  maps = perf_evlist__alloc_mmap(...);
  if (!maps) {
  /* error processing */
  }
  evlist->backward_mmap = maps;
  if (evlist->bkw_mmap_state == BKW_MMAP_NOTREADY)
perf_evlist__toggle_bkw_mmap(evlist, BKW_MMAP_RUNNING);
}
 } else {
maps = evlist->mmap;
if (!maps) {
  maps = perf_evlist__alloc_mmap(...);
  if (!maps) {
  /* error processing */
  }
  evlist->mmap = maps;
  
}
 }


Other code should already check evlist->mmap and
evlist->backward_mmap,
so it probably doesn't need to change.


Thanks,
Kan







Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 22:22, Liang, Kan wrote:

On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite"
in source code are not clear. In perf's code, the 'overwrite' has 2 meanings:
   1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
   2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but has an overwrite
option; its real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only supports 1.
It should be a bug, and needs to be fixed.

Not a bug, but ambiguous.

Perf record doesn't need an overwrite main channel (we have two channels:
evlist->mmap is the main channel and evlist->backward_mmap is the backward
channel),
but some testcases require it, and your new patchset may require it.
'perf record --overwrite' doesn't set the main channel overwrite. What it
does is move all evsels to the backward channel, and we can move some
evsels back to the main channel with the /no-overwrite/ setting. This
behavior is hard to understand.


In my understanding, the 'main channel' should depend on what the user sets.
If --overwrite is applied, then evlist->backward_mmap should be the
'main channel'. evlist->overwrite should be set to true as well.


Then it introduces a main-channel switching mechanism, and we need
to check evlist->overwrite and another factor to determine which
one is the main channel. That makes things more complex.


The /no-overwrite/ setting is a per-event setting.
Only after the global setting is finished is the per-event setting
considered. You may refer to apply_config_terms.


Yes.


Thanks,
Kan









Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 21:26, Liang, Kan wrote:

The meaning of perf record's "overwrite" option and many "overwrite" in
source code are not clear. In perf's code, the 'overwrite' has 2 meanings:
  1. Make ringbuffer readonly (perf_evlist__mmap_ex's argument).
  2. Set evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but has an overwrite option; its
real meaning is setting backward.


I don't understand here.
'overwrite' has 2 meanings. perf record only supports 1.
It should be a bug, and needs to be fixed.


Not a bug, but ambiguous.

Perf record doesn't need an overwrite main channel (we have two channels:
evlist->mmap is the main channel and evlist->backward_mmap is the backward
channel),
but some testcases require it, and your new patchset may require it.
'perf record --overwrite' doesn't set the main channel overwrite. What it
does is move all evsels to the backward channel, and we can move some
evsels back to the main channel with the /no-overwrite/ setting. This
behavior is hard to understand.

Thank you.



Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 20:39, Jiri Olsa wrote:

On Wed, Nov 01, 2017 at 08:10:49PM +0800, Wangnan (F) wrote:

SNIP


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int idx,
   if (*output == -1) {
   *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean following case?

perf record --overwrite -e 'cycles/no-overwrite/'


No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concept of 'overwrite' and 'backward' is ambiguous.

perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is set write_backward. Some testcases need an overwrite evlist.

did not check deeply, but so why can't we do the below?

jirka


---
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f4d9fc54b382..4643c487bd29 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -300,7 +300,7 @@ static int record__mmap_evlist(struct record *rec,
struct record_opts *opts = &rec->opts;
char msg[512];
  
-	if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, false,

+   if (perf_evlist__mmap_ex(evlist, opts->mmap_pages, opts->overwrite,
 opts->auxtrace_mmap_pages,
 opts->auxtrace_snapshot_mode) < 0) {


perf_evlist__mmap_ex's overwrite argument is used to control evlist->mmap.

Consider Namhyung's example:

  perf record --overwrite -e 'cycles/no-overwrite/'

In this case both evlist->mmap and evlist->backward_mmap would be set to 
overwrite.
'cycles' will be put into evlist->mmap, so it will be set to overwrite 
incorrectly.

Patch 2/2 clarifies these concepts. What we want is flight recorder mode,
not just a read-only ring buffer. Even when flight recorder mode is selected,
we can still have a normal ring buffer that keeps outputting data.

Thank you.




Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 20:00, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 06:32:50PM +0800, Wangnan (F) wrote:


On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

   while True:
   print 123

send SIGUSR2 to perf to capture snapshot.)

[SNIP]


Signed-off-by: Wang Nan 
---
   tools/perf/util/evlist.c | 8 +++-
   1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist 
__maybe_unused,
   }
   static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int 
*_output_backward)
   {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp;
evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap;
+   struct mmap_params rdonly_mp;
int *output = _output;
int fd;
int cpu;
+   mp = _mp;
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   rdonly_mp = *_mp;
+   rdonly_mp.prot &= ~PROT_WRITE;
+   mp = &rdonly_mp;
if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int idx,
  if (*output == -1) {
  *output = fd;
+   if (evsel->attr.write_backward)
+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+

If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is
not set. If you want to delay the setting of mp->prot, you need to consider
both evlist->overwrite and evsel->attr.write_backward.

I thought evsel->attr.write_backward should be set when
evlist->overwrite is set.  Do you mean following case?

   perf record --overwrite -e 'cycles/no-overwrite/'



No. evlist->overwrite is unrelated to '--overwrite'. This is why I
said the concept of 'overwrite' and 'backward' is ambiguous.

perf record never sets 'evlist->overwrite'. What '--overwrite' actually
does is set write_backward. Some testcases need an overwrite evlist.

Thank you.



Re: [PATCH 1/2] perf mmap: Fix perf backward recording

2017-11-01 Thread Wangnan (F)



On 2017/11/1 17:49, Namhyung Kim wrote:

Hi,

On Wed, Nov 01, 2017 at 05:53:26AM +, Wang Nan wrote:

perf record backward recording doesn't work as we expected: it never
overwrites when the ring buffer is full.

Test:

(Run a busy printing python task background like this:

  while True:
  print 123

send SIGUSR2 to perf to capture snapshot.)


[SNIP]



Signed-off-by: Wang Nan 
---
  tools/perf/util/evlist.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..4c5daba 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,22 +799,28 @@ perf_evlist__should_poll(struct perf_evlist *evlist 
__maybe_unused,
  }
  
  static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,

-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int 
*_output_backward)
  {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp;
  
  	evlist__for_each_entry(evlist, evsel) {

struct perf_mmap *maps = evlist->mmap;
+   struct mmap_params rdonly_mp;
int *output = _output;
int fd;
int cpu;
  
+		mp = _mp;

if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   rdonly_mp = *_mp;
+   rdonly_mp.prot &= ~PROT_WRITE;
+   mp = &rdonly_mp;
  
  			if (!maps) {

maps = perf_evlist__alloc_mmap(evlist);
--

What about this instead (not tested)?


diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e154a6..27ebe355e794 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -838,6 +838,11 @@ static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int idx,
 if (*output == -1) {
 *output = fd;
  
+   if (evsel->attr.write_backward)

+   mp->prot = PROT_READ;
+   else
+   mp->prot = PROT_READ | PROT_WRITE;
+


If evlist->overwrite is true, PROT_WRITE should be unset even if
write_backward is not set. If you want to delay the setting of mp->prot,
you need to consider both evlist->overwrite and evsel->attr.write_backward.

Then why not remove mp->prot entirely and add a prot argument to
perf_mmap__mmap, since the prot setting is always decided independently?


 if (perf_mmap__mmap(&maps[idx], mp, *output)  < 0)
 return -1;
 } else {
@@ -1058,9 +1063,7 @@ int perf_evlist__mmap_ex(struct perf_evlist *evlist, 
unsigned int pages,
 struct perf_evsel *evsel;
 const struct cpu_map *cpus = evlist->cpus;
 const struct thread_map *threads = evlist->threads;
-   struct mmap_params mp = {
-   .prot = PROT_READ | (overwrite ? 0 : PROT_WRITE),
-   };
+   struct mmap_params mp;
  
 if (!evlist->mmap)

 evlist->mmap = perf_evlist__alloc_mmap(evlist);


The 'overwrite' argument in perf_evlist__mmap_ex() controls 
evlist->overwrite.


Thank you.



Re: [PATCH 2/2] perf record: Replace 'overwrite' by 'flightrecorder' for better naming

2017-11-01 Thread Wangnan (F)



On 2017/11/1 18:03, Namhyung Kim wrote:

On Wed, Nov 01, 2017 at 05:53:27AM +, Wang Nan wrote:

The meaning of perf record's "overwrite" option and the many uses of
"overwrite" in the source code are not clear. In perf's code, 'overwrite' has
2 meanings:
  1. Make the ring buffer read-only (perf_evlist__mmap_ex's argument).
  2. Set the evsel's "backward" attribute (in apply_config_terms).

perf record doesn't use meaning 1 at all, but has an overwrite option whose
real meaning is setting backward.

This patch separates these two concepts and introduces 'flightrecorder' mode,
which is what we really want. It combines the 2 concepts and wraps them into
a record mode. In flight recorder mode, perf only dumps data before something
happens.

I'm ok with it, but removing the old name doesn't look good.  How about
keeping it for a while (as deprecated)?


Is there a way to hide '--overwrite' from 'perf record --help' and print
something when a user really uses it?


And 'flightrecorder' seems too long.  Maybe you can use an acronym
like FDR or fdr-mode?


fdr-mode is a good name.

Thank you.



Re: [PATCH 2/2] perf, tools: Don't force MetricExprs to lower case

2017-10-12 Thread Wangnan (F)



On 2017/10/12 23:59, Jiri Olsa wrote:

On Thu, Oct 12, 2017 at 11:31:51PM +0800, Wangnan (F) wrote:

SNIP


Ok. If it works it's fine for me.

well it works, but it means that the bpf file cannot contain any directory
part.. which I'm not sure is ok with the bpf folks ;-) anyone?

Sorry I didn't see this thread these days.

Do you think adding a special escape character to suppress BPF
name parsing in an event is a good idea? For example:

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
  \___ Failed to load cpu/uops_executed.c from source:
Error when compiling BPF scriptlet. Add a leading '@' to avoid BPF syntax

% perf stat -e @cpu/uops_executed.core,cmask=1/ true
...


if we go this way, I'd rather mark the bpf syntax
instead of changing the generic event format,
like Andi suggested in some earlier email

but maybe we can workaround this with patch
I sent in my last email


Great. I'm also working on a patch which checks the existence of the
BPF file and rejects it if the file does not exist. The only
inconvenience is that we need to call stat() many times, because bpf_object
matches c[a-zA-Z0-9._]*. I think you have solved the problem in your
patch.

Thank you.
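The check described above could look roughly like this (a sketch; the helper name and the exact suffix handling are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/stat.h>

/* Sketch: treat a string as a BPF source path only if it ends in ".c" and
 * the file actually exists, so event names like "uops_executed.core" or
 * "cpu/uops_executed.core,cmask=1/" are not misparsed as BPF scriptlets. */
static bool is_bpf_source_file(const char *str)
{
	size_t len = strlen(str);
	struct stat st;

	if (len < 2 || strcmp(str + len - 2, ".c") != 0)
		return false;
	return stat(str, &st) == 0;
}
```

As the mail notes, matching c[a-zA-Z0-9._]* means such a stat() may run for many candidate substrings, which is the inconvenience being discussed.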




Re: [PATCH 2/2] perf, tools: Don't force MetricExprs to lower case

2017-10-12 Thread Wangnan (F)



On 2017/10/9 22:39, Jiri Olsa wrote:

On Mon, Oct 09, 2017 at 11:12:58AM -0300, Arnaldo Carvalho de Melo wrote:

Em Mon, Oct 09, 2017 at 07:07:29AM -0700, Andi Kleen escreveu:

On Mon, Oct 09, 2017 at 03:41:51PM +0200, Jiri Olsa wrote:

On Wed, Oct 04, 2017 at 09:27:11AM -0700, Andi Kleen wrote:

On Wed, Oct 04, 2017 at 12:30:52PM +0200, Jiri Olsa wrote:

On Tue, Oct 03, 2017 at 01:06:05PM -0300, Arnaldo Carvalho de Melo wrote:

Em Tue, Sep 12, 2017 at 12:56:43PM -0700, Andi Kleen escreveu:

From: Andi Kleen 

There are still problems with BPF misinterpreting some events
that include .c. An earlier fix made it work for stand-alone
aliases, but it still fails for more complex constructs.

Hi Wang, Jiri,

Can you please take a look at this and see if there is something
we can do to help Andi?

- Arnaldo
  

REJECT keeps trying and trying a shorter string until
.c is matched and it appears like a valid BPF path.

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
  \___ Failed to load cpu/uops_executed.c from source: 
Error when compiling BPF scriptlet

I tried to fix it, but it exceeds my flex knowledge, because
REJECT does not interact well with BEGIN states.

The BPF syntax in its current form really causes an ambiguous
grammar.

right, it looks like we allow a whole path (including the / char)
for the BPF file, which messes up our pmu/.../ syntax

do we need that? (Cc-ed some bpf folks)

if not attached patch seems to fix things.. otherwise
we need to come up with another fix

I tried similar patches, but I always ran into more complex
situations where it still matched incorrectly.

e.g. try it with cpu/uops_executed.core,... vs uops_executed.core

hm, both works for me with the change:

   perf stat -e cpu/uops_executed.core/ ls
   perf stat -e uops_executed.core ls

Ok. If it works it's fine for me.

well it works, but it means that the bpf file cannot contain any directory
part.. which I'm not sure is ok with the bpf folks ;-) anyone?


Sorry I didn't see this thread these days.

Do you think adding a special escape character to suppress BPF
name parsing in an event is a good idea? For example:

% perf stat -e cpu/uops_executed.core,cmask=1/ true
bpf: builtin compilation failed: -95, try external compiler
ERROR: problems with path cpu/uops_executed.c: No such file or directory
event syntax error: 'cpu/uops_executed.core,cmask=1/'
 \___ Failed to load cpu/uops_executed.c from 
source: Error when compiling BPF scriptlet. Add a leading '@' to avoid 
BPF syntax


% perf stat -e @cpu/uops_executed.core,cmask=1/ true
...
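The proposed '@' escape could be sketched as a tiny pre-parsing step (the helper name is hypothetical, purely illustrative of the idea):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Sketch: a leading '@' suppresses BPF source-file detection for the rest
 * of the event string, so "@cpu/uops_executed.core,cmask=1/" is parsed as
 * a plain PMU event rather than as a path to "uops_executed.c". */
static const char *skip_bpf_escape(const char *event, bool *suppress_bpf)
{
	if (event[0] == '@') {
		*suppress_bpf = true;
		return event + 1;
	}
	*suppress_bpf = false;
	return event;
}
```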



Re: [PATCH] perf tool: Don't discard prev in backward mode

2017-10-12 Thread Wangnan (F)

Hi Kan, Please see if this patch works for you.


On 2017/10/14 7:16, Wang Nan wrote:

Perf record can switch output. The new output should only store the
data after switching. However, in overwrite backward mode, the new
output still has the data from the old output. That also brings extra
overhead.

At the end of mmap_read, the position of the processed ring buffer is
saved in md->prev. The next mmap_read should end at md->prev if it has
not been overwritten. That avoids processing duplicate data.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which probably includes the already-processed
data.

Avoid calling backward_rb_find_range() when md->prev is still
available.

Signed-off-by: Wang Nan 
Cc: Liang Kan 
---
  tools/perf/util/mmap.c | 33 +++--
  1 file changed, 15 insertions(+), 18 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..df1de55 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -287,18 +287,6 @@ static int backward_rb_find_range(void *buf, int mask, u64 
head, u64 *start, u64
return -1;
  }
  
-static int rb_find_range(void *data, int mask, u64 head, u64 old,

-u64 *start, u64 *end, bool backward)
-{
-   if (!backward) {
-   *start = old;
-   *end = head;
-   return 0;
-   }
-
-   return backward_rb_find_range(data, mask, head, start, end);
-}
-
  int perf_mmap__push(struct perf_mmap *md, bool overwrite, bool backward,
void *to, int push(void *to, void *buf, size_t size))
  {
@@ -310,19 +298,28 @@ int perf_mmap__push(struct perf_mmap *md, bool overwrite, 
bool backward,
void *buf;
int rc = 0;
  
-	if (rb_find_range(data, md->mask, head, old, &start, &end, backward))

-   return -1;
+   start = backward ? head : old;
+   end = backward ? old : head;
  
  	if (start == end)

return 0;
  
  	size = end - start;

if (size > (unsigned long)(md->mask) + 1) {
-   WARN_ONCE(1, "failed to keep up with mmap data. (warn only 
once)\n");
+   if (!backward) {
+   WARN_ONCE(1, "failed to keep up with mmap data. (warn only 
once)\n");
  
-		md->prev = head;

-   perf_mmap__consume(md, overwrite || backward);
-   return 0;
+   md->prev = head;
+   perf_mmap__consume(md, overwrite || backward);
+   return 0;
+   }
+
+   /*
+* Backward ring buffer is full. We still have a chance to read
+* most of data from it.
+*/
+   if (backward_rb_find_range(data, md->mask, head, &start, &end))
+   return -1;
}
  
  	if ((start & md->mask) + size != (end & md->mask)) {


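The range selection in the hunk above can be restated as a small helper (an illustration only; 'head' and 'old' are the kernel's free-running byte counters, and in backward mode head runs behind old):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the start/end selection in perf_mmap__push(): forward buffers
 * read from the old position up to head; backward buffers read from head up
 * to the old position, so the two endpoints simply swap roles. */
static uint64_t mmap_unread_size(uint64_t head, uint64_t old, bool backward)
{
	uint64_t start = backward ? head : old;
	uint64_t end   = backward ? old  : head;

	return end - start;
}
```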



Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly

2017-10-12 Thread Wangnan (F)



On 2017/10/12 22:46, Wangnan (F) wrote:



On 2017/10/12 22:45, Liang, Kan wrote:

On 2017/10/12 20:56, Liang, Kan wrote:

On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect.
For example:

[SNIP]


In the above example we get the same records from the backward ring
buffer all the time. Overwriting is not triggered.

This commit maps backward ring buffers read-only, making them overwritable.
It is safe because we assume the backward ring buffer is always overwritable
in other parts of the code.

After applying this patch:

   $ ~/linux/tools/perf$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
       --switch-output=1s --tail-synthesize   dd if=/dev/zero of=/dev/null

[SNIP]


Signed-off-by: Wang Nan 
Cc: Liang Kan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Alexander Shishkin 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Li Zefan 
---
tools/perf/util/evlist.c | 7 ++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
}

static int perf_evlist__mmap_per_evsel(struct perf_evlist 
*evlist, int

idx,

-   struct mmap_params *mp, int cpu_idx,
+   struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int
*_output_backward)
{
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+struct mmap_params *mp = _mp;
+struct mmap_params backward_mp;

evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap; @@ -815,6 +817,9

@@ static

int perf_evlist__mmap_per_evsel(struct
perf_evlist *evlist, int idx,
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+backward_mp = *mp;
+backward_mp.prot &= ~PROT_WRITE;
+mp = &backward_mp;

if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?
Not only the per-event overwrite. See the example above. The overall
--overwrite option is also respected. In perf_evsel__config, the per-event
evsel 'backward' setting is set based on the overall '--overwrite' and the
per-event '/overwrite/' setting.
But how about evlist->overwrite? I think it still keeps the wrong setting.
The overwrite is implicitly applied. Some settings are inconsistent.

Is there any drawback if you use opts->overwrite for perf_evlist__mmap_ex?

We will always face such inconsistency, because we have
a /no-overwrite/ option which can be set per evsel.
Setting evlist->overwrite won't make things more consistent,
because in an evlist, different evsels can have different
overwrite settings. A simple solution is making the evlist
non-overwrite by default, and treating all overwrite evsels
as special cases. Then we have only 2 cases to consider:

1. overwrite evsel in a non-overwrite evlist.
2. non-overwrite evsel in a non-overwrite evlist.


If evlist->overwrite is always non-overwrite, why not remove it?


Some testcases require it.



Sorry. I think removing it is reasonable now, but we need to sort out
the relationship between overwrite and backward first. I suggest removing
the whole 'backward' concept and making evsels backward if they are
overwrite. Are there any use cases for:
1. an overwrite but not backward ring buffer: it will be unparseable after
the ring buffer is full.

2. a backward but not overwrite ring buffer: I don't see any advantage.

Thank you.


Thank you.





Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly

2017-10-12 Thread Wangnan (F)



On 2017/10/12 22:45, Liang, Kan wrote:

On 2017/10/12 20:56, Liang, Kan wrote:

On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect.
For example:

[SNIP]


In the above example we get the same records from the backward ring
buffer all the time. Overwriting is not triggered.

This commit maps backward ring buffers read-only, making them overwritable.
It is safe because we assume the backward ring buffer is always overwritable
in other parts of the code.

After applying this patch:

   $ ~/linux/tools/perf$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
       --switch-output=1s --tail-synthesize   dd if=/dev/zero of=/dev/null

[SNIP]


Signed-off-by: Wang Nan 
Cc: Liang Kan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Alexander Shishkin 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Li Zefan 
---
tools/perf/util/evlist.c | 7 ++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
}

static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int

idx,

-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int
*_output_backward)
{
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp = _mp;
+   struct mmap_params backward_mp;

evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap; @@ -815,6 +817,9

@@ static

int perf_evlist__mmap_per_evsel(struct
perf_evlist *evlist, int idx,
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   backward_mp = *mp;
+   backward_mp.prot &= ~PROT_WRITE;
+   mp = &backward_mp;

if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?

Not only the per-event overwrite. See the example above. The overall --
overwrite option is also respected. In perf_evsel__config, per-event evsel
'backward' setting is set based on overall '--overwrite' and per-event
'/overwrite/' setting.

But how about evlist->overwrite? I think it still keeps the wrong setting.
The overwrite is implicitly applied. Some settings are inconsistent.

Is there any drawback if you use opts->overwrite for

perf_evlist__mmap_ex?

We will always face such inconsistency, because we have
a /no-overwrite/ option which can be set per evsel.
Setting evlist->overwrite won't make things more consistent,
because in an evlist, different evsels can have different
overwrite settings. A simple solution is making the evlist
non-overwrite by default, and treating all overwrite evsels
as special cases. Then we have only 2 cases to consider:

1. overwrite evsel in a non-overwrite evlist.
2. non-overwrite evsel in a non-overwrite evlist.


If evlist->overwrite is always non-overwrite, why not remove it?
  


Some testcases require it.

Thank you.



Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode

2017-10-12 Thread Wangnan (F)



On 2017/10/12 20:49, Liang, Kan wrote:

From 8b058ea6977a97e5705aa2f64bdd014fd76d1247 Mon Sep 17 00:00:00 2001
From: Kan Liang 
Date: Wed, 11 Oct 2017 07:39:34 -0700
Subject: [PATCH] perf tool: fix: Don't discard prev in backward mode

Perf record can switch output. The new output should only store the
data after switching. However, in overwrite backward mode, the new
output still has the data from the old output. That also brings extra
overhead.

At the end of mmap_read, the position of the processed ring buffer is
saved in md->prev. The next mmap_read should end at md->prev if it has
not been overwritten. That avoids processing duplicate data.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which probably includes the already-processed
data.

Introduce a fast path for backward_rb_find_range. Stop searching when
md->prev is detected.

Signed-off-by: Kan Liang 
---
   tools/perf/util/mmap.c | 11 +--
   1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..36b459a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -254,7 +254,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct mmap_params *mp, int fd)

return 0;
   }

-static int backward_rb_find_range(void *buf, int mask, u64 head, u64
*start, u64 *end)
+static int backward_rb_find_range(void *buf, int mask, u64 head,
+ u64 old, u64 *start, u64 *end)
   {
struct perf_event_header *pheader;
u64 evt_head = head;
@@ -282,6 +283,12 @@ static int backward_rb_find_range(void *buf, int
mask, u64 head, u64 *start, u64

evt_head += pheader->size;
pr_debug3("move evt_head: %"PRIx64"\n", evt_head);
+
+   /* fast path: avoid to process duplicate data */
+   if (old == evt_head) {
+   *end = evt_head;
+   return 0;
+   }

You still need to parse the whole ring buffer.

I don't think so. I'm not using (old & mask).
The upper bits of old are good enough to tell whether there is overwriting.

Here are some debugging logs.
The start and end is the result from backward_rb_find_range.
If there is overwriting, it finishes with 'rewind', not 'fast path'.

sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
backward_rb_find_range: buf=0x7f9ea7a93000, head=f550
Finished reading backward ring buffer: fast path
start   0xf550 end 0x0

backward_rb_find_range: buf=0x7f9ea7a93000, head=ab30
Finished reading backward ring buffer: rewind
start   0xab30 end 0xbb10

backward_rb_find_range: buf=0x7f9ea7a93000, head=28b0
Finished reading backward ring buffer: rewind
start   0x28b0 end 0x3898

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3e40
Finished reading backward ring buffer: rewind
start   0xfffe3e40 end 0xfffe4e40

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3aa0
Finished reading backward ring buffer: fast path
start   0xfffe3aa0 end 0xfffe3e40

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe3470
Finished reading backward ring buffer: fast path
start   0xfffe3470 end 0xfffe3aa0

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe0610
Finished reading backward ring buffer: rewind
start   0xfffe0610 end 0xfffe1610

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffe00d0
Finished reading backward ring buffer: fast path
start   0xfffe00d0 end 0xfffe0610

backward_rb_find_range: buf=0x7f9ea7a93000, head=fffdfe90
Finished reading backward ring buffer: fast path
start   0xfffdfe90 end 0xfffe00d0


I'm not saying your code is incorrect. What I mean is
you can avoid calling backward_rb_find_range() which
parses the ring buffer.

Please see rb_find_range. In the case of the forward ring buffer, it never
reads anything for positioning; it just passes start and end pointers. Both
your implementation and mine require reading the ring buffer to find the
last record, which needs a while loop.

When 'old - head' is less than the size of the ring buffer, even the
backward ring buffer can avoid such reading. backward_rb_find_range()
should be called only when 'old - head' is larger than the size of the
ring buffer.

Thank you.
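The suggestion in the last paragraph can be sketched as a pure predicate (names are hypothetical): because 'old' and 'head' are free-running counters, their distance alone says whether the backward buffer wrapped, with no record parsing at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch: in backward mode head decreases, so 'old - head' is the amount of
 * new data since the last read.  Only when that exceeds the buffer size has
 * the buffer wrapped, and only then does backward_rb_find_range() need to
 * walk the records to find a parseable starting point. */
static bool backward_ring_wrapped(uint64_t head, uint64_t old, uint64_t size)
{
	return old - head > size;
}
```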



Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly

2017-10-12 Thread Wangnan (F)



On 2017/10/12 20:56, Liang, Kan wrote:

On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect.
For example:

[SNIP]


In the above example we get the same records from the backward ring
buffer all the time. Overwriting is not triggered.

This commit maps backward ring buffers read-only, making them overwritable.
It is safe because we assume the backward ring buffer is always overwritable
in other parts of the code.

After applying this patch:

   $ ~/linux/tools/perf$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
       --switch-output=1s --tail-synthesize   dd if=/dev/zero of=/dev/null

[SNIP]



Signed-off-by: Wang Nan 
Cc: Liang Kan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Alexander Shishkin 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Li Zefan 
---
   tools/perf/util/evlist.c | 7 ++-
   1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
   }

   static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int
*_output_backward)
   {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp = _mp;
+   struct mmap_params backward_mp;

evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap; @@ -815,6 +817,9

@@ static

int perf_evlist__mmap_per_evsel(struct
perf_evlist *evlist, int idx,
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   backward_mp = *mp;
+   backward_mp.prot &= ~PROT_WRITE;
+   mp = &backward_mp;

if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?

Not only the per-event overwrite. See the example above. The overall --
overwrite option is also respected. In perf_evsel__config, per-event evsel
'backward' setting is set based on overall '--overwrite' and per-event
'/overwrite/' setting.

But how about evlist->overwrite? I think it still keeps the wrong setting.
The overwrite is implicitly applied. Some settings are inconsistent.

Is there any drawback if you use opts->overwrite for perf_evlist__mmap_ex?


We will always face such inconsistency, because we have
a /no-overwrite/ option which can be set per evsel.
Setting evlist->overwrite won't make things more consistent,
because in an evlist, different evsels can have different
overwrite settings. A simple solution is making the evlist
non-overwrite by default, and treating all overwrite evsels
as special cases. Then we have only 2 cases to consider:

1. overwrite evsel in a non-overwrite evlist.
2. non-overwrite evsel in a non-overwrite evlist.

If we reset evlist->overwrite according to --overwrite, we
will have 4 cases to consider:

1. overwrite evsel in a overwrite evlist.
2. non-overwrite evsel in a overwrite evlist.
3. overwrite evsel in a non-overwrite evlist.
4. non-overwrite evsel in a non-overwrite evlist.

The real problem is: there are 'overwrite' and 'backward'
concepts in our code, but these two concepts are neither
independent nor identical.

Thank you.



Thanks,
Kan

I think we should use opts->overwrite to replace the hard-coded 'false'
for perf_evlist__mmap_ex as well.

Thanks,
Kan





Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode

2017-10-11 Thread Wangnan (F)



On 2017/10/11 22:57, Liang, Kan wrote:

If you really want to avoid record duplication, you need to
change record__mmap_read()'s logic. Now it complains "failed to
keep up with mmap data" and avoids dumping data when the size of newly
generated data is larger than the size of the ring buffer. That is
reasonable for the forward ring buffer because in this case you have lost
the head of the first record and the whole ring buffer is
unparseable. However, it is wrong in the backward case. What you
should do there is dump the whole ring buffer.


I think what you want should be something like this: (not tested)


No. That's not what I want.
My test code never triggers the WARN_ONCE.

The existing code never triggers that warning because the size computed
by rb_find_range is never larger than the size of the ring buffer. After
applying your patch, I believe it will trigger this WARN_ONCE and drop
the whole ring buffer. Please set a smaller ring buffer and try again.


I think you will see the problem if you simply run the command below:
sudo ./perf record -e cycles:P -C0 --overwrite --switch-output=1s

The output size keeps increasing, because the new output always includes
the old outputs.

What I want is the 'start' and 'end' for the increase, not everything.


This is my test result: add '-m 1' to 'perf record' to shrink the
ring buffer, and start a while loop on CPU 0 to increase the data rate.

It stops increasing after the ring buffer is full:

$:~/linux/tools/perf$ sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
Warning: File /home/w00229757/.perfconfig not owned by current user or root, ignoring it.
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165072 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165175 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165278 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165381 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165484 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165586 ]
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Dump perf.data.2017101212165653 ]
[ perf record: Captured and wrote 1.013 MB perf.data. ]

$ ls -l ./perf.data*
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165072
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165175
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165278
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165381
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165484
-rw--- 1 root root  538988 Oct 12 12:16
./perf.data.2017101212165586
-rw--- 1 root root 1067812 Oct 12 12:16
./perf.data.2017101212165653

You see the result keeps getting larger because the ring buffer is
never full in your case.

The increasing file size in my case indicates that the old processed data is
dumped into the new output. I don't think that's right, because we should
not process the same data multiple times. That definitely increases the
overhead of perf record.


For the issue, I mentioned above.
What do think about the patch as below?
It tries to avoid the duplicate data.

 From 8b058ea6977a97e5705aa2f64bdd014fd76d1247 Mon Sep 17 00:00:00 2001
From: Kan Liang 
Date: Wed, 11 Oct 2017 07:39:34 -0700
Subject: [PATCH] perf tool: fix: Don't discard prev in backward mode

Perf record can switch output. The new output should only store the
data after switching. However, in overwrite backward mode, the new
output still has the data from the old output. That also brings extra
overhead.

At the end of mmap_read, the position of the processed ring buffer is
saved in md->prev. The next mmap_read should end at md->prev if it has
not been overwritten. That avoids processing duplicate data.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which probably includes the already-processed
data.

Introduce a fast path for backward_rb_find_range. Stop searching when
md->prev is detected.

Signed-off-by: Kan Liang 
---
  tools/perf/util/mmap.c | 11 +--
  1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/mmap.c b/tools/perf/util/mmap.c
index 9fe5f9c..36b459a 100644
--- a/tools/perf/util/mmap.c
+++ b/tools/perf/util/mmap.c
@@ -254,7 +254,8 @@ int perf_mmap__mmap(struct perf_mmap *map, struct 
mmap_params *mp, int fd)
return 0;
  }
  
-static int backward_rb_find_range(void *buf, int mask, u64 head, u64 *start, u64 *end)

+static int backward_rb_find_range(void *buf, int mask, u64 head,
+ u64 old, u64 *start, u64 *end)
  {
struct perf_event_header *pheader;
u64 evt_head = head;
@@ -282,6 +283,12 @@ static int backward_rb_find_range(void *buf, int mask, u64 
head, u64 *start, u64
  
  		evt_head += pheader->size;

Re: [PATCH] perf tools: fix: Force backward ring buffer mapped readonly

2017-10-11 Thread Wangnan (F)



On 2017/10/11 21:16, Liang, Kan wrote:

perf record's --overwrite option doesn't work as we expect.
For example:


[SNIP]



In the above example we get the same records from the backward ring buffer
all the time. Overwriting is not triggered.

This commit maps backward ring buffers read-only, making them overwritable.
It is safe because we assume the backward ring buffer is always overwritable
in other parts of the code.

After applying this patch:

 $ ~/linux/tools/perf$ sudo ./perf record -m 4 -e raw_syscalls:* -g --overwrite \
     --switch-output=1s --tail-synthesize   dd if=/dev/zero of=/dev/null


[SNIP]




Signed-off-by: Wang Nan 
Cc: Liang Kan 
Cc: Arnaldo Carvalho de Melo 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Alexander Shishkin 
Cc: Adrian Hunter 
Cc: Andi Kleen 
Cc: Li Zefan 
---
  tools/perf/util/evlist.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index c6c891e..a86b0d2 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -799,12 +799,14 @@ perf_evlist__should_poll(struct perf_evlist *evlist __maybe_unused,
  }

  static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
-  struct mmap_params *mp, int cpu_idx,
+  struct mmap_params *_mp, int cpu_idx,
   int thread, int *_output, int
*_output_backward)
  {
struct perf_evsel *evsel;
int revent;
int evlist_cpu = cpu_map__cpu(evlist->cpus, cpu_idx);
+   struct mmap_params *mp = _mp;
+   struct mmap_params backward_mp;

evlist__for_each_entry(evlist, evsel) {
struct perf_mmap *maps = evlist->mmap;
@@ -815,6 +817,9 @@ static int perf_evlist__mmap_per_evsel(struct
perf_evlist *evlist, int idx,
if (evsel->attr.write_backward) {
output = _output_backward;
maps = evlist->backward_mmap;
+   backward_mp = *mp;
+   backward_mp.prot &= ~PROT_WRITE;
+   mp = &backward_mp;

if (!maps) {
maps = perf_evlist__alloc_mmap(evlist);

So it's trying to support per-event overwrite.
How about the global --overwrite option?


Not only the per-event overwrite. See the example above. The overall
--overwrite option is also respected. In perf_evsel__config, the per-event
evsel 'backward' setting is set based on the overall '--overwrite' and the
per-event '/overwrite/' setting.


I think we should use opts->overwrite to replace the hard-coded 'false' for
perf_evlist__mmap_ex as well.

Thanks,
Kan





Re: [PATCH 03/10] perf tool: new iterfaces to read event from ring buffer

2017-10-10 Thread Wangnan (F)



On 2017/10/11 3:55, Liang, Kan wrote:

On 2017/10/11 3:17, Arnaldo Carvalho de Melo wrote:

Em Wed, Oct 11, 2017 at 03:10:37AM +0800, Wangnan (F) escreveu:

On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu:

Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com

escreveu:

From: Kan Liang 

The perf_evlist__mmap_read only supports forward mode. It needs
a common function to support both forward and backward modes.
The perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell
out the bug, please do so.


union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx) {
	struct perf_mmap *md = &evlist->mmap[idx];  <--- it should be backward_mmap


If it fixes an existing bug, then it should go separate from this patchkit, right?

There is no one using perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If
there is a user now, lemme see... yeah, no user right now, so _that_ is
yet another bug, i.e. it should be used, no? If this is just a leftover,
then we should just throw it away now; it's a cleanup.

Wang, can you take a look at these two issues?

So it looks like a leftover that should've been removed by the following cset, right Wang?

commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan 
Date:   Thu Jul 14 08:34:41 2016 +

    perf evlist: Drop evlist->backward

    Now there's no real user of evlist->backward. Drop it. We are going to
    use evlist->backward_mmap as a container for backward ring buffer.

Yes, it should be removed, but then there will be no corresponding
function to perf_evlist__mmap_read(), which reads a record from the
forward ring buffer.
I think Kan wants to become the first user of this function because
he is trying to make 'perf top' utilize the backward ring buffer. It
needs perf_evlist__mmap_read_backward(), and he triggers the bug using
his unpublished patch set.
I think we can remove it now, and let Kan fix it and add it back in his
'perf top' patch set.

Well, if there will be a user, perhaps we should fix it, as it seems
interesting to have now for, as you said, a counterpart for the
forward ring buffer, and one that we have plans for using soon, right?

Right if I understand Kan's patch 00/10 correctly. He said:

...
 But perf top need to switch to overwrite backward mode for good
 performance.
...


Yes, it will be used for perf top optimization.
That will be great if you can fix it.


You can fix it and post your fix together with your perf top
patch set. I can write the code but I can't test it because
there's no user now.


I still think it's a good idea to have common interfaces for both perf record
and other tools like perf top.
rb_find_range/backward_rb_find_range could be shared.
The mmap read codes are similar.


Agree. Please keep going.


Thanks,
Kan





Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode

2017-10-10 Thread Wangnan (F)



On 2017/10/11 3:50, Liang, Kan wrote:

On 2017/10/11 2:23, Wangnan (F) wrote:


On 2017/10/11 1:20, kan.li...@intel.com wrote:

From: Kan Liang 

Perf record can switch output. The new output should only store the data
after switching. However, in overwrite backward mode, the new output
still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved
in md->prev. The next mmap_read should end at md->prev.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which definitely includes the old processed data.

Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang 
---
   tools/perf/util/evlist.c | 14 +-
   1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int
   rb_find_range(void *data, int mask, u64 head, u64 old,
 u64 *start, u64 *end, bool backward)
   {
+int ret;
+
   if (!backward) {
   *start = old;
   *end = head;
   return 0;
   }
   -return backward_rb_find_range(data, mask, head, start, end);
+ret = backward_rb_find_range(data, mask, head, start, end);
+
+/*
+ * The start and end from backward_rb_find_range is the range for all
+ * valid data in ring buffer.
+ * However, part of the data is processed previously.
+ * Reset the end to drop the processed data
+ */
+*end = old;
+

[SNIP]


If you really want to avoid record duplication, you need to change
record__mmap_read()'s logic. Now it complains "failed to keep up with
mmap data" and avoids dumping data when the size of newly generated data
is larger than the size of the ring buffer. That is reasonable for the
forward ring buffer, because in this case you lost the head of the first
record and the whole ring buffer is unparseable. However, it is wrong in
the backward case. What you should do in this case is dump the whole
ring buffer.


I think what you want should be something like this: (not tested)


No. That's not what I want.
My test code never trigger the WARN_ONCE.


The existing code never triggers that warning because the size computed
by rb_find_range is never larger than the size of the ring buffer. After
applying your patch, I believe it will trigger this WARN_ONCE and drop
the whole ring buffer. Please set a smaller ring buffer and try again.


I think you will see the problem if you simply run the command below:
sudo ./perf record -e cycles:P -C0 --overwrite --switch-output=1s

The output size keeps increasing, because the new output always includes
the old outputs.
What I want is the 'start' and 'end' for the increase, not everything.
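The incremental window asked for here — reading only what arrived since the previous dump — can be sketched as follows (a hedged sketch with hypothetical names, not the actual rb_find_range() code). In backward mode the kernel moves 'head' toward lower addresses, so the data generated since the last read lies between the current head and the position saved at the end of the previous read (md->prev):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Backward mode: 'head' marks the newest record, 'prev' is where the
 * previous read stopped. Reading [head, prev) yields only the records
 * generated since the last dump; reading all the way to the oldest
 * valid record re-emits everything, which is the duplication at issue.
 */
static void incremental_range(uint64_t head, uint64_t prev,
			      uint64_t *start, uint64_t *end)
{
	*start = head;	/* newest data */
	*end = prev;	/* stop where the last read began */
}
```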



This is my test result: add '-m 1' to 'perf record' to shrink the ring
buffer, and start a while loop on CPU 0 to increase the data rate.

It stops increasing after the ring buffer is full:

$:~/linux/tools/perf$ sudo ./perf record -m1 -e cycles:P -C0 --overwrite --switch-output=1s
  Warning: File /home/w00229757/.perfconfig not owned by current user or root, ignoring it.

[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165072 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165175 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165278 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165381 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165484 ]
[ perf record: dump data: Woken up 1 times ]
[ perf record: Dump perf.data.2017101212165586 ]
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Dump perf.data.2017101212165653 ]
[ perf record: Captured and wrote 1.013 MB perf.data. ]

$ ls -l ./perf.data*
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165072
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165175
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165278
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165381
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165484
-rw--- 1 root root  538988 Oct 12 12:16 ./perf.data.2017101212165586
-rw--- 1 root root 1067812 Oct 12 12:16 ./perf.data.2017101212165653

You see the result keeps getting larger because the ring buffer
is never full in your case.

Thank you.



Re: [PATCH 03/10] perf tool: new iterfaces to read event from ring buffer

2017-10-10 Thread Wangnan (F)



On 2017/10/11 3:17, Arnaldo Carvalho de Melo wrote:

Em Wed, Oct 11, 2017 at 03:10:37AM +0800, Wangnan (F) escreveu:


On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu:

Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com escreveu:

From: Kan Liang 

The perf_evlist__mmap_read only supports forward mode. It needs a
common function to support both forward and backward modes.
The perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell out the bug,
please do so.


union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
{
	struct perf_mmap *md = &evlist->mmap[idx];  <--- it should be backward_mmap


If it fixes an existing bug, then it should go separate from this patchkit, right?

There is no one using perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If
there is a user now, lemme see... yeah, no user right now, so _that_ is
yet another bug, i.e. it should be used, no? If this is just a leftover,
then we should just throw it away now; it's a cleanup.

Wang, can you take a look at these two issues?

So it looks like a leftover that should've been removed by the following cset, right Wang?
commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan 
Date:   Thu Jul 14 08:34:41 2016 +
  perf evlist: Drop evlist->backward
  Now there's no real user of evlist->backward. Drop it. We are going to
  use evlist->backward_mmap as a container for backward ring buffer.
  

Yes, it should be removed, but then there will be no corresponding
function to perf_evlist__mmap_read(), which reads a record from the
forward ring buffer.

I think Kan wants to become the first user of this function because
he is trying to make 'perf top' utilize the backward ring buffer. It needs
perf_evlist__mmap_read_backward(), and he triggers the bug using his
unpublished patch set.
  

I think we can remove it now, and let Kan fix it and add it back in his
'perf top' patch set.

Well, if there will be a user, perhaps we should fix it, as it seems
interesting to have now for, as you said, a counterpart for the forward
ring buffer, and one that we have plans for using soon, right?


Right if I understand Kan's patch 00/10 correctly. He said:

...
   But perf top need to switch to overwrite backward mode for good
   performance.
...

Thank you.



Re: [PATCH 03/10] perf tool: new iterfaces to read event from ring buffer

2017-10-10 Thread Wangnan (F)



On 2017/10/11 3:00, Arnaldo Carvalho de Melo wrote:

Em Tue, Oct 10, 2017 at 03:36:28PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 03:34:55PM -0300, Arnaldo Carvalho de Melo escreveu:

Em Tue, Oct 10, 2017 at 06:28:18PM +, Liang, Kan escreveu:

Em Tue, Oct 10, 2017 at 10:20:16AM -0700, kan.li...@intel.com escreveu:

From: Kan Liang 

The perf_evlist__mmap_read only supports forward mode. It needs a
common function to support both forward and backward modes.
The perf_evlist__mmap_read_backward is buggy.

So, what is the bug? You state that it is buggy, but don't spell out the bug,
please do so.


union perf_event *perf_evlist__mmap_read_backward(struct perf_evlist *evlist, int idx)
{
	struct perf_mmap *md = &evlist->mmap[idx];  <--- it should be backward_mmap


If it fixes an existing bug, then it should go separate from this patchkit, 
right?

There is no one using perf_evlist__mmap_read_backward, so it doesn't trigger any issue.

There is no one at the end of your patchkit? Or no user _right now_? If
there is a user now, lemme see... yeah, no user right now, so _that_ is
yet another bug, i.e. it should be used, no? If this is just a leftover,
then we should just throw it away now; it's a cleanup.

Wang, can you take a look at these two issues?

So it looks like a leftover that should've been removed by the following cset, right Wang?

- Arnaldo

commit a0c6f451f90204847ce5f91c3268d83a76bde1b6
Author: Wang Nan 
Date:   Thu Jul 14 08:34:41 2016 +

 perf evlist: Drop evlist->backward
 
 Now there's no real user of evlist->backward. Drop it. We are going to

 use evlist->backward_mmap as a container for backward ring buffer.



Yes, it should be removed, but then there will be no corresponding
function to perf_evlist__mmap_read(), which reads a record from the
forward ring buffer.

I think Kan wants to become the first user of this function because
he is trying to make 'perf top' utilize the backward ring buffer. It needs
perf_evlist__mmap_read_backward(), and he triggers the bug using his
unpublished patch set.

I think we can remove it now, and let Kan fix it and add it back in his
'perf top' patch set.

Thank you.



Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode

2017-10-10 Thread Wangnan (F)



On 2017/10/11 2:23, Wangnan (F) wrote:



On 2017/10/11 1:20, kan.li...@intel.com wrote:

From: Kan Liang 

Perf record can switch output. The new output should only store the data
after switching. However, in overwrite backward mode, the new output
still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved
in md->prev. The next mmap_read should end at md->prev.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which definitely includes the old processed data.

Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang 
---
  tools/perf/util/evlist.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int
  rb_find_range(void *data, int mask, u64 head, u64 old,
u64 *start, u64 *end, bool backward)
  {
+int ret;
+
  if (!backward) {
  *start = old;
  *end = head;
  return 0;
  }
  -return backward_rb_find_range(data, mask, head, start, end);
+ret = backward_rb_find_range(data, mask, head, start, end);
+
+/*
+ * The start and end from backward_rb_find_range is the range for all
+ * valid data in ring buffer.
+ * However, part of the data is processed previously.
+ * Reset the end to drop the processed data
+ */
+*end = old;
+


[SNIP]



If you really want to avoid record duplication, you need to change
record__mmap_read()'s logic. Now it complains "failed to keep up with
mmap data" and avoids dumping data when the size of newly generated data
is larger than the size of the ring buffer. That is reasonable for the
forward ring buffer, because in this case you lost the head of the first
record and the whole ring buffer is unparseable. However, it is wrong in
the backward case. What you should do in this case is dump the whole
ring buffer.



I think what you want should be something like this: (not tested)

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index ee7d0a8..f621a8e 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -173,7 +173,9 @@ rb_find_range(void *data, int mask, u64 head, u64 old,
 return 0;
 }

-return backward_rb_find_range(data, mask, head, start, end);
+*start = head;
+*end = old;
+return 0;
 }

 static int
@@ -199,10 +201,15 @@ record__mmap_read(struct record *rec, struct perf_mmap *md,


 size = end - start;
 if (size > (unsigned long)(md->mask) + 1) {
-WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");
+if (!backward) {
+WARN_ONCE(1, "failed to keep up with mmap data. (warn only once)\n");

-md->prev = head;
-perf_mmap__consume(md, overwrite || backward);
+md->prev = head;
+perf_mmap__consume(md, overwrite || backward);
+} else {
+backward_rb_find_range(data, md->mask, head, start, end);
+/* FIXME: error processing */
+}
 return 0;
 }


Use 'head' and 'old' to locate the data position in the ring buffer by
default. If overwrite happens, use backward_rb_find_range() to fetch the
last available record and dump the whole ring buffer.
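The policy in the diff above — warn and skip on forward overflow, but re-locate and dump everything on backward overflow — can be condensed into a small decision helper (illustrative only; names are hypothetical, not perf code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * If more data was produced than fits in the buffer, a forward buffer
 * is unparseable (the first record's header was overwritten), but a
 * backward buffer can still be dumped in full after re-locating its
 * valid range with something like backward_rb_find_range().
 */
enum overflow_action {
	ACT_READ_RANGE,		/* normal case: read [start, end) */
	ACT_WARN_AND_SKIP,	/* forward overflow: data lost, skip */
	ACT_DUMP_WHOLE_BUFFER,	/* backward overflow: re-find range, dump all */
};

static enum overflow_action classify(uint64_t size, uint64_t mask, bool backward)
{
	if (size <= mask + 1)
		return ACT_READ_RANGE;
	return backward ? ACT_DUMP_WHOLE_BUFFER : ACT_WARN_AND_SKIP;
}
```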


Thank you.




Re: [PATCH 02/10] perf tool: fix: Don't discard prev in backward mode

2017-10-10 Thread Wangnan (F)



On 2017/10/11 1:20, kan.li...@intel.com wrote:

From: Kan Liang 

Perf record can switch output. The new output should only store the data
after switching. However, in overwrite backward mode, the new output
still has the data from the old output.

At the end of mmap_read, the position of the processed ring buffer is saved
in md->prev. The next mmap_read should end at md->prev.
However, md->prev is discarded, so the next mmap_read has to process the
whole valid ring buffer, which definitely includes the old processed data.

Set prev as the end of the range in backward mode.

Signed-off-by: Kan Liang 
---
  tools/perf/util/evlist.c | 14 +-
  1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 33b8837..7d23cf5 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -742,13 +742,25 @@ static int
  rb_find_range(void *data, int mask, u64 head, u64 old,
  u64 *start, u64 *end, bool backward)
  {
+   int ret;
+
if (!backward) {
*start = old;
*end = head;
return 0;
}
  
-	return backward_rb_find_range(data, mask, head, start, end);

+   ret = backward_rb_find_range(data, mask, head, start, end);
+
+   /*
+* The start and end from backward_rb_find_range is the range for all
+* valid data in ring buffer.
+* However, part of the data is processed previously.
+* Reset the end to drop the processed data
+*/
+   *end = old;
+


I don't understand this patch. What rb_find_range() wants to do is find
the start and end pointers, so record__mmap_read() knows where to start
reading and where to stop.
In backward mode, the 'start' pointer is clearly 'head' (because 'head'
is the header of the last record the kernel writes to the ring buffer),
but the 'end' pointer is not very easy to find (the 'end' pointer is the
last (the earliest) available record in the ring buffer).
We have to parse the whole ring buffer to find the last available record
and set its tail into 'end'; this is what backward_rb_find_range() tries
hard to do. However, your patch unconditionally overwrites *end (which
is set by backward_rb_find_range()), making backward_rb_find_range()
meaningless.
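The walk that backward_rb_find_range() has to perform can be sketched roughly as follows (a simplified size-only record header and a hypothetical function name; the real code parses struct perf_event_header and handles wrap-around more carefully):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified record header: the real one is struct perf_event_header
 * (type, misc, size); only 'size' matters for this walk. */
struct rec_hdr {
	uint16_t size;
};

/*
 * In backward mode the kernel writes records toward lower addresses,
 * so 'head' points at the newest record and older records follow at
 * higher offsets. Walk from 'head' summing record sizes until we hit
 * an unwritten (zero-size) slot or would exceed one full buffer.
 */
static void find_valid_range(const unsigned char *data, uint64_t mask,
			     uint64_t head, uint64_t *start, uint64_t *end)
{
	uint64_t pos = head;

	*start = head;
	while (1) {
		struct rec_hdr hdr;

		memcpy(&hdr, &data[pos & mask], sizeof(hdr));
		if (hdr.size == 0)
			break;		/* no record written here */
		if (pos + hdr.size - head > mask + 1)
			break;		/* would wrap past one full buffer */
		pos += hdr.size;
	}
	*end = pos;
}
```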

In my mind, when you decide to use the backward ring buffer you must
make it overwrite, because you want the kernel to silently overwrite old
records without waking up perf, while still being able to parse the ring
buffer even if overwriting happens. The use case should be: perf runs in
the background, the kernel silently ignores old records in the ring
buffer until something bad happens, then perf dumps the 'events before
the bad thing'. Similar to a flight recorder. In this use case, each
dump is independent. Even if one record appears in a previous dump, it
is harmless (and useful) to make it appear in a following dump again.

If you really want to avoid record duplication, you need to change
record__mmap_read()'s logic. Now it complains "failed to keep up with
mmap data" and avoids dumping data when the size of newly generated data
is larger than the size of the ring buffer. That is reasonable for the
forward ring buffer, because in this case you lost the head of the first
record and the whole ring buffer is unparseable. However, it is wrong in
the backward case. What you should do in this case is dump the whole
ring buffer.


+   return ret;
  }
  
  /*





Re: [PATCHv2] perf bpf: Fix endianness problem when loading parameters in prologue

2017-08-15 Thread Wangnan (F)



On 2017/8/15 14:42, Thomas-Mich Richter wrote:

On 08/14/2017 06:39 PM, Arnaldo Carvalho de Melo wrote:

Em Mon, Aug 14, 2017 at 01:46:44PM +0200, Thomas Richter escreveu:

The perf BPF prologue generator unconditionally fetches 8 bytes for function
parameters, which causes a problem on big-endian machines. Thomas gives a
detailed analysis of this problem:

  
http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts data from
memory to expected type.

Now the test runs successfully on 4.13.0-rc5:
[root@s8360046 perf]# ./perf test  bpf
38: BPF filter :
38.1: Basic BPF filtering  : Ok
38.2: BPF pinning  : Ok
38.3: BPF prologue generation  : Ok
38.4: BPF relocation checker   : Ok
[root@s8360046 perf]#

Signed-off-by: Thomas Richter 
Signed-off-by: Wang Nan 
Acked-by: Wang Nan 
Tested-by: Wang Nan 


That is strange, who is the author of the patch? Also I think Tested-by
is enough, being an even stronger form of Acked-by?

But then you also have Signed-off-by: Wang in there...

 From Documentation/process/submitting-patches.rst:

-

12) When to use Acked-by: and Cc:
-

The Signed-off-by: tag indicates that the signer was involved in the
development of the patch, or that he/she was in the patch's delivery path.

If a person was not directly involved in the preparation or handling of a
patch but wishes to signify and record their approval of it then they can
ask to have an Acked-by: line added to the patch's changelog.

Acked-by: is often used by the maintainer of the affected code when that
maintainer neither contributed to nor forwarded the patch.

--

If Wang wrote the original patch and you made it better working together
with him, probably having both of you in Signed-off-by lines should be
enough?

- Arnaldo


Ok, my fault then.
Wang wrote the patch in the first place; I just fixed one line.
Should I resend the patch and delete the Acked-by/Tested-by lines
in the commit message?


Yes, please resend it.


Thanks






Re: perf bpf: Reworked fix endianness problem when loading parameters in prologue

2017-08-14 Thread Wangnan (F)

OK.


On 2017/8/14 18:58, Thomas-Mich Richter wrote:

On 08/14/2017 12:30 PM, Wangnan (F) wrote:

Hi Thomas,

Your patch looks good to me. I've tested in my environment and it works.
Please resend it to lkml and let Arnaldo to collect it.

Thank you.


Will do, I take this as an Acked-by and Tested-by, ok?

Thanks





Re: perf bpf: Reworked fix endianness problem when loading parameters in prologue

2017-08-14 Thread Wangnan (F)

Hi Thomas,

Your patch looks good to me. I've tested in my environment and it works.
Please resend it to lkml and let Arnaldo to collect it.

Thank you.

On 2017/8/14 17:47, Thomas Richter wrote:

The perf BPF prologue generator unconditionally fetches 8 bytes for function
parameters, which causes a problem on big-endian machines. Thomas gives a
detailed analysis of this problem:

  
http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts data from
memory to expected type.

Now the test runs successfully on 4.13.0-rc5:
[root@s8360046 perf]# ./perf test  bpf
38: BPF filter :
38.1: Basic BPF filtering  : Ok
38.2: BPF pinning  : Ok
38.3: BPF prologue generation  : Ok
38.4: BPF relocation checker   : Ok
[root@s8360046 perf]#

Signed-off-by: Thomas Richter 
Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Thomas Richter 
Cc: Alexei Starovoitov 
Cc: Hendrik Brueckner 
Cc: Li Zefan 
---
  tools/perf/util/bpf-prologue.c | 49 --
  1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
index 1356220..827f914 100644
--- a/tools/perf/util/bpf-prologue.c
+++ b/tools/perf/util/bpf-prologue.c
@@ -58,6 +58,46 @@ check_pos(struct bpf_insn_pos *pos)
return 0;
  }
  
+/*

+ * Convert type string (u8/u16/u32/u64/s8/s16/s32/s64 ..., see
+ * Documentation/trace/kprobetrace.txt) to size field of BPF_LDX_MEM
+ * instruction (BPF_{B,H,W,DW}).
+ */
+static int
+argtype_to_ldx_size(const char *type)
+{
+   int arg_size = type ? atoi(&type[1]) : 64;
+
+   switch (arg_size) {
+   case 8:
+   return BPF_B;
+   case 16:
+   return BPF_H;
+   case 32:
+   return BPF_W;
+   case 64:
+   default:
+   return BPF_DW;
+   }
+}
+
+static const char *
+insn_sz_to_str(int insn_sz)
+{
+   switch (insn_sz) {
+   case BPF_B:
+   return "BPF_B";
+   case BPF_H:
+   return "BPF_H";
+   case BPF_W:
+   return "BPF_W";
+   case BPF_DW:
+   return "BPF_DW";
+   default:
+   return "UNKNOWN";
+   }
+}
+
  /* Give it a shorter name */
  #define ins(i, p) append_insn((i), (p))
  
@@ -258,9 +298,14 @@ gen_prologue_slowpath(struct bpf_insn_pos *pos,

}
  
  	/* Final pass: read to registers */

-   for (i = 0; i < nargs; i++)
-   ins(BPF_LDX_MEM(BPF_DW, BPF_PROLOGUE_START_ARG_REG + i,
+   for (i = 0; i < nargs; i++) {
+   int insn_sz = (args[i].ref) ? argtype_to_ldx_size(args[i].type) : BPF_DW;
+
+   pr_debug("prologue: load arg %d, insn_sz is %s\n",
+i, insn_sz_to_str(insn_sz));
+   ins(BPF_LDX_MEM(insn_sz, BPF_PROLOGUE_START_ARG_REG + i,
BPF_REG_FP, -BPF_REG_SIZE * (i + 1)), pos);
+   }
  
  	ins(BPF_JMP_IMM(BPF_JA, BPF_REG_0, 0, JMP_TO_SUCCESS_CODE), pos);
  





Re: [PATCH] perf bpf: Fix endianness problem when loading parameters in prologue

2017-08-11 Thread Wangnan (F)

Hi Thomas,

Please try this patch on your machine and give me the result.

Thank you.

On 2017/8/13 2:49, Wang Nan wrote:

The perf BPF prologue generator unconditionally fetches 8 bytes for function
parameters, which causes a problem on big-endian machines. Thomas gives a
detailed analysis of this problem:

  
http://lkml.kernel.org/r/968ebda5-abe4-8830-8d69-49f62529d...@linux.vnet.ibm.com

This patch parses the type of each argument and converts data from
memory to expected type.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
Cc: Thomas Richter 
Cc: Alexei Starovoitov 
Cc: Hendrik Brueckner 
Cc: Li Zefan 
---
  tools/perf/tests/bpf-script-test-prologue.c |  4 ++-
  tools/perf/util/bpf-prologue.c  | 49 +++--
  2 files changed, 50 insertions(+), 3 deletions(-)

diff --git a/tools/perf/tests/bpf-script-test-prologue.c b/tools/perf/tests/bpf-script-test-prologue.c
index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
(void *) 6;
  
  SEC("func=null_lseek file->f_mode offset orig")

-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
 unsigned long offset, unsigned long orig)
  {
+   fmode_t f_mode = (fmode_t)_f_mode;
+
if (err)
return 0;
if (f_mode & FMODE_WRITE)
diff --git a/tools/perf/util/bpf-prologue.c b/tools/perf/util/bpf-prologue.c
index 6cdbee1..ce28993 100644
--- a/tools/perf/util/bpf-prologue.c
+++ b/tools/perf/util/bpf-prologue.c
@@ -57,6 +57,46 @@ check_pos(struct bpf_insn_pos *pos)
return 0;
  }
  
+/*

+ * Convert type string (u8/u16/u32/u64/s8/s16/s32/s64 ..., see
+ * Documentation/trace/kprobetrace.txt) to size field of BPF_LDX_MEM
+ * instruction (BPF_{B,H,W,DW}).
+ */
+static int
+argtype_to_ldx_size(const char *type)
+{
+   int arg_size = type ? atoi(&type[1]) : 64;
+
+   switch (arg_size) {
+   case 8:
+   return BPF_B;
+   case 16:
+   return BPF_H;
+   case 32:
+   return BPF_W;
+   case 64:
+   default:
+   return BPF_DW;
+   }
+}
+
+static const char *
+insn_sz_to_str(int insn_sz)
+{
+   switch (insn_sz) {
+   case BPF_B:
+   return "BPF_B";
+   case BPF_H:
+   return "BPF_H";
+   case BPF_W:
+   return "BPF_W";
+   case BPF_DW:
+   return "BPF_DW";
+   default:
+   return "UNKNOWN";
+   }
+}
+
  /* Give it a shorter name */
  #define ins(i, p) append_insn((i), (p))
  
@@ -257,9 +297,14 @@ gen_prologue_slowpath(struct bpf_insn_pos *pos,

}
  
  	/* Final pass: read to registers */

-   for (i = 0; i < nargs; i++)
-   ins(BPF_LDX_MEM(BPF_DW, BPF_PROLOGUE_START_ARG_REG + i,
+   for (i = 0; i < nargs; i++) {
+   int insn_sz = argtype_to_ldx_size(args[i].type);
+
+   pr_debug("prologue: load arg %d, insn_sz is %s\n",
+i, insn_sz_to_str(insn_sz));
+   ins(BPF_LDX_MEM(insn_sz, BPF_PROLOGUE_START_ARG_REG + i,
BPF_REG_FP, -BPF_REG_SIZE * (i + 1)), pos);
+   }
  
  	ins(BPF_JMP_IMM(BPF_JA, BPF_REG_0, 0, JMP_TO_SUCCESS_CODE), pos);
  





Re: perf test BPF subtest bpf-prologue test fails on s390x

2017-08-11 Thread Wangnan (F)



On 2017/8/11 17:17, Thomas-Mich Richter wrote:

On 08/11/2017 07:19 AM, Wangnan (F) wrote:


On 2017/8/11 2:13, Arnaldo Carvalho de Melo wrote:

Thomas, please try to find who wrote that test and CC him, I'm doing it
this time, Wang, can you please take a look at this?

I also added lkml to the CC list, here we have more users of perf, lkml
is the more developer centric perf list, as perf touches way too many
places in the kernel.

Best regards,

- Arnaldo

Em Wed, Aug 09, 2017 at 11:24:19AM +0200, Thomas Richter escreveu:

I investigate perf test BPF for s390x and have a question regarding
the 38.3 subtest (bpf-prologue test) which fails on s390x.

When I turn on trace_printk in tests/bpf-script-test-prologue.c
I see this output in /sys/kernel/debug/tracing/trace:

[root@s8360047 perf]# cat /sys/kernel/debug/tracing/trace
perf-30229 [000] d..2 170161.535791: : f_mode 2001d offset:0 orig:0
perf-30229 [000] d..2 170161.535809: : f_mode 6001f offset:0 orig:0
perf-30229 [000] d..2 170161.535815: : f_mode 6001f offset:1 orig:0
perf-30229 [000] d..2 170161.535819: : f_mode 2001d offset:1 orig:0
perf-30229 [000] d..2 170161.535822: : f_mode 2001d offset:2 orig:1
perf-30229 [000] d..2 170161.535825: : f_mode 6001f offset:2 orig:1
perf-30229 [000] d..2 170161.535828: : f_mode 6001f offset:3 orig:1
perf-30229 [000] d..2 170161.535832: : f_mode 2001d offset:3 orig:1
perf-30229 [000] d..2 170161.535835: : f_mode 2001d offset:4 orig:0
perf-30229 [000] d..2 170161.535841: : f_mode 6001f offset:4 orig:0

[...]

There are 3 parameters the eBPF program tests/bpf-script-test-prologue.c
accesses: f_mode (a member of struct file at offset 140), offset, and orig.
They are parameters of the lseek() system call triggered in this
test case in function llseek_loop().

What is really strange is the value of f_mode. It is an 8 byte
value, whereas in the probe event it is defined as a 4 byte value.
The lower 4 bytes are all zero and do not belong to member f_mode.
The correct value should be 2001d for read-only and 6001f for
read-write open mode.

Here is the output of the 'perf test -vv bpf' trace:
 Try to find probe point from debuginfo.
 Matched function: null_lseek [2d9310d]
 Probe point found: null_lseek+0
 Searching 'file' variable in context.
 Converting variable file into trace event.
 converting f_mode in file
 f_mode type is unsigned int.
 Opening /sys/kernel/debug/tracing//README write=0
 Searching 'offset' variable in context.
 Converting variable offset into trace event.
 offset type is long long int.
 Searching 'orig' variable in context.
 Converting variable orig into trace event.
 orig type is int.
 Found 1 probe_trace_events.
 Opening /sys/kernel/debug/tracing//kprobe_events write=1
 Writing event: p:perf_bpf_probe/func _text+8794224 f_mode=+140(%r2):x32
 offset=%r3:s64 orig=%r4:s32

Thank you for your information. This is an endianness problem.

Here f_mode is x32, so perf probe and the debuginfo in vmlinux are correct.
However, the BPF prologue generator doesn't obey the type, but unconditionally
fetches 8 bytes (on a 64-bit machine) and passes them to the parameter of
bpf_func__null_lseek. This is reasonable, because all args should be
unsigned long. However, to recover its original value, we need a cast.

Please help me verify if the following fix works on your platform:

diff --git a/tools/perf/tests/bpf-script-test-prologue.c b/tools/perf/tests/bpf-script-test-prologue.c
index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int fmt_size, ...) =
 	(void *) 6;
 (void *) 6;

  SEC("func=null_lseek file->f_mode offset orig")
-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
  unsigned long offset, unsigned long orig)
  {
+   fmode_t f_mode = (fmode_t)_f_mode;
+
 if (err)
 return 0;
 if (f_mode & FMODE_WRITE)

Thank you.


Thanks for this patch, unfortunately it does not work. The result is the same
as before. Here is the trace output from /sys/kernel/debug/tracing/trace:
perf-18119 [001] d..2 66832.765154: : f_mode 0 offset:0 orig:0
perf-18119 [001] d..2 66832.765189: : f_mode 0 offset:0 orig:0
perf-18119 [001] d..2 66832.765203: : f_mode 0 offset:1 orig:0
perf-18119 [001] d..2 66832.765216: : f_mode 0 offset:1 orig:0
perf-18119 [001] d..2 66832.765229: : f_mode 0 offset:2 orig:1
perf-18119 [001] d..2 66832.765242: : f_mode 0 offset:2 orig:1
 


So casting on a big-endian platform doesn't work as I thought.


However, when I alter your patch and right-shift _f_mode by 32 bits
like this:



Re: perf test BPF subtest bpf-prologue test fails on s390x

2017-08-10 Thread Wangnan (F)



On 2017/8/11 2:13, Arnaldo Carvalho de Melo wrote:

Thomas, please try to find who wrote that test and CC him, I'm doing it
this time, Wang, can you please take a look at this?

I also added lkml to the CC list, here we have more users of perf, lkml
is the more developer centric perf list, as perf touches way too many
places in the kernel.

Best regards,

- Arnaldo

Em Wed, Aug 09, 2017 at 11:24:19AM +0200, Thomas Richter escreveu:

I investigate perf test BPF for s390x and have a question regarding
the 38.3 subtest (bpf-prologue test) which fails on s390x.

When I turn on trace_printk in tests/bpf-script-test-prologue.c
I see this output in /sys/kernel/debug/tracing/trace:

[root@s8360047 perf]# cat /sys/kernel/debug/tracing/trace
perf-30229 [000] d..2 170161.535791: : f_mode 2001d offset:0 orig:0
perf-30229 [000] d..2 170161.535809: : f_mode 6001f offset:0 orig:0
perf-30229 [000] d..2 170161.535815: : f_mode 6001f offset:1 orig:0
perf-30229 [000] d..2 170161.535819: : f_mode 2001d offset:1 orig:0
perf-30229 [000] d..2 170161.535822: : f_mode 2001d offset:2 orig:1
perf-30229 [000] d..2 170161.535825: : f_mode 6001f offset:2 orig:1
perf-30229 [000] d..2 170161.535828: : f_mode 6001f offset:3 orig:1
perf-30229 [000] d..2 170161.535832: : f_mode 2001d offset:3 orig:1
perf-30229 [000] d..2 170161.535835: : f_mode 2001d offset:4 orig:0
perf-30229 [000] d..2 170161.535841: : f_mode 6001f offset:4 orig:0

[...]

There are 3 parameters the eBPF program tests/bpf-script-test-prologue.c
accesses: f_mode (member of struct file at offset 140), offset and orig.
They are parameters of the lseek() system call triggered in this
test case in function llseek_loop().

What is really strange is the value of f_mode. It is an 8 byte
value, whereas in the probe event it is defined as a 4 byte value.
The lower 4 bytes are all zero and do not belong to member f_mode.
The correct value should be 2001d for read-only and 6001f for
read-write open mode.

Here is the output of the 'perf test -vv bpf' trace:
Try to find probe point from debuginfo.
Matched function: null_lseek [2d9310d]
Probe point found: null_lseek+0
Searching 'file' variable in context.
Converting variable file into trace event.
converting f_mode in file
f_mode type is unsigned int.
Opening /sys/kernel/debug/tracing//README write=0
Searching 'offset' variable in context.
Converting variable offset into trace event.
offset type is long long int.
Searching 'orig' variable in context.
Converting variable orig into trace event.
orig type is int.
Found 1 probe_trace_events.
Opening /sys/kernel/debug/tracing//kprobe_events write=1
Writing event: p:perf_bpf_probe/func _text+8794224 f_mode=+140(%r2):x32
offset=%r3:s64 orig=%r4:s32


Thank you for your information. This is an endianness problem.

Here f_mode is x32, so perf probe and the debuginfo in vmlinux are correct.
However, the BPF prologue generator doesn't obey the type, but unconditionally
fetches 8 bytes (on a 64-bit machine) and passes them to the parameter of
bpf_func__null_lseek. This is reasonable, because all args should be
unsigned long. However, to recover the original value, we need a cast.

Please help me verify if the following fix works on your platform:

diff --git a/tools/perf/tests/bpf-script-test-prologue.c 
b/tools/perf/tests/bpf-script-test-prologue.c

index b4ebc75..43f1e16 100644
--- a/tools/perf/tests/bpf-script-test-prologue.c
+++ b/tools/perf/tests/bpf-script-test-prologue.c
@@ -26,9 +26,11 @@ static void (*bpf_trace_printk)(const char *fmt, int 
fmt_size, ...) =

(void *) 6;

 SEC("func=null_lseek file->f_mode offset orig")
-int bpf_func__null_lseek(void *ctx, int err, unsigned long f_mode,
+int bpf_func__null_lseek(void *ctx, int err, unsigned long _f_mode,
 unsigned long offset, unsigned long orig)
 {
+   fmode_t f_mode = (fmode_t)_f_mode;
+
if (err)
return 0;
if (f_mode & FMODE_WRITE)

Thank you.



Re: change uprobe_events default ? Was: [PATCH] perf: Rename CONFIG_[UK]PROBE_EVENT to CONFIG_[UK]PROBE_EVENTS

2017-03-15 Thread Wangnan (F)



On 2017/3/16 9:06, Arnaldo Carvalho de Melo wrote:

Added more people to the CC list.

Em Wed, Mar 15, 2017 at 05:58:19PM -0700, Alexei Starovoitov escreveu:

On Thu, Feb 16, 2017 at 05:00:50PM +1100, Anton Blanchard wrote:

We have uses of CONFIG_UPROBE_EVENT and CONFIG_KPROBE_EVENT as
well as CONFIG_UPROBE_EVENTS and CONFIG_KPROBE_EVENTS. Consistently
use the plurals.
  

this rename made me notice that UPROBE_EVENTS still defaults to 'n'.
this is a key feature that all distros enable, so having default 'n'
is kinda saying that it's not something that should be turned on
or used widely, which is obviously not the case.
imo it's time to change it to 'y'.
Thoughts?

Agreed, I also found it strange that it was disabled by default when I
recently did a 'make oldconfig' :-\

- Arnaldo


+1.

Some internal distros in my company have now started
discussing support for user-space probing. Turning
uprobes on in the upstream kernel saves us a lot
of discussion.

Thank you.



Re: [PATCH net-next v1] bpf: Remove redundant ifdef

2017-02-14 Thread Wangnan (F)



On 2017/2/15 1:07, David Miller wrote:

From: "Wangnan (F)" 
Date: Mon, 13 Feb 2017 09:53:49 +0800



On 2017/2/12 3:37, Mickaël Salaün wrote:

Remove a useless ifdef __NR_bpf as requested by Wang Nan.

Inline one-line static functions as it was in the bpf_sys.h file.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Wang Nan 
Link:
https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
---
   tools/lib/bpf/bpf.c | 12 +++-
   1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 50e04cc5..2de9c386989a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,21 +42,15 @@
   # endif
   #endif
   -static __u64 ptr_to_u64(const void *ptr)
+static inline __u64 ptr_to_u64(const void *ptr)
   {
return (__u64) (unsigned long) ptr;
   }
   -static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
-  unsigned int size)
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+ unsigned int size)
   {
-#ifdef __NR_bpf
return syscall(__NR_bpf, cmd, attr, size);
-#else
-   fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
-   errno = ENOSYS;
-   return -1;
-#endif
   }
 int bpf_create_map(enum bpf_map_type map_type, int key_size,

Acked-by: Wang Nan 

However, it is better to merge this patch with commit
702498a1426bc95b6f49f9c5fba616110cbd3947.

I don't know where this commit ID is.  Since this patch is targetting
net-next I would expect a commit ID with not context to be in that
tree.  Please always specify where the commit ID you mention is.


It is "bpf: Remove bpf_sys.h from selftests" in net-next.



Futhermore, commits in net-next are permanent so it is not possible
afterwards to "merge this patch with commit X".


I understand. Maintainers sometimes reset
their head to an earlier version and amend
the commit to keep the history clean, but
clearly net-next never does this.

Thank you.



Re: [PATCH net-next v1] bpf: Remove redundant ifdef

2017-02-12 Thread Wangnan (F)



On 2017/2/12 3:37, Mickaël Salaün wrote:

Remove a useless ifdef __NR_bpf as requested by Wang Nan.

Inline one-line static functions as it was in the bpf_sys.h file.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: Wang Nan 
Link: https://lkml.kernel.org/r/828ab1ff-4dcf-53ff-c97b-074adb895...@huawei.com
---
  tools/lib/bpf/bpf.c | 12 +++-
  1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 50e04cc5..2de9c386989a 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,21 +42,15 @@
  # endif
  #endif
  
-static __u64 ptr_to_u64(const void *ptr)

+static inline __u64 ptr_to_u64(const void *ptr)
  {
return (__u64) (unsigned long) ptr;
  }
  
-static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,

-  unsigned int size)
+static inline int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
+ unsigned int size)
  {
-#ifdef __NR_bpf
return syscall(__NR_bpf, cmd, attr, size);
-#else
-   fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
-   errno = ENOSYS;
-   return -1;
-#endif
  }
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size,


Acked-by: Wang Nan 

However, it is better to merge this patch with commit 
702498a1426bc95b6f49f9c5fba616110cbd3947.


Thank you.



Re: [PATCH v4 0/3] Miscellaneous fixes for BPF (perf tree)

2017-02-12 Thread Wangnan (F)



On 2017/2/9 4:27, Mickaël Salaün wrote:

This series brings some fixes and small improvements to the BPF samples.

This is intended for the perf tree and apply on 7a5980f9c006 ("tools lib bpf:
Add missing header to the library").

Changes since v3:
* remove applied patch 1/5
* remove patch 2/5 on bpf_load_program() as requested by Wang Nan

Changes since v2:
* add this cover letter

Changes since v1:
* exclude patches not intended for the perf tree

Regards,

Mickaël Salaün (3):
   samples/bpf: Ignore already processed ELF sections
   samples/bpf: Reset global variables
   samples/bpf: Add missing header

  samples/bpf/bpf_load.c | 7 +++
  samples/bpf/tracex5_kern.c | 1 +
  2 files changed, 8 insertions(+)


Looks good to me.

Thank you.



Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library

2017-02-09 Thread Wangnan (F)



On 2017/2/10 10:25, Wangnan (F) wrote:



On 2017/2/10 7:21, Mickaël Salaün wrote:

Replace bpf_prog_load() with bpf_load_program() calls.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Shuah Khan 
---
  tools/lib/bpf/bpf.c |  6 +++---
  tools/lib/bpf/bpf.h |  4 ++--
  tools/testing/selftests/bpf/Makefile|  4 +++-
  tools/testing/selftests/bpf/bpf_sys.h   | 21 -
  tools/testing/selftests/bpf/test_tag.c  |  6 --
  tools/testing/selftests/bpf/test_verifier.c |  8 +---
  6 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3ddb58a36d3c..58ce252073fa 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,7 +42,7 @@
  # endif
  #endif
  -static __u64 ptr_to_u64(void *ptr)
+static __u64 ptr_to_u64(const void *ptr)
  {
  return (__u64) (unsigned long) ptr;
  }
@@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int 
key_size,

  return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
  }
  -int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
- size_t insns_cnt, char *license,
+int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn 
*insns,

+ size_t insns_cnt, const char *license,
   __u32 kern_version, char *log_buf, size_t log_buf_sz)
  {
  int fd;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2f9853dd882..bc959a2de023 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int 
key_size, int value_size,

/* Recommend log buffer size */
  #define BPF_LOG_BUF_SIZE 65536
-int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
- size_t insns_cnt, char *license,
+int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn 
*insns,

+ size_t insns_cnt, const char *license,
   __u32 kern_version, char *log_buf,
   size_t log_buf_sz);


For libbpf changes:


And for similar code in patches 5-8:

Acked-by: Wang Nan 

Thank you.


Acked-by: Wang Nan 

Thank you.







Re: [PATCH net-next v5 10/11] bpf: Remove bpf_sys.h from selftests

2017-02-09 Thread Wangnan (F)



On 2017/2/10 7:21, Mickaël Salaün wrote:

Add require dependency headers.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Shuah Khan 
---
  tools/lib/bpf/bpf.c |  6 ++
  tools/testing/selftests/bpf/bpf_sys.h   | 27 ---
  tools/testing/selftests/bpf/test_lpm_map.c  |  1 -
  tools/testing/selftests/bpf/test_lru_map.c  |  1 -
  tools/testing/selftests/bpf/test_maps.c |  1 -
  tools/testing/selftests/bpf/test_tag.c  |  3 +--
  tools/testing/selftests/bpf/test_verifier.c |  4 ++--
  7 files changed, 9 insertions(+), 34 deletions(-)
  delete mode 100644 tools/testing/selftests/bpf/bpf_sys.h

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index f8a2b7fa7741..50e04cc5 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -50,7 +50,13 @@ static __u64 ptr_to_u64(const void *ptr)
  static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
   unsigned int size)
  {
+#ifdef __NR_bpf
return syscall(__NR_bpf, cmd, attr, size);
+#else
+   fprintf(stderr, "No bpf syscall, kernel headers too old?\n");
+   errno = ENOSYS;
+   return -1;
+#endif
  }
  


We don't need to check __NR_bpf again. It has already
been checked at the top of this file:

#ifndef __NR_bpf
# if defined(__i386__)
#  define __NR_bpf 357
# elif defined(__x86_64__)
#  define __NR_bpf 321
# elif defined(__aarch64__)
#  define __NR_bpf 280
# else
#  error __NR_bpf not defined. libbpf does not support your arch.
# endif
#endif

Thank you.



Re: [PATCH net-next v5 04/11] bpf: Use bpf_load_program() from the library

2017-02-09 Thread Wangnan (F)



On 2017/2/10 7:21, Mickaël Salaün wrote:

Replace bpf_prog_load() with bpf_load_program() calls.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: Shuah Khan 
---
  tools/lib/bpf/bpf.c |  6 +++---
  tools/lib/bpf/bpf.h |  4 ++--
  tools/testing/selftests/bpf/Makefile|  4 +++-
  tools/testing/selftests/bpf/bpf_sys.h   | 21 -
  tools/testing/selftests/bpf/test_tag.c  |  6 --
  tools/testing/selftests/bpf/test_verifier.c |  8 +---
  6 files changed, 17 insertions(+), 32 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3ddb58a36d3c..58ce252073fa 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -42,7 +42,7 @@
  # endif
  #endif
  
-static __u64 ptr_to_u64(void *ptr)

+static __u64 ptr_to_u64(const void *ptr)
  {
return (__u64) (unsigned long) ptr;
  }
@@ -69,8 +69,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
  }
  
-int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,

-size_t insns_cnt, char *license,
+int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+size_t insns_cnt, const char *license,
 __u32 kern_version, char *log_buf, size_t log_buf_sz)
  {
int fd;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2f9853dd882..bc959a2de023 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -28,8 +28,8 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, 
int value_size,
  
  /* Recommend log buffer size */

  #define BPF_LOG_BUF_SIZE 65536
-int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
-size_t insns_cnt, char *license,
+int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
+size_t insns_cnt, const char *license,
 __u32 kern_version, char *log_buf,
 size_t log_buf_sz);
  


For libbpf changes:

Acked-by: Wang Nan 

Thank you.



Re: [PATCH v2 1/5] bpf: Add missing header to the library

2017-02-07 Thread Wangnan (F)

Please add me to the cc list of all 5 patches.

Thank you.

On 2017/2/7 4:40, Mickaël Salaün wrote:

Include stddef.h to define size_t.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
  tools/lib/bpf/bpf.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2f9853dd882..df6e186da788 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -22,6 +22,7 @@
  #define __BPF_BPF_H
  
  #include 

+#include 
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,

   int max_entries, __u32 map_flags);





Re: [PATCH v3 1/5] bpf: Add missing header to the library

2017-02-07 Thread Wangnan (F)



On 2017/2/8 4:56, Mickaël Salaün wrote:

Include stddef.h to define size_t.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
  tools/lib/bpf/bpf.h | 1 +
  1 file changed, 1 insertion(+)

diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index a2f9853dd882..df6e186da788 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -22,6 +22,7 @@
  #define __BPF_BPF_H
  
  #include 

+#include 
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,

   int max_entries, __u32 map_flags);

Looks good to me.

Thank you.



Re: [PATCH v3 2/5] bpf: Simplify bpf_load_program() error handling in the library

2017-02-07 Thread Wangnan (F)



On 2017/2/8 4:56, Mickaël Salaün wrote:

Do not call a second time bpf(2) when a program load failed.


BPF_PROG_LOAD should succeed most of the time. Setting log_level to
0 by default and requiring the log buffer only on failure can make it
faster in the normal case.

Thank you.


Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Daniel Borkmann 
Cc: Wang Nan 
---
  tools/lib/bpf/bpf.c | 18 ++
  1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3ddb58a36d3c..fda3f494f1cd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -73,7 +73,6 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn 
*insns,
 size_t insns_cnt, char *license,
 __u32 kern_version, char *log_buf, size_t log_buf_sz)
  {
-   int fd;
union bpf_attr attr;
  
  	bzero(&attr, sizeof(attr));

@@ -81,20 +80,15 @@ int bpf_load_program(enum bpf_prog_type type, struct 
bpf_insn *insns,
attr.insn_cnt = (__u32)insns_cnt;
attr.insns = ptr_to_u64(insns);
attr.license = ptr_to_u64(license);
-   attr.log_buf = ptr_to_u64(NULL);
-   attr.log_size = 0;
-   attr.log_level = 0;
+   attr.log_buf = ptr_to_u64(log_buf);
+   attr.log_size = log_buf_sz;
attr.kern_version = kern_version;
  
-	fd = sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));

-   if (fd >= 0 || !log_buf || !log_buf_sz)
-   return fd;
+   if (log_buf && log_buf_sz > 0) {
+   attr.log_level = 1;
+   log_buf[0] = 0;
+   }
  
-	/* Try again with log */

-   attr.log_buf = ptr_to_u64(log_buf);
-   attr.log_size = log_buf_sz;
-   attr.log_level = 1;
-   log_buf[0] = 0;
return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
  }
  





Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()

2017-01-24 Thread Wangnan (F)



On 2017/1/25 9:16, Joe Stringer wrote:

On 24 January 2017 at 17:06, Wangnan (F)  wrote:


On 2017/1/25 9:04, Wangnan (F) wrote:



On 2017/1/23 9:11, Joe Stringer wrote:

Add a new API to pin a BPF program to the filesystem. The user can
specify the full path within a BPF filesystem at which to pin the program.
Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2',
and so on.

Signed-off-by: Joe Stringer 
---
v2: Don't automount BPF filesystem
  Split program, map, object pinning into separate APIs and separate
  patches.
---


[SNIP]


+int bpf_program__pin(struct bpf_program *prog, const char *path)


In your next patch please let the caller select one instance:

int bpf_program__pin(struct bpf_program *prog, int instance, const char
*path)
(please choose a better name)

Then you can wrap it with another function that pins all instances and
implements the naming schema (%s_%d) there.

Then implement the naming schema in bpf_object__pin like:

%s/objectname_mapname
%s/objectname_progname_%d


Is it possible to use directory tree instead?

%s/object/mapname
%s/object/prog/instance

I don't think objects have names, so let's assume an object with two
program instances named foo, and one map named bar.

A call of bpf_object__pin(obj, "/sys/fs/bpf/myobj") would mount with
the following files and directories:
/sys/fs/bpf/myobj/foo/1
/sys/fs/bpf/myobj/foo/2
/sys/fs/bpf/myobj/bar

Alternatively, if you want to control exactly where you want the
progs/maps to be pinned, you can call eg
bpf_program__pin_instance(prog, "/sys/fs/bpf/wherever", 0) and that
instance will be mounted to /sys/fs/bpf/wherever, or alternatively
bpf_program__pin(prog, "/sys/fs/bpf/foo"), and you will end up with
/sys/fs/bpf/foo/{0,1}.

This looks pretty reasonable to me.


It looks good to me.

Thank you.



Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()

2017-01-24 Thread Wangnan (F)



On 2017/1/25 9:04, Wangnan (F) wrote:



On 2017/1/23 9:11, Joe Stringer wrote:

Add a new API to pin a BPF program to the filesystem. The user can
specify the full path within a BPF filesystem at which to pin the program.
Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2',
and so on.

Signed-off-by: Joe Stringer 
---
v2: Don't automount BPF filesystem
 Split program, map, object pinning into separate APIs and separate
 patches.
---


[SNIP]


+int bpf_program__pin(struct bpf_program *prog, const char *path)


In your next patch please let the caller select one instance:

int bpf_program__pin(struct bpf_program *prog, int instance, const
char *path)

(please choose a better name)

Then you can wrap it with another function that pins all instances and
implements the naming schema (%s_%d) there.

Then implement the naming schema in bpf_object__pin like:

%s/objectname_mapname
%s/objectname_progname_%d



Is it possible to use directory tree instead?

%s/object/mapname
%s/object/prog/instance

Thank you.



Re: [PATCHv2 perf/core 5/7] tools lib bpf: Add bpf_program__pin()

2017-01-24 Thread Wangnan (F)



On 2017/1/23 9:11, Joe Stringer wrote:

Add a new API to pin a BPF program to the filesystem. The user can
specify the full path within a BPF filesystem at which to pin the program.
Programs with multiple instances are pinned as 'foo', 'foo_1', 'foo_2',
and so on.

Signed-off-by: Joe Stringer 
---
v2: Don't automount BPF filesystem
 Split program, map, object pinning into separate APIs and separate
 patches.
---
  tools/lib/bpf/libbpf.c | 76 ++
  tools/lib/bpf/libbpf.h |  1 +
  2 files changed, 77 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e6cd62b1264b..eea5c74808f7 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4,6 +4,7 @@
   * Copyright (C) 2013-2015 Alexei Starovoitov 
   * Copyright (C) 2015 Wang Nan 
   * Copyright (C) 2015 Huawei Inc.
+ * Copyright (C) 2017 Nicira, Inc.
   *
   * This program is free software; you can redistribute it and/or
   * modify it under the terms of the GNU Lesser General Public
@@ -22,6 +23,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -31,7 +33,10 @@
  #include 
  #include 
  #include 
+#include 
  #include 
+#include 
+#include 
  #include 
  #include 
  
@@ -1237,6 +1242,77 @@ int bpf_object__load(struct bpf_object *obj)

return err;
  }
  
+static int check_path(const char *path)

+{
+   struct statfs st_fs;
+   char *dname, *dir;
+   int err = 0;
+
+   if (path == NULL)
+   return -EINVAL;
+
+   dname = strdup(path);
+   dir = dirname(dname);
+   if (statfs(dir, &st_fs)) {
+   pr_warning("failed to statfs %s: %s\n", dir, strerror(errno));
+   err = -errno;
+   }
+   free(dname);
+
+   if (!err && st_fs.f_type != BPF_FS_MAGIC) {
+   pr_warning("specified path %s is not on BPF FS\n", path);
+   err = -EINVAL;
+   }
+
+   return err;
+}
+
+int bpf_program__pin(struct bpf_program *prog, const char *path)


In your next patch please let the caller select one instance:

int bpf_program__pin(struct bpf_program *prog, int instance, const char
*path)

(please choose a better name)

Then you can wrap it with another function that pins all instances and
implements the naming schema (%s_%d) there.

Then implement the naming schema in bpf_object__pin like:

%s/objectname_mapname
%s/objectname_progname_%d

and make bpf_{map,program}__pin the raw pinning functions with no
naming schema.


Thank you.



Re: [PATCH perf/core 0/6] Libbpf improvements

2017-01-19 Thread Wangnan (F)



On 2017/1/19 7:57, Joe Stringer wrote:

Patch 1 fixes an issue when using drastically different BPF map definitions
inside ELFs from a client using libbpf, vs the map definition libbpf uses.

Patch 2 is a trivial typo fix.

Patches 3-5 add some simple, useful helper functions for setting prog type
and retrieving libbpf errors without depending on kernel headers from
userspace programs.

Patch 6 adds new pinning functionality for an entire object. Calling
bpf_object__pin(obj, subpath) will pin all programs from the BPF object
to $bpf_fs_path/$subpath/progs/$progname, and all maps from the BPF object
to $bpf_fs_path/$subpath/maps/$mapname. The first program with a particular
name will be pinned with its progname and the suffix "_0"; subsequent
programs with the same name will have "_1", and so on; duplicate maps with
the same name are disallowed.

Joe Stringer (6):
   tools lib bpf: Fix map offsets in relocation
   tools lib bpf: Fix grammar in map_idx warning
   tools lib bpf: Define prog_type fns with macro
   tools lib bpf: Add set/is helpers for all prog types
   tools lib bpf: Add libbpf_get_error()
   tools lib bpf: Add bpf_object__pin()

  tools/lib/bpf/libbpf.c  | 197 +---
  tools/lib/bpf/libbpf.h  |  15 +++-
  tools/perf/tests/llvm.c |   2 +-
  3 files changed, 185 insertions(+), 29 deletions(-)


For patches 3, 4 and 5:

Acked-by: Wang Nan 

See my comments on patches 1, 2 and 6.

Thank you.



Re: [PATCH perf/core 6/6] tools lib bpf: Add bpf_object__pin()

2017-01-19 Thread Wangnan (F)



On 2017/1/19 7:57, Joe Stringer wrote:

Add a new API to pin a BPF object to the filesystem. The user can
specify a subdirectory under the BPF filesystem to pin these programs.

For example, with the subdirectory 'foo', programs and maps are pinned:
/sys/fs/bpf/foo/progs/PROG_NAME
/sys/fs/bpf/foo/maps/MAP_NAME

If the user has specified an alternative BPF filesystem mountpoint via
/proc/mounts, that will be read and used instead.

Signed-off-by: Joe Stringer 
---
  tools/lib/bpf/libbpf.c | 136 +
  tools/lib/bpf/libbpf.h |   1 +
  2 files changed, 137 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6b651c19870d..181dca0bdacb 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4,6 +4,7 @@
   * Copyright (C) 2013-2015 Alexei Starovoitov 
   * Copyright (C) 2015 Wang Nan 
   * Copyright (C) 2015 Huawei Inc.
+ * Copyright (C) 2016 Nicira, Inc.
   *
   * This program is free software; you can redistribute it and/or
   * modify it under the terms of the GNU Lesser General Public
@@ -31,7 +32,12 @@
  #include 
  #include 
  #include 
+#include 
  #include 
+#include 
+#include 
+#include 
+#include 
  #include 
  #include 
  
@@ -1231,6 +1237,136 @@ int bpf_object__load(struct bpf_object *obj)

return err;
  }
  
+#define stringize(x) #x

+
+static int mount_bpf(char *path)


We can use tools/lib/api/fs/fs.c to help us mount filesystems.
Try not to reinvent it.


+{
+   const char *mounts_fmt = "%*s %"stringize(PATH_MAX)"s 
%#"stringize(NAME_MAX)"s %*s\n";
+   struct statfs st_fs;
+   char type[NAME_MAX];
+   int err = 0;
+   FILE *fp;
+
+   /* Populate 'path'. */
+   fp = fopen("/proc/mounts", "r");
+   if (fp) {
+   while (fscanf(fp, mounts_fmt, path, type) == 2) {
+   if (!strcmp(type, "bpf"))
+   break;
+   }
+   if (fclose(fp)) {
+   err = -errno;
+   pr_warning("failed to close /proc/mounts: %s\n",
+  strerror(errno));
+   }
+   if (strcmp(type, "bpf")) {
+   err = -errno;
+   pr_debug("failed to find bpf mount\n");
+   }
+   } else {
+   err = -errno;
+   pr_warning("cannot open /proc/mounts: %s\n", strerror(errno));
+   }
+   if (err) {
+   pr_debug("using /sys/fs/bpf for BPF filesystem mountpoint\n");
+   strcpy(path, "/sys/fs/bpf");
+   }
+
+   if (!statfs(path, &st_fs) && st_fs.f_type == BPF_FS_MAGIC)
+   return 0;
+
+   if (mount("bpf", path, "bpf", 0, NULL)) {
+   pr_warning("failed to mount bpf: %s\n", strerror(errno));
+   return -errno;
+   }
+
+   return 0;
+}
+
+static int make_dirs(const char *path, const char *subdir, char *buf,
+size_t len)
+{
+   snprintf(buf, len, "%s/%s/", path, subdir);
+   if (mkdir(buf, 0700) && errno != EEXIST) {
+   pr_warning("failed to mkdir %s: %s\n", subdir, strerror(errno));
+   return -errno;
+   }
+
+   snprintf(buf, len, "%s/%s/maps/", path, subdir);
+   if (mkdir(buf, 0700) && errno != EEXIST) {
+   pr_warning("failed to mkdir map: %s\n", strerror(errno));
+   return -errno;
+   }
+
+   snprintf(buf, len, "%s/%s/progs/", path, subdir);
+   if (mkdir(buf, 0700) && errno != EEXIST) {
+   pr_warning("failed to mkdir prog: %s\n", strerror(errno));
+   return -errno;
+   }
+
+   return 0;
+}
+
+int bpf_object__pin(struct bpf_object *obj, const char *subdir)
+{


Must we pin a whole object? I suggest we provide
bpf_program__pin and bpf_map__pin, and build bpf_object__pin
on top of them.

Thank you.


+   const size_t len = 255;
+   char path[PATH_MAX];
+   char buf[len];
+   size_t i, j;
+   int err = 0;
+
+   if (!obj)
+   return -ENOENT;
+
+   if (!obj->loaded) {
+   pr_warning("object not yet loaded; load it first\n");
+   return -ENOENT;
+   }
+
+   err = mount_bpf(path);
+   if (err)
+   return err;
+
+   err = make_dirs(path, subdir, buf, len);
+   if (err)
+   return err;
+
+   for (i = 0; i < obj->nr_maps; i++) {
+   struct bpf_map *map = &obj->maps[i];
+
+   snprintf(buf, len, "%s/%s/maps/%s", path, subdir,
+bpf_map__name(map));
+   if (bpf_obj_pin(map->fd, buf)) {
+   err = -errno;
+   pr_warning("failed to pin map: %s\n", strerror(errno));
+   goto out;
+   }
+   pr_debug("Loaded map '%s' at '%s'\n", bpf_map__name(map), buf);
+   }
+
+   if (obj->programs && obj->nr_programs) {
+

Re: [PATCH perf/core 2/6] tools lib bpf: Fix grammar in map_idx warning

2017-01-19 Thread Wangnan (F)



On 2017/1/19 7:57, Joe Stringer wrote:

"Larger" is the comparative adjective form of "large".

Signed-off-by: Joe Stringer 
---
  tools/lib/bpf/libbpf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 350ee4c59f85..653b1b368deb 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -834,7 +834,7 @@ bpf_program__collect_reloc(struct bpf_program *prog,
  
  		map_idx = sym.st_value / prog->obj->map_len;

if (map_idx >= nr_maps) {
-   pr_warning("bpf relocation: map_idx %d large than %d\n",
+   pr_warning("bpf relocation: map_idx %d larger than 
%d\n",


With my change to 1/6, this message becomes meaningless. We can
change it to:

bpf relocation: failed to relocate instruction %u

Thank you.


   (int)map_idx, (int)nr_maps - 1);
return -LIBBPF_ERRNO__RELOC;
}





Re: [PATCH perf/core 1/6] tools lib bpf: Fix map offsets in relocation

2017-01-18 Thread Wangnan (F)



On 2017/1/19 7:57, Joe Stringer wrote:

Commit 4708bbda5cb2 ("tools lib bpf: Fix maps resolution") attempted to
fix map resolution by identifying the number of symbols that point to
maps, and using this number to resolve each of the maps.

However, during relocation the original definition of the map size was
still in use.



Oops... Didn't realize that.



  For up to two maps, the calculation was correct if there
was a small difference in size between the map definition in libbpf and
the one that the client library uses. However if the difference was
large, particularly if more than two maps were used in the BPF program,
the relocation would fail.

For example, when using a map definition with size 28, with three maps,
map relocation would count
 (sym_offset / sizeof(struct bpf_map_def) => map_idx)
 (0 / 16 => 0), ie map_idx = 0
 (28 / 16 => 1), ie map_idx = 1
 (56 / 16 => 3), ie map_idx = 3

So, libbpf reports:
 libbpf: bpf relocation: map_idx 3 large than 2


Understand.


Fix map relocation by tracking the size of the map definition during
map counting, then reuse this instead of the size of the libbpf map
structure.


This adds a restriction that the map size is fixed within
one object, but we don't have to do that. During relocation,
we can get exact map positioning information from
sym.st_value. I think finding the map with the exact offset
in the map array would be better.

I'll write a patch to replace this one.

Thank you.


  With this patch applied, libbpf will accept such ELFs.

Fixes: 4708bbda5cb2 ("tools lib bpf: Fix maps resolution")
Signed-off-by: Joe Stringer 
---
  tools/lib/bpf/libbpf.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 84e6b35da4bd..350ee4c59f85 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -201,6 +201,7 @@ struct bpf_object {
size_t nr_programs;
struct bpf_map *maps;
size_t nr_maps;
+   size_t map_len;
  
  	bool loaded;
  
@@ -379,6 +380,7 @@ static struct bpf_object *bpf_object__new(const char *path,

obj->efile.maps_shndx = -1;
  
  	obj->loaded = false;

+   obj->map_len = sizeof(struct bpf_map_def);
  
  	INIT_LIST_HEAD(&obj->list);

list_add(&obj->list, &bpf_objects_list);
@@ -602,6 +604,7 @@ bpf_object__init_maps(struct bpf_object *obj)
return -ENOMEM;
}
obj->nr_maps = nr_maps;
+   obj->map_len = data->d_size / nr_maps;
  
  	/*

 * fill all fd with -1 so won't close incorrect
@@ -829,7 +832,7 @@ bpf_program__collect_reloc(struct bpf_program *prog,
return -LIBBPF_ERRNO__RELOC;
}
  
-		map_idx = sym.st_value / sizeof(struct bpf_map_def);

+   map_idx = sym.st_value / prog->obj->map_len;
if (map_idx >= nr_maps) {
pr_warning("bpf relocation: map_idx %d large than %d\n",
   (int)map_idx, (int)nr_maps - 1);





Re: [PATCH 0/7] perf tools: Add switch-output size and time threshold options

2017-01-03 Thread Wangnan (F)



On 2017/1/3 16:19, Jiri Olsa wrote:

hi,
adding a way to configure switch data output
for size and time, like:

   $ sudo perf record -e 'sched:*' --switch-output=10M -avg
   callchain: type FP
   switch-output with 10M size threshold
   mmap size 528384B
   [ perf record: dump data: Woken up 37 times ]
   [ perf record: Dump perf.data.2017010309135512 ]
   [ perf record: dump data: Woken up 39 times ]
   [ perf record: Dump perf.data.2017010309135771 ]
   [ perf record: dump data: Woken up 38 times ]
   [ perf record: Dump perf.data.2017010309140005 ]
   ^C[ perf record: Woken up 16 times to write data ]
   [ perf record: Dump perf.data.2017010309140111 ]
   [ perf record: Captured and wrote 4.748 MB perf.data. ]
   ...

the default for switch-output option stays
and does the SIGUSR2 output switch

Also available in:
   git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
   perf/fixes

thanks,
jirka


Cc: Wang Nan 
---


Good features, and thank you for fixing the documentation.

You didn't CC patches 1 and 2 to me.

Tested-by: Wang Nan 



Jiri Olsa (7):
   tools lib subcmd: Add OPT_STRING_OPTARG_SET option
   perf record: Make __record_options static
   perf record: Fix --switch-output documentation and comment
   perf record: Add struct switch_output
   perf record: Change switch-output option to take optional argument
   perf record: Add switch-output size option argument
   perf record: Add switch-output time option argument

  tools/lib/subcmd/parse-options.c |   3 ++
  tools/lib/subcmd/parse-options.h |   5 
  tools/perf/Documentation/perf-record.txt |  13 +++--
  tools/perf/builtin-record.c  | 134 --
  4 files changed, 139 insertions(+), 16 deletions(-)





coresight: Problem caused by resetting enable_sink

2016-12-26 Thread Wangnan (F)

Hi Mathieu,

I am hitting problems caused by your commit d52c9750f150 ('coresight:
reset "enable_sink" flag when need be'), not only the one I
posted in the previous patch.

My use case is a simple 'perf record -e cs_etm// ls'. It worked
properly before this commit, and fails when allocating the aux buffer
after it. I can't fully understand your code, but the problem I am
seeing seems to be caused by inappropriately resetting the sink.

My device is connected like this (use two etfs):

Core0--+
Core1--+-- funnel0 --> etf0
Core2--|
Core3--+

Core0--+
Core1--+-- funnel1 --> etf1
Core2--|
Core3--+

Before running perf, the two etfs are activated using the sysfs
enable_sink interface.

During etm_setup_aux, coresight_get_enabled_sink() finds
etf0 for core0, and automatically deactivates it.

For core1, coresight_get_enabled_sink() returns etf1.
However, etf1 is not on the link of core1, so the subsequent
coresight_build_path() fails.

I guess your commit is based on the assumption that all
sinks are in the path of each core, like this:


Core0--+
Core1--+-- funnel0 --> etf0 ++
Core2--| |+--- etr
Core3--+ ||
 + replicator +
Core0--+ ||
Core1--+-- funnel1 --> etf1 +++--- etb
Core2--|
Core3--+

But that is not true, at least for some HiSilicon boards.

I have to revert your patch to make CoreSight work on my
board. Please reconsider this patch, or give some suggestions
if you think I have misunderstood it.

Thank you.




Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf

2016-12-08 Thread Wangnan (F)



On 2016/12/9 13:04, Wangnan (F) wrote:



On 2016/12/9 10:46, Joe Stringer wrote:

[SNIP]


  diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 62d89d50fcbd..616bd55f3be8 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -149,6 +149,8 @@ CMD_TARGETS = $(LIB_FILE)
TARGETS = $(CMD_TARGETS)
  +libbpf: all
+


Why do we need this? I tested this patch without it and it seems to work;
moreover, this line causes an extra error:
 $ pwd
 /home/wn/kernel/tools/lib/bpf
 $ make libbpf
 ...
 gcc -g -Wall -DHAVE_LIBELF_MMAP_SUPPORT -DHAVE_ELF_GETPHDRNUM_SUPPORT 
-Wbad-function-cast -Wdeclaration-after-statement -Wformat-security 
-Wformat-y2k -Winit-self -Wmissing-declarations -Wmissing-prototypes 
-Wnested-externs -Wno-system-headers -Wold-style-definition -Wpacked 
-Wredundant-decls -Wshadow -Wstrict-aliasing=3 -Wstrict-prototypes 
-Wswitch-default -Wswitch-enum -Wundef -Wwrite-strings -Wformat 
-Werror -Wall -fPIC -I. -I/home/wn/kernel-hydrogen/tools/include 
-I/home/wn/kernel-hydrogen/tools/arch/x86/include/uapi 
-I/home/wn/kernel-hydrogen/tools/include/uapi libbpf.c all -o libbpf

 gcc: error: all: No such file or directory
 make: *** [libbpf] Error 1

Thank you.


It is not 'caused' by your patch. 'make libbpf' fails without
your change because it tries to build an executable from
libbpf.c, but main() is missing.

I think libbpf should never be used as a make target. Your
new dependency looks strange.

Thank you.



Re: [PATCHv3 perf/core 5/7] samples/bpf: Switch over to libbpf

2016-12-08 Thread Wangnan (F)



On 2016/12/9 10:46, Joe Stringer wrote:

Now that libbpf under tools/lib/bpf/* is synced with the version from
samples/bpf, we can get rid most of the libbpf library here.

Signed-off-by: Joe Stringer 
---
v3: First post.
---
  samples/bpf/Makefile   |  60 +-
  samples/bpf/README.rst |   4 +-
  samples/bpf/libbpf.c   | 111 -
  samples/bpf/libbpf.h   |  19 +
  tools/lib/bpf/Makefile |   2 +
  5 files changed, 38 insertions(+), 158 deletions(-)

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 72c58675973e..c8f7ed37b2de 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -29,35 +29,38 @@ hostprogs-y += trace_event
  hostprogs-y += sampleip
  hostprogs-y += tc_l2_redirect
  
-test_verifier-objs := test_verifier.o libbpf.o

-test_maps-objs := test_maps.o libbpf.o
-sock_example-objs := sock_example.o libbpf.o
-fds_example-objs := bpf_load.o libbpf.o fds_example.o
-sockex1-objs := bpf_load.o libbpf.o sockex1_user.o
-sockex2-objs := bpf_load.o libbpf.o sockex2_user.o
-sockex3-objs := bpf_load.o libbpf.o sockex3_user.o
-tracex1-objs := bpf_load.o libbpf.o tracex1_user.o
-tracex2-objs := bpf_load.o libbpf.o tracex2_user.o
-tracex3-objs := bpf_load.o libbpf.o tracex3_user.o
-tracex4-objs := bpf_load.o libbpf.o tracex4_user.o
-tracex5-objs := bpf_load.o libbpf.o tracex5_user.o
-tracex6-objs := bpf_load.o libbpf.o tracex6_user.o
-test_probe_write_user-objs := bpf_load.o libbpf.o test_probe_write_user_user.o
-trace_output-objs := bpf_load.o libbpf.o trace_output_user.o
-lathist-objs := bpf_load.o libbpf.o lathist_user.o
-offwaketime-objs := bpf_load.o libbpf.o offwaketime_user.o
-spintest-objs := bpf_load.o libbpf.o spintest_user.o
-map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
-test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
-test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
-xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
+# Libbpf dependencies
+LIBBPF := libbpf.o ../../tools/lib/bpf/bpf.o
+
+test_verifier-objs := test_verifier.o $(LIBBPF)
+test_maps-objs := test_maps.o $(LIBBPF)
+sock_example-objs := sock_example.o $(LIBBPF)
+fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
+sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
+sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
+sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
+tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
+tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
+tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
+tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
+tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
+tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
+trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
+lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
+offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o
+spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o
+map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
+test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
+test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
+xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o
  # reuse xdp1 source intentionally
-xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
-test_current_task_under_cgroup-objs := bpf_load.o libbpf.o \
+xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
+test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) \
   test_current_task_under_cgroup_user.o
-trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
-sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
-tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
+trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o
+sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o
+tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
  
  # Tell kbuild to always build the programs

  always := $(hostprogs-y)
@@ -89,7 +92,7 @@ always += test_current_task_under_cgroup_kern.o
  always += trace_event_kern.o
  always += sampleip_kern.o
  
-HOSTCFLAGS += -I$(objtree)/usr/include

+HOSTCFLAGS += -I$(objtree)/usr/include -I$(objtree)/tools/lib/
  


You should use srctree like this:

+HOSTCFLAGS += -I$(objtree)/usr/include -I$(srctree)/tools/lib/

Otherwise you will see the following failure when doing an out-of-tree build:

 $ mkdir buildkernel
 $ cd buildkernel
 $ make -C ../ O=`pwd` menuconfig
 $ make -j64
 $ make samples/bpf/
  CHK include/config/kernel.release
  Using .. as source for kernel
  GEN ./Makefile
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CHK include/generated/bounds.h
  CHK include/generated/timeconst.h
  CHK include/generated/asm-offsets.h
  CALL../scripts/checksyscalls.sh
  HOSTCC  samples/bpf/test_verifier.o
 In file included from ../samples/bpf/test_verifier.c:20:0:
 .

Re: [PATCHv3 perf/core 3/7] tools lib bpf: Add flags to bpf_create_map()

2016-12-08 Thread Wangnan (F)



On 2016/12/9 10:46, Joe Stringer wrote:

The map_flags argument to bpf_create_map() was previously not exposed.
By exposing it, users can access flags such as whether or not to
preallocate the map.

Signed-off-by: Joe Stringer 


Please mention commit 6c90598174322b029e40dd84a4eb01f56afe in the
commit message:

Commit 6c905981743 ("bpf: pre-allocate hash map elements") introduces
map_flags to bpf_attr for BPF_MAP_CREATE command. Expose this new
parameter in libbpf.

Acked-by: Wang Nan 


---
v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf".
---
  tools/lib/bpf/bpf.c| 3 ++-
  tools/lib/bpf/bpf.h| 2 +-
  tools/lib/bpf/libbpf.c | 3 ++-
  3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 89e8e8e5b60e..d0afb26c2e0f 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -54,7 +54,7 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
  }
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size,

-  int value_size, int max_entries)
+  int value_size, int max_entries, __u32 map_flags)
  {
union bpf_attr attr;
  
@@ -64,6 +64,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,

attr.key_size = key_size;
attr.value_size = value_size;
attr.max_entries = max_entries;
+   attr.map_flags = map_flags;
  
  	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));

  }
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 61130170a6ad..7fcdce16fd62 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -24,7 +24,7 @@
  #include 
  
  int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,

-  int max_entries);
+  int max_entries, __u32 map_flags);
  
  /* Recommend log buffer size */

  #define BPF_LOG_BUF_SIZE 65536
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2e974593f3e8..84e6b35da4bd 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -854,7 +854,8 @@ bpf_object__create_maps(struct bpf_object *obj)
*pfd = bpf_create_map(def->type,
  def->key_size,
  def->value_size,
- def->max_entries);
+ def->max_entries,
+ 0);
if (*pfd < 0) {
size_t j;
int err = *pfd;





Re: [PATCHv3 perf/core 2/7] tools lib bpf: use __u32 from linux/types.h

2016-12-08 Thread Wangnan (F)



On 2016/12/9 10:46, Joe Stringer wrote:

Fixes the following issue when building without access to 'u32' type:

./tools/lib/bpf/bpf.h:27:23: error: unknown type name ‘u32’

Signed-off-by: Joe Stringer 
---
v3: Split from "tools lib bpf: Sync with samples/bpf/libbpf"
---
  tools/lib/bpf/bpf.c | 4 ++--
  tools/lib/bpf/bpf.h | 4 ++--
  2 files changed, 4 insertions(+), 4 deletions(-)


Acked-by: Wang Nan 


diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 8143536b462a..89e8e8e5b60e 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -70,7 +70,7 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size,
  
  int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,

 size_t insns_cnt, char *license,
-u32 kern_version, char *log_buf, size_t log_buf_sz)
+__u32 kern_version, char *log_buf, size_t log_buf_sz)
  {
int fd;
union bpf_attr attr;
@@ -98,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
  }
  
  int bpf_map_update_elem(int fd, void *key, void *value,

-   u64 flags)
+   __u64 flags)
  {
union bpf_attr attr;
  
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h

index 253c3dbb06b4..61130170a6ad 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -30,11 +30,11 @@ int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
  #define BPF_LOG_BUF_SIZE 65536
  int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
 size_t insns_cnt, char *license,
-u32 kern_version, char *log_buf,
+__u32 kern_version, char *log_buf,
 size_t log_buf_sz);
  
  int bpf_map_update_elem(int fd, void *key, void *value,

-   u64 flags);
+   __u64 flags);
  
  int bpf_map_lookup_elem(int fd, void *key, void *value);

  int bpf_map_delete_elem(int fd, void *key);





Re: [PATCH v4 01/18] perf build: Check LLVM version in feature check

2016-12-05 Thread Wangnan (F)



On 2016/12/6 15:13, Wang Nan wrote:

Cancel builtin llvm and clang support when LLVM version is
less than 3.9.0: following commits uses newer API.

Since Clang/LLVM's API is not guaranteed to be stable,
add a test-llvm-version.cpp feature checker, issue warning if
LLVM found in compiling environment is not tested yet.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: He Kuang 
Cc: Jiri Olsa 
Cc: Joe Stringer 
Cc: Zefan Li 
Cc: pi3or...@163.com
---
  tools/build/feature/Makefile  |  8 ++--
  tools/build/feature/test-llvm-version.cpp | 12 
  tools/build/feature/test-llvm.cpp |  5 +
  tools/perf/Makefile.config|  8 ++--
  4 files changed, 29 insertions(+), 4 deletions(-)
  create mode 100644 tools/build/feature/test-llvm-version.cpp


[SNIP]


diff --git a/tools/build/feature/test-llvm-version.cpp b/tools/build/feature/test-llvm-version.cpp
new file mode 100644
index 000..e86b642
--- /dev/null
+++ b/tools/build/feature/test-llvm-version.cpp
@@ -0,0 +1,12 @@
+#include 
+#include "llvm/Config/llvm-config.h"
+
+#define NUM_VERSION (((LLVM_VERSION_MAJOR) << 16) + (LLVM_VERSION_MINOR << 8) + LLVM_VERSION_PATCH)
+#define pass int main() {printf("%x\n", NUM_VERSION); return 0;}
+
+#if NUM_VERSION >= 0x030900
+pass
+#else
+# error This LLVM is not tested yet.
+#endif
+

Sorry for this blank line. Will resend.



Re: [PATCH v3 10/30] perf clang: Add builtin clang support ant test case

2016-12-05 Thread Wangnan (F)



On 2016/12/6 5:48, Arnaldo Carvalho de Melo wrote:

Em Mon, Dec 05, 2016 at 07:02:48PM -0200, Arnaldo Carvalho de Melo escreveu:

Em Mon, Dec 05, 2016 at 08:51:01AM -0800, Alexei Starovoitov escreveu:

yeah. it's kinda high. I'm guessing rpm llvm libs are in debug mode.
Try llvm-config --build-mode --assertion-mode
it should be Release OFF

Probably this was with 3.9 and built from git, quite a while ago, now I
removed it from /usr/local/ and installed what is in f25, but I fear it
will be insufficient, does 3.8 cuts it for what we're testing? Humm, it
looks like it will:

[root@jouet ~]# llc --version
LLVM (http://llvm.org/):
   LLVM version 3.8.0

But I'm now running the container based tests to send a pull req, will
check later, after that.

Not really, Wang, we need to update that feature detection test to state what
is the minimum required LLVM/clang version, one that has those functions,
which, unfortunately, isn't the one in the latest fedora, fedora 25:


I'll set the minimum required LLVM version to 3.9, and report a
warning when LLVM is too old. However, since the LLVM interface
keeps changing, we will eventually have a problem if we want to
support 2 or 3 different clang/LLVM versions. We should keep
moving the minimum required LLVM version forward if we don't
want to see '#ifdef's spread through our code.



Re: [PATCH v3 10/30] perf clang: Add builtin clang support ant test case

2016-12-04 Thread Wangnan (F)



On 2016/12/5 10:36, Wangnan (F) wrote:



On 2016/12/2 23:44, Arnaldo Carvalho de Melo wrote:

Em Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan escreveu:

Add basic clang support in clang.cpp and test__clang() testcase. The
first testcase checks if builtin clang is able to generate LLVM IR.

tests/clang.c is a proxy. Real testcase resides in
utils/c++/clang-test.cpp in c++ and exports C interface to perf test
subsystem.

Test result:

$ perf test -v clang
51: Test builtin clang support   :
51.1: Test builtin clang compile C source to IR  :
--- start ---
test child forked, pid 13215
test child finished with 0
 end 
Test builtin clang support subtest 0: Ok

While testing this I noticed that the perf binary got huge, can't this
be done in some other way, i.e. using dynamic library?


I intentionally use static linking because it is good for
smartphones: we can simply 'adb push' a statically linked perf
to Android.

The resulting ELF executable would be even larger if LLVM were
built with the default settings.

In my setup the resulting 'perf' is less than 60MB:

 $ ls -s  ~/perf -h
 58M /home/wn/perf

 $ size  ~/perf
     text    data      bss      dec     hex filename
 56931273 2950808 24108632 83990713 50198b9 /home/wn/perf

It is reasonable for me.

I think using dynamic clang and llvm libraries is possible, but I
have never tried it before. It depends on how LLVM is compiled. I
think if the distro provides shared libraries then perf can utilize
them automatically. Let me check it today.



Good news: it works.

To enable llvm and clang dynamic libraries, we need to build llvm and
clang with the cmake option -DBUILD_SHARED_LIBS=ON, then apply a
trivial patch to perf:

diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
index dfb20dd..e3054f3 100644
--- a/tools/perf/Makefile.perf
+++ b/tools/perf/Makefile.perf
@@ -346,7 +346,7 @@ LIBS = -Wl,--whole-archive $(PERFLIBS) -Wl,--no-whole-archive -Wl,--start-group


 ifeq ($(USE_CLANG), 1)
   CLANGLIBS_LIST = AST Basic CodeGen Driver Frontend Lex Tooling Edit Sema Analysis Parse Serialization
-  LIBCLANG = $(foreach l,$(CLANGLIBS_LIST),$(wildcard $(shell $(LLVM_CONFIG) --libdir)/libclang$(l).a))
+  LIBCLANG = $(foreach l,$(CLANGLIBS_LIST),$(wildcard $(shell $(LLVM_CONFIG) --libdir)/libclang$(l).so))

   LIBS += -Wl,--start-group $(LIBCLANG) -Wl,--end-group
 endif

(replace '.a' to '.so')

Resulting perf executable:

$ ls -s -h ~/perf
26M /home/wn/perf

$ size ~/perf
    text    data      bss      dec     hex filename
 4274339  755032 23959984 28989355 1ba57ab /home/wn/perf

$ strip ~/perf
$ ls -s -h
4.9M /home/wn/perf

$ ldd ~/perf | grep 'LLVM\|clang'
	libclangBasic.so => /tmp/oxygen_root/usr/lib64/libclangBasic.so (0x7f24da49f000)
	libclangCodeGen.so => /tmp/oxygen_root/usr/lib64/libclangCodeGen.so (0x7f24d9f72000)
	libclangFrontend.so => /tmp/oxygen_root/usr/lib64/libclangFrontend.so (0x7f24d9c64000)
	libclangTooling.so => /tmp/oxygen_root/usr/lib64/libclangTooling.so (0x7f24d9a01000)
	libLLVMBPFCodeGen.so => /tmp/oxygen_root/usr/lib64/libLLVMBPFCodeGen.so (0x7f24d97e2000)
	libLLVMBPFDesc.so => /tmp/oxygen_root/usr/lib64/libLLVMBPFDesc.so (0x7f24d95da000)



As I said, if the distro provides dynamic libraries then everything
will be fine.

If you are okay with the above conclusion, in the next version I'll use
'-lclangBasic'-style linking options so it works for both dynamic and
static LLVM/clang, then let's wait for the distros to act.

Thank you.



Re: [PATCH v3 10/30] perf clang: Add builtin clang support ant test case

2016-12-04 Thread Wangnan (F)



On 2016/12/2 23:44, Arnaldo Carvalho de Melo wrote:

Em Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan escreveu:

Add basic clang support in clang.cpp and test__clang() testcase. The
first testcase checks if builtin clang is able to generate LLVM IR.

tests/clang.c is a proxy. Real testcase resides in
utils/c++/clang-test.cpp in c++ and exports C interface to perf test
subsystem.

Test result:

$ perf test -v clang
51: Test builtin clang support   :
51.1: Test builtin clang compile C source to IR  :
--- start ---
test child forked, pid 13215
test child finished with 0
 end 
Test builtin clang support subtest 0: Ok

While testing this I noticed that the perf binary got huge, can't this
be done in some other way, i.e. using dynamic library?


I intentionally use static linking because it is good for
smartphones: we can simply 'adb push' a statically linked perf
to Android.

The resulting ELF executable would be even larger if LLVM were
built with the default settings.

In my setup the resulting 'perf' is less than 60MB:

 $ ls -s  ~/perf -h
 58M /home/wn/perf

 $ size  ~/perf
     text    data      bss      dec     hex filename
 56931273 2950808 24108632 83990713 50198b9 /home/wn/perf

It is reasonable for me.

I think using dynamic clang and llvm libraries is possible, but I
have never tried it before. It depends on how LLVM is compiled. I
think if the distro provides shared libraries then perf can utilize
them automatically. Let me check it today.

About the file size, I discussed it with Alexei and he taught me a
lot about it. Maybe he or his ioVisor ex-colleagues can provide some
new ideas?

Thank you.



Re: [PATCH v3 14/30] perf clang: Support compile IR to BPF object and add testcase

2016-11-28 Thread Wangnan (F)



On 2016/11/28 14:32, Wangnan (F) wrote:



On 2016/11/27 1:25, Alexei Starovoitov wrote:

On Sat, Nov 26, 2016 at 07:03:38AM +, Wang Nan wrote:

getBPFObjectFromModule() is introduced to compile LLVM IR(Module)
to BPF object. Add new testcase for it.

Test result:
   $ ./buildperf/perf test -v clang
   51: Test builtin clang support   :
   51.1: Test builtin clang compile C source to IR  :
   --- start ---
   test child forked, pid 21822
   test child finished with 0
    end 
   Test builtin clang support subtest 0: Ok
   51.2: Test builtin clang compile C source to ELF object  :
   --- start ---
   test child forked, pid 21823
   test child finished with 0
    end 
   Test builtin clang support subtest 1: Ok

Signed-off-by: Wang Nan 

...

+legacy::PassManager PM;
+if (TargetMachine->addPassesToEmitFile(PM, ostream,
+   TargetMachine::CGFT_ObjectFile)) {
+llvm::errs() << "TargetMachine can't emit a file of this type\n";
+return std::unique_ptr>(nullptr);;
+}
+PM.run(*Module);

I'm pretty sure you want to add FunctionInlingPass as well otherwise
I think llvm won't be doing much inlining and only very very simple
programs will compile fine. See what we did on bcc side.


Thank you for the information. I thought inlining should be done during
the C-to-IR phase, and we already use -O2 for it. Let me check it.



I did a simple test. It seems that even without FunctionInliningPass,
clang/llvm can inline static functions without problems. For example, in
the sample code in the cover letter, extract a static function like this:

  static void inc_counter(u64 id)
  {
  u64 *counter;

  counter = bpf_map_lookup_elem(&syscall_counter, &id);
  if (!counter) {
  u64 value = 1;
  bpf_map_update_elem(&syscall_counter, &id, &value, 0);
  return;
  }
  __sync_fetch_and_add(counter, 1);
  return;
  }

Then enable llvm.dump-obj = true in ~/.perfconfig so we can see the
resulting ELF object.

The script worked correctly. readelf reports:

 $ readelf -a ./count_syscalls.o | grep inc_counter
 $

Inserting an output command into PerfModule::prepareBPF and
PerfModule::prepareJIT to print the names of functions, I can't see
inc_counter either.

Then I removed -O2 from the cflags in createCompilerInvocation. Result:

# ./perf record -e ./count_syscalls.c -a sleep 1
LLVM ERROR: Cannot select: t38: ch,glue = BPFISD::CALL t37, t31, 
Register:i64 %R1, Register:i64 %R2, t37:1

  t31: i64,ch = load t51, t58, undef:i64
t58: i64 = BPFISD::Wrapper TargetGlobalAddress:i64@bpf_map_lookup_elem> 0
  t57: i64 = TargetGlobalAddress@bpf_map_lookup_elem> 0

t5: i64 = undef
  t34: i64 = Register %R1
  t36: i64 = Register %R2
  t37: ch,glue = CopyToReg t35, Register:i64 %R2, FrameIndex:i64<5>, t35:1
t36: i64 = Register %R2
t8: i64 = FrameIndex<5>
t35: ch,glue = CopyToReg t33, Register:i64 %R1, t56
  t34: i64 = Register %R1
  t56: i64 = BPFISD::Wrapper 
TargetGlobalAddress:i64<%struct.bpf_map_def* @GVALS> 0

t55: i64 = TargetGlobalAddress<%struct.bpf_map_def* @GVALS> 0
In function: func

I don't know whether -O2 implies inlining.

In bcc, you not only use FunctionInlining, but also add AlwaysInlinerPass
and use populateModulePassManager to append other optimizations. I tried
to mimic your code, but it seems the perfhook functions are optimized out
by some optimization added by populateModulePassManager.

Although I'm not quite sure about this, I'll make the following change.
Please help me check it.

Thank you.

diff --git a/tools/perf/util/c++/clang.cpp b/tools/perf/util/c++/clang.cpp
index d05ab6f..d6d1959 100644
--- a/tools/perf/util/c++/clang.cpp
+++ b/tools/perf/util/c++/clang.cpp
@@ -22,6 +22,8 @@
 #include "llvm/Support/TargetSelect.h"
 #include "llvm/Target/TargetMachine.h"
 #include "llvm/Target/TargetOptions.h"
+#include "llvm-c/Transforms/IPO.h"
+#include "llvm/Transforms/IPO.h"
 #include 

 #include "clang.h"
@@ -133,6 +135,13 @@ getBPFObjectFromModule(llvm::Module *Module)
raw_svector_ostream ostream(*Buffer);

legacy::PassManager PM;
+
+PM.add(createFunctionInliningPass());
+/*
+ * LLVM is changing its interface. Use a stable workaround.
+ */
+ LLVMAddAlwaysInlinerPass(reinterpret_cast<LLVMPassManagerRef>(&PM));
+
 if (TargetMachine->addPassesToEmitFile(PM, ostream,
TargetMachine::CGFT_ObjectFile)) {
 llvm::errs() << "TargetMachine can't emit a file of this type\n";




Re: [PATCH v3 10/30] perf clang: Add builtin clang support ant test case

2016-11-27 Thread Wangnan (F)



On 2016/11/27 1:17, Alexei Starovoitov wrote:

On Sat, Nov 26, 2016 at 07:03:34AM +, Wang Nan wrote:

Add basic clang support in clang.cpp and test__clang() testcase. The
first testcase checks if builtin clang is able to generate LLVM IR.

tests/clang.c is a proxy. Real testcase resides in
utils/c++/clang-test.cpp in c++ and exports C interface to perf test
subsystem.

Test result:

$ perf test -v clang
51: Test builtin clang support   :
51.1: Test builtin clang compile C source to IR  :
--- start ---
test child forked, pid 13215
test child finished with 0
 end 
Test builtin clang support subtest 0: Ok

Signed-off-by: Wang Nan 

...

+static CompilerInvocation *
+createCompilerInvocation(StringRef& Path, DiagnosticsEngine& Diags)
+{
+   llvm::opt::ArgStringList CCArgs {
+   "-cc1",
+   "-triple", "bpf-pc-linux",
+   "-fsyntax-only",
+   "-ferror-limit", "19",
+   "-fmessage-length", "127",

why such limits?


+   "-O2",
+   "-nostdsysteminc",
+   "-nobuiltininc",
+   "-vectorize-loops",
+   "-vectorize-slp",


Thank you for pointing these out. These arguments were obtained by
analyzing the clang example:

https://llvm.org/svn/llvm-project/cfe/branches/ggreif/CallInst-operands/examples/clang-interpreter/main.cpp

The above example creates a C compiler using clang::driver::Driver (bcc
also uses the driver).

I formed the argument list according to the arglist the driver created
for its CI, and left the arguments I'm not quite sure about unchanged.


why above two flags are needed?


+   "-Wno-unused-value",
+   "-Wno-pointer-sign",

these two -Wno makes sense. please add the comment to explain the reasons.


They are inherited from samples/bpf/Makefile to suppress some warnings
when including kernel headers.

Thank you.




Re: [PATCH v3 20/30] perf clang jit: add PerfModule::doJIT to JIT perfhook functions

2016-11-27 Thread Wangnan (F)



On 2016/11/27 1:29, Alexei Starovoitov wrote:

On Sat, Nov 26, 2016 at 07:03:44AM +, Wang Nan wrote:

PerfModule::doJIT JIT compile perfhook functions and saves result into
a map. Add a test case for it.

At this stage perfhook functions can't do anything useful because they
can't invoke external functions and can't return a value. Following
commits are going to make improvements.

Don't hook functions right after jitted because bpf_object is unavailable
during jitting but it should be the context of jitted functions.

Signed-off-by: Wang Nan 

...

+   for (Function *F : JITFunctions) {
+   JITSymbol sym = CompileLayer.findSymbol(F->getName().str(), true);
+
+   /*
+* Type of F->getSection() is moving from
+* const char * to StringRef.
+* Convert it to std::string so we don't need
+* consider this API change.
+*/
+   std::string sec(F->getSection());
+   std::string hook(&sec.c_str()[sizeof("perfhook:") - 1]);
+   perf_hook_func_t func = (perf_hook_func_t)(intptr_t)sym.getAddress();
+
+   if (JITResult[hook])
+   llvm::errs() << "Warning: multiple functions on hook "
+<< hook << ", only one is used\n";
+   JITResult[hook] = func;
+   }
+   return 0;
+}
+
  class ClangOptions {
llvm::SmallString FileName;
llvm::SmallString<64> KVerDef;
@@ -292,6 +359,10 @@ void perf_clang__init(void)
LLVMInitializeBPFTarget();
LLVMInitializeBPFTargetMC();
LLVMInitializeBPFAsmPrinter();
+
+   llvm::InitializeNativeTarget();
+   llvm::InitializeNativeTargetAsmPrinter();
+   llvm::InitializeNativeTargetAsmParser();

Looks great. I bet a lot of people reading perf code
won't be able to understand what you're doing here.
Could you please add a design doc on how perf<->clang/llvm interaction
is done.


I'll put a perf-clang.txt in tools/perf/Documentation.

Thank you.


Acked-by: Alexei Starovoitov 






Re: [PATCH v3 14/30] perf clang: Support compile IR to BPF object and add testcase

2016-11-27 Thread Wangnan (F)



On 2016/11/27 1:25, Alexei Starovoitov wrote:

On Sat, Nov 26, 2016 at 07:03:38AM +, Wang Nan wrote:

getBPFObjectFromModule() is introduced to compile LLVM IR(Module)
to BPF object. Add new testcase for it.

Test result:
   $ ./buildperf/perf test -v clang
   51: Test builtin clang support   :
   51.1: Test builtin clang compile C source to IR  :
   --- start ---
   test child forked, pid 21822
   test child finished with 0
    end 
   Test builtin clang support subtest 0: Ok
   51.2: Test builtin clang compile C source to ELF object  :
   --- start ---
   test child forked, pid 21823
   test child finished with 0
    end 
   Test builtin clang support subtest 1: Ok

Signed-off-by: Wang Nan 

...

+   legacy::PassManager PM;
+   if (TargetMachine->addPassesToEmitFile(PM, ostream,
+  TargetMachine::CGFT_ObjectFile)) 
{
+   llvm::errs() << "TargetMachine can't emit a file of this 
type\n";
+   return std::unique_ptr>(nullptr);;
+   }
+   PM.run(*Module);

I'm pretty sure you want to add FunctionInlingPass as well otherwise
I think llvm won't be doing much inlining and only very very simple
programs will compile fine. See what we did on bcc side.


Thank you for the information. I thought inlining should be done during
the C-to-IR phase, and we already use -O2 for it. Let me check.


Also did you consider skipping elf generation and using in memory
instead ? That will improve compile/run time.



Maybe in future work? The current design reuses libelf and bpf-loader for
loading BPF code. Skipping ELF generation makes compiling faster, but
requires another loader to coexist with bpf-loader.c, or a new libbpf.
In my experience so far, outputting ELF is not slow on either servers
or smartphones.

Thank you.



Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf

2016-11-16 Thread Wangnan (F)



On 2016/11/17 10:46, Joe Stringer wrote:

On 16 November 2016 at 18:10, Wangnan (F)  wrote:

I'm also working on improving bpf.c. Please have a look at:

https://lkml.org/lkml/2016/11/14/1078

Since bpf.c is simple, I think we can add more functions and fixes
gradually, instead of a full copy.

See my inline comment below.

Ah, I missed this, my apologies. It looks like it will provide much of
what I need, I can reassess this patch with your series in mind.

One comment though for your patch (I don't have the original thread to
respond to unfortunately): The map_pin and map_get functions in your
patch series can be used to pin progs too, so maybe there is a better
name? You'll see that this patch uses bpf_obj_{pin,get}() - although I
wouldn't want those to be confused with the libbpf.c objects so maybe
there's a clearer name that could be used.

I also have some patches to rework the samples/bpf/* code to use
libbpf instead of the sample code that is there, is it worth me
submitting that? It will need to wait for your patch to go in, plus a
merge with davem's tree.


I'm happy to see you are trying to replace samples/bpf's own
libbpf with tools/lib/bpf. I think you can submit your libbpf work
and the samples patches together if they are ready.
Arnaldo can pick up the patches that look good to him, and you can
improve the others based on his newest branch.

Thank you.



Re: [PATCH 05/34] tools lib bpf: Add missing bpf map functions

2016-11-16 Thread Wangnan (F)



On 2016/11/15 12:05, Wang Nan wrote:

Add more BPF map operations to libbpf.

Signed-off-by: Wang Nan 
Cc: Alexei Starovoitov 
Cc: Arnaldo Carvalho de Melo 
Cc: Li Zefan 
---
  tools/lib/bpf/bpf.c | 56 +
  tools/lib/bpf/bpf.h |  7 +++
  2 files changed, 63 insertions(+)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed6..e966248 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -110,3 +110,59 @@ int bpf_map_update_elem(int fd, void *key, void *value,
  
  	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

  }
+
+int bpf_map_lookup_elem(int fd, void *key, void *value)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.map_fd = fd;
+   attr.key = ptr_to_u64(key);
+   attr.value = ptr_to_u64(value);
+
+   return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map_delete_elem(int fd, void *key)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.map_fd = fd;
+   attr.key = ptr_to_u64(key);
+
+   return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map_get_next_key(int fd, void *key, void *next_key)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.map_fd = fd;
+   attr.key = ptr_to_u64(key);
+   attr.next_key = ptr_to_u64(next_key);
+
+   return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
+
+int bpf_map_pin(int fd, const char *pathname)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.pathname = ptr_to_u64((void *)pathname);
+   attr.bpf_fd = fd;
+
+   return sys_bpf(BPF_OBJ_PIN, &attr, sizeof(attr));
+}
+
+int bpf_map_get(const char *pathname)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.pathname = ptr_to_u64((void *)pathname);
+
+   return sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
+}


bpf_map_{pin,get} should be renamed to bpf_obj_{pin,get} since they can
be used on BPF programs as well. Thanks to Joe Stringer.


diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba540..5b3e52b 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,4 +35,11 @@ int bpf_load_program(enum bpf_prog_type type, struct 
bpf_insn *insns,
  
  int bpf_map_update_elem(int fd, void *key, void *value,

u64 flags);
+
+int bpf_map_lookup_elem(int fd, void *key, void *value);
+int bpf_map_delete_elem(int fd, void *key);
+int bpf_map_get_next_key(int fd, void *key, void *next_key);
+int bpf_map_pin(int fd, const char *pathname);
+int bpf_map_get(const char *pathname);
+
  #endif





Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf

2016-11-16 Thread Wangnan (F)



On 2016/11/17 10:46, Joe Stringer wrote:

On 16 November 2016 at 18:10, Wangnan (F)  wrote:

I'm also working on improving bpf.c. Please have a look at:

https://lkml.org/lkml/2016/11/14/1078

Since bpf.c is simple, I think we can add more functions and fixes
gradually, instead of a full copy.

See my inline comment below.

Ah, I missed this, my apologies. It looks like it will provide much of
what I need, I can reassess this patch with your series in mind.

One comment though for your patch (I don't have the original thread to
respond to unfortunately): The map_pin and map_get functions in your
patch series can be used to pin progs too, so maybe there is a better
name? You'll see that this patch uses bpf_obj_{pin,get}() - although I
wouldn't want those to be confused with the libbpf.c objects so maybe
there's a clearer name that could be used.


The full thread can be found:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1272045.html

(lkml.kernel.org is not working for me, sorry)

In that patch set, bpf_map_pin/get is linked into perf hooks, so a BPF script
can pin a map to sysfs. I think this feature would be useful, but I don't
have an example showing how to use it. I didn't provide a program pinning/get
interface because it is not useful in perf hooks. After rethinking your
suggestion, I now think it is okay to provide bpf_obj_{pin,get} in bpf.c
and export bpf_map_pin to perf hooks. I'll adjust my patch.


I also have some patches to rework the samples/bpf/* code to use
libbpf instead of the sample code that is there, is it worth me
submitting that? It will need to wait for your patch to go in, plus a
merge with davem's tree.


On 2016/11/17 1:43, Joe Stringer wrote:


[SNIP]


   /*
@@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr
*attr,
 return syscall(__NR_bpf, cmd, attr, size);
   }
   -int bpf_create_map(enum bpf_map_type map_type, int key_size,
-  int value_size, int max_entries)
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int
value_size,
+  int max_entries, int map_flags)
   {
-   union bpf_attr attr;
+   union bpf_attr attr = {
+   .map_type = map_type,
+   .key_size = key_size,
+   .value_size = value_size,
+   .max_entries = max_entries,
+   .map_flags = map_flags,
+   };
   - memset(&attr, '\0', sizeof(attr));
+   return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}



I lost map_flags in the original bpf.c. Thanks for your patch. map_flags is
useful when creating BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is
meaningful in this case.

Do you want me to resubmit this piece as a separate patch or will you
address this?


Please send it.

Thank you.



Re: [PATCHv2 perf/core 2/2] tools lib bpf: Sync with samples/bpf/libbpf

2016-11-16 Thread Wangnan (F)

I'm also working on improving bpf.c. Please have a look at:

https://lkml.org/lkml/2016/11/14/1078

Since bpf.c is simple, I think we can add more functions and fixes
gradually, instead of a full copy.

See my inline comment below.

On 2016/11/17 1:43, Joe Stringer wrote:

Extend the tools/ version of libbpf to include all of the functionality
provided in the samples/bpf version.

Signed-off-by: Joe Stringer 
---
v2: Don't shift non-bpf changes across.
 Various type cleanups, removal of extraneous declarations
---
  tools/lib/bpf/bpf.c| 107 --
  tools/lib/bpf/bpf.h| 202 +++--
  tools/lib/bpf/libbpf.c |   3 +-
  3 files changed, 279 insertions(+), 33 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed62235b..5e061851ac00 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -20,10 +20,17 @@
   */
  
  #include 

-#include 
+#include 
  #include 
  #include 
+#include 
+#include 
  #include 
+#include 
+#include 
+#include 
+#include 
+#include 
  #include "bpf.h"
  


Why we need these network related headers?


  /*
@@ -53,24 +60,71 @@ static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr,
return syscall(__NR_bpf, cmd, attr, size);
  }
  
-int bpf_create_map(enum bpf_map_type map_type, int key_size,

-  int value_size, int max_entries)
+int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
+  int max_entries, int map_flags)
  {
-   union bpf_attr attr;
+   union bpf_attr attr = {
+   .map_type = map_type,
+   .key_size = key_size,
+   .value_size = value_size,
+   .max_entries = max_entries,
+   .map_flags = map_flags,
+   };
  
-	memset(&attr, '\0', sizeof(attr));

+   return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
+}
  


I lost map_flags in the original bpf.c. Thanks for your patch. map_flags is
useful when creating BPF_MAP_TYPE_PERCPU_HASH: BPF_F_NO_PREALLOC is
meaningful in this case.

Although it is okay in samples, I still prefer an explicit bzero() or
memset(), because the kernel checks that unused fields in this union are
zero. However, I'll check the C standard to see how unused fields are
initialized.





diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba54087497..4dba36995771 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -23,16 +23,202 @@
  
  #include 
  
+struct bpf_insn;

+
  int bpf_create_map(enum bpf_map_type map_type, int key_size, int value_size,
-  int max_entries);
+  int max_entries, int map_flags);
+int bpf_update_elem(int fd, void *key, void *value, unsigned long long flags);
+int bpf_lookup_elem(int fd, void *key, void *value);
+int bpf_delete_elem(int fd, void *key);
+int bpf_get_next_key(int fd, void *key, void *next_key);
+
+int bpf_load_program(enum bpf_prog_type prog_type,
+const struct bpf_insn *insns, int insn_len,
+const char *license, int kern_version,
+char *log_buf, size_t log_buf_sz);
+
+int bpf_obj_pin(int fd, const char *pathname);
+int bpf_obj_get(const char *pathname);
  
-/* Recommend log buffer size */

  #define BPF_LOG_BUF_SIZE 65536
-int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
-size_t insns_cnt, char *license,
-u32 kern_version, char *log_buf,
-size_t log_buf_sz);
  
-int bpf_map_update_elem(int fd, void *key, void *value,

-   u64 flags);
+/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */
+
+#define BPF_ALU64_REG(OP, DST, SRC)\
+   ((struct bpf_insn) {\
+   .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,\
+   .dst_reg = DST, \
+   .src_reg = SRC, \
+   .off   = 0, \
+   .imm   = 0 })
+


Should we define these macros here? They are in include/linux/filter.h
and duplicated in tools/include/linux/filter.h. Redefining them here
would cause conflicts.

Thank you.




Re: [PATCHv2 perf/core 1/2] tools lib bpf: Sync {tools,}/include/uapi/linux/bpf.h

2016-11-16 Thread Wangnan (F)



On 2016/11/17 1:43, Joe Stringer wrote:

The tools version of this header is out of date; update it to the latest
version from the kernel headers.

Signed-off-by: Joe Stringer 
---
v2: No change.
---
  tools/include/uapi/linux/bpf.h | 51 ++
  1 file changed, 51 insertions(+)


Acked-by: Wang Nan 


diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9e5fc168c8a3..f09c70b97eca 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -95,6 +95,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_PERF_EVENT,
  };
  
  #define BPF_PSEUDO_MAP_FD	1

@@ -375,6 +376,56 @@ enum bpf_func_id {
 */
BPF_FUNC_probe_write_user,
  
+	/**

+* bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership 
of current task
+* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
+* @index: index of the cgroup in the bpf_map
+* Return:
+*   == 0 current failed the cgroup2 descendant test
+*   == 1 current succeeded the cgroup2 descendant test
+*< 0 error
+*/
+   BPF_FUNC_current_task_under_cgroup,
+
+   /**
+* bpf_skb_change_tail(skb, len, flags)
+* The helper will resize the skb to the given new size,
+* to be used f.e. with control messages.
+* @skb: pointer to skb
+* @len: new skb length
+* @flags: reserved
+* Return: 0 on success or negative error
+*/
+   BPF_FUNC_skb_change_tail,
+
+   /**
+* bpf_skb_pull_data(skb, len)
+* The helper will pull in non-linear data in case the
+* skb is non-linear and not all of len are part of the
+* linear section. Only needed for read/write with direct
+* packet access.
+* @skb: pointer to skb
+* @len: len to make read/writeable
+* Return: 0 on success or negative error
+*/
+   BPF_FUNC_skb_pull_data,
+
+   /**
+* bpf_csum_update(skb, csum)
+* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
+* @skb: pointer to skb
+* @csum: csum to add
+* Return: csum on success or negative error
+*/
+   BPF_FUNC_csum_update,
+
+   /**
+* bpf_set_hash_invalid(skb)
+* Invalidate current skb>hash.
+* @skb: pointer to skb
+*/
+   BPF_FUNC_set_hash_invalid,
+
__BPF_FUNC_MAX_ID,
  };
  





Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support

2016-11-14 Thread Wangnan (F)



On 2016/11/15 13:21, Alexei Starovoitov wrote:

On Mon, Nov 14, 2016 at 9:03 PM, Wangnan (F)  wrote:


On 2016/11/15 12:57, Alexei Starovoitov wrote:

On Mon, Nov 14, 2016 at 8:05 PM, Wang Nan  wrote:

This is version 2 of perf builtin clang patch series. Compare to v1,
add an exciting feature: jit compiling perf hook functions. This
features allows script writer report result through BPF map in a
customized way.

looks great.


SEC("perfhook:record_start")
void record_start(void *ctx)
{
  int perf_pid = getpid(), key = G_perf_pid;
  printf("Start count, perfpid=%d\n", perf_pid);
  jit_helper__map_update_elem(ctx, &GVALS, &key, &perf_pid, 0);

the name, I think, is too verbose.
Why not to keep them as bpf_map_update_elem
even for user space programs?


I can make it shorter by giving it a better name or use a wrapper like

BPF_MAP(update_elem)

the macro isn't pretty, since function calls won't look like calls.


but the only thing I can't do is make perfhook and in-kernel scripts
use a uniform name for these bpf_map functions, because
bpf_map_update_elem is already defined:

"static long (*bpf_map_update_elem)(void *, void *, void *, unsigned long) =
(void *)2;\n"

right. i guess you could have #ifdef it, so it's different for bpf backend
and for native.


Then the '.c' -> LLVM IR compilation would have to be done twice, once
for BPF and once for the JIT, to make the macro work. The current
implementation has only one LLVM IR. It is faster and ensures the data
layout (the "maps" section) is identical.


Another alternative is to call it map_update_elem or map_update
or bpf_map_update. Something shorter is already a win.
'jit_helper__' prefix is an implementation detail. The users don't
need to know and don't need to spell it out everywhere.

Good. Let's choose a better name for them.

Thank you.



Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support

2016-11-14 Thread Wangnan (F)



On 2016/11/15 12:57, Alexei Starovoitov wrote:

On Mon, Nov 14, 2016 at 8:05 PM, Wang Nan  wrote:

This is version 2 of perf builtin clang patch series. Compare to v1,
add an exciting feature: jit compiling perf hook functions. This
features allows script writer report result through BPF map in a
customized way.

looks great.


   SEC("perfhook:record_start")
   void record_start(void *ctx)
   {
 int perf_pid = getpid(), key = G_perf_pid;
 printf("Start count, perfpid=%d\n", perf_pid);
 jit_helper__map_update_elem(ctx, &GVALS, &key, &perf_pid, 0);

the name, I think, is too verbose.
Why not to keep them as bpf_map_update_elem
even for user space programs?


I can make it shorter by giving it a better name or use a wrapper like

BPF_MAP(update_elem)

but the only thing I can't do is make perfhook and in-kernel scripts
use a uniform name for these bpf_map functions, because
bpf_map_update_elem is already defined:

"static long (*bpf_map_update_elem)(void *, void *, void *, unsigned 
long) = (void *)2;\n"




   SEC("perfhook:record_end")
   void record_end(void *ctx)
   {
 u64 key = -1, value;
 while (!jit_helper__map_get_next_key(ctx, &syscall_counter, &key, 
&key)) {
 jit_helper__map_lookup_elem(ctx, &syscall_counter, &key, 
&value);
 printf("syscall %ld\tcount: %ld\n", (long)key, (long)value);

this loop will be less verbose as well.





Re: [PATCH 00/34] perf clang: Builtin clang and perfhook support

2016-11-14 Thread Wangnan (F)



On 2016/11/15 12:05, Wang Nan wrote:

   $ sudo -s
   # ulimit -l unlimited
   # perf record -e ./count_syscalls.c echo "Haha"
   Start count, perfpid=25209
   Haha
   [ perf record: Woken up 1 times to write data ]
   syscall 8count: 6
   syscall 11   count: 1
   syscall 4count: 6
   syscall 21   count: 1
   syscall 5count: 3
   syscall 231  count: 1
   syscall 45   count: 3
   syscall 0count: 24
   syscall 257  count: 1
   syscall 59   count: 4
   syscall 23   count: 9
   syscall 78   count: 2
   syscall 41   count: 4
   syscall 72   count: 8
   syscall 10   count: 3
   syscall 321  count: 1
   syscall 298  count: 7
   syscall 16   count: 21
   syscall 9count: 16
   syscall 1count: 114
   syscall 12   count: 3
   syscall 14   count: 35
   syscall 158  count: 1
   syscall 2count: 15
   syscall 7count: 18
   syscall 3count: 11
   [ perf record: Captured and wrote 0.011 MB perf.data ]


Note that this example counts a system-wide syscall histogram, not
only the 'echo' process. The in-kernel BPF script doesn't know the pid of
'echo', so it can't filter based on it. I'm planning to add more perf hook
points to pass information like this.

Thank you.



Re: [PATCH 7/8] tools lib bpf: fix maps resolution

2016-11-08 Thread Wangnan (F)

Hi Eric,

During testing this patch I find a segfault, please see inline comment.

In addition, since both the BPF map array and the map names should be
built after the symbol table is collected, merging bpf_object__init_maps
and bpf_object__init_maps_name would be good practice, making the code
simpler.

So I prepare a new patch. Please have a look at:

http://lkml.kernel.org/g/20161108215734.28905-1-wangn...@huawei.com

The new version ensures we don't crash in any case where the user provides
a corrupted maps section, including arrays of BPF maps, maps with different
definition structures, and very short map definitions.

Thank you.

On 2016/10/16 14:18, Eric Leblond wrote:

It is not correct to treat the ELF data of the maps section
as an array of map definitions. In fact the sizes differ. The
offset provided in the symbol section has to be used instead.

This patch fixes a bug causing an ELF with two maps not to load
correctly.

Signed-off-by: Eric Leblond 
---
  tools/lib/bpf/libbpf.c | 50 +++---
  1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1fe4532..f72628b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -186,6 +186,7 @@ struct bpf_program {
  struct bpf_map {
int fd;
char *name;
+   size_t offset;
struct bpf_map_def def;
void *priv;
bpf_map_clear_priv_t clear_priv;
@@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,
  
  	pr_debug("maps in %s: %zd bytes\n", obj->path, size);
  
-	obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));

-   if (!obj->maps) {
-   pr_warning("alloc maps for object failed\n");
-   return -ENOMEM;
-   }
-   obj->nr_maps = nr_maps;
-
for (i = 0; i < nr_maps; i++) {
struct bpf_map_def *def = &obj->maps[i].def;
  
@@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,

obj->maps[i].fd = -1;
  
  		/* Save map definition into obj->maps */

-   *def = ((struct bpf_map_def *)data)[i];
+   *def = *(struct bpf_map_def *)(data + obj->maps[i].offset);
}


Here, nr_maps is still size / sizeof(struct bpf_map_def), so obj->maps[i]
can be invalid.


return 0;
  }
  
  static int

-bpf_object__init_maps_name(struct bpf_object *obj)
+bpf_object__init_maps_symbol(struct bpf_object *obj)
  {
int i;
+   int nr_maps = 0;
Elf_Data *symbols = obj->efile.symbols;
+   size_t map_idx = 0;
  
  	if (!symbols || obj->efile.maps_shndx < 0)

return -EINVAL;
  
+	/* get the number of maps */

+   for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
+   GElf_Sym sym;
+
+   if (!gelf_getsym(symbols, i, &sym))
+   continue;
+   if (sym.st_shndx != obj->efile.maps_shndx)
+   continue;
+   nr_maps++;
+   }
+
+   obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
+   if (!obj->maps) {
+   pr_warning("alloc maps for object failed\n");
+   return -ENOMEM;
+   }
+   obj->nr_maps = nr_maps;
+
for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
GElf_Sym sym;
-   size_t map_idx;
const char *map_name;
  
  		if (!gelf_getsym(symbols, i, &sym))

@@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj)
map_name = elf_strptr(obj->efile.elf,
  obj->efile.strtabidx,
  sym.st_name);
-   map_idx = sym.st_value / sizeof(struct bpf_map_def);
if (map_idx >= obj->nr_maps) {
pr_warning("index of map \"%s\" is buggy: %zu > %zu\n",
   map_name, map_idx, obj->nr_maps);
continue;
}
+   obj->maps[map_idx].offset = sym.st_value;
obj->maps[map_idx].name = strdup(map_name);
if (!obj->maps[map_idx].name) {
pr_warning("failed to alloc map name\n");
@@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj)
}
pr_debug("map %zu is \"%s\"\n", map_idx,
 obj->maps[map_idx].name);
+   map_idx++;
}
return 0;
  }
@@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
data->d_buf,
data->d_size);
else if (strcmp(name, "maps") == 0) {
-   err = bpf_object__init_maps(obj, data->d_buf,
-   data->d_size);
obj->efile.maps_shndx = idx;
} else if (sh.sh_type == S

Re: [PATCH 7/8] tools lib bpf: fix maps resolution

2016-11-07 Thread Wangnan (F)

Hi Eric,

Are you still working on this patch set?

Now I know why the maps section is not a simple array,
from a patch set by Joe Stringer:

https://www.mail-archive.com/netdev@vger.kernel.org/msg135088.html

So I think this patch is really useful.

Are you going to resend the whole patch set? If not, let me collect
this patch 7/8 into my local code base and send it to Arnaldo
with my other patches.

Thank you.

On 2016/10/17 5:18, Eric Leblond wrote:

It is not correct to treat the ELF data of the maps section
as an array of map definitions. In fact the sizes differ. The
offset provided in the symbol section has to be used instead.

This patch fixes a bug causing an ELF with two maps not to load
correctly.

Signed-off-by: Eric Leblond 
---
  tools/lib/bpf/libbpf.c | 50 +++---
  1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1fe4532..f72628b 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -186,6 +186,7 @@ struct bpf_program {
  struct bpf_map {
int fd;
char *name;
+   size_t offset;
struct bpf_map_def def;
void *priv;
bpf_map_clear_priv_t clear_priv;
@@ -529,13 +530,6 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,
  
  	pr_debug("maps in %s: %zd bytes\n", obj->path, size);
  
-	obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));

-   if (!obj->maps) {
-   pr_warning("alloc maps for object failed\n");
-   return -ENOMEM;
-   }
-   obj->nr_maps = nr_maps;
-
for (i = 0; i < nr_maps; i++) {
struct bpf_map_def *def = &obj->maps[i].def;
  
@@ -547,23 +541,42 @@ bpf_object__init_maps(struct bpf_object *obj, void *data,

obj->maps[i].fd = -1;
  
  		/* Save map definition into obj->maps */

-   *def = ((struct bpf_map_def *)data)[i];
+   *def = *(struct bpf_map_def *)(data + obj->maps[i].offset);
}
return 0;
  }
  
  static int

-bpf_object__init_maps_name(struct bpf_object *obj)
+bpf_object__init_maps_symbol(struct bpf_object *obj)
  {
int i;
+   int nr_maps = 0;
Elf_Data *symbols = obj->efile.symbols;
+   size_t map_idx = 0;
  
  	if (!symbols || obj->efile.maps_shndx < 0)

return -EINVAL;
  
+	/* get the number of maps */

+   for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
+   GElf_Sym sym;
+
+   if (!gelf_getsym(symbols, i, &sym))
+   continue;
+   if (sym.st_shndx != obj->efile.maps_shndx)
+   continue;
+   nr_maps++;
+   }
+
+   obj->maps = calloc(nr_maps, sizeof(obj->maps[0]));
+   if (!obj->maps) {
+   pr_warning("alloc maps for object failed\n");
+   return -ENOMEM;
+   }
+   obj->nr_maps = nr_maps;
+
for (i = 0; i < symbols->d_size / sizeof(GElf_Sym); i++) {
GElf_Sym sym;
-   size_t map_idx;
const char *map_name;
  
  		if (!gelf_getsym(symbols, i, &sym))

@@ -574,12 +587,12 @@ bpf_object__init_maps_name(struct bpf_object *obj)
map_name = elf_strptr(obj->efile.elf,
  obj->efile.strtabidx,
  sym.st_name);
-   map_idx = sym.st_value / sizeof(struct bpf_map_def);
if (map_idx >= obj->nr_maps) {
pr_warning("index of map \"%s\" is buggy: %zu > %zu\n",
   map_name, map_idx, obj->nr_maps);
continue;
}
+   obj->maps[map_idx].offset = sym.st_value;
obj->maps[map_idx].name = strdup(map_name);
if (!obj->maps[map_idx].name) {
pr_warning("failed to alloc map name\n");
@@ -587,6 +600,7 @@ bpf_object__init_maps_name(struct bpf_object *obj)
}
pr_debug("map %zu is \"%s\"\n", map_idx,
 obj->maps[map_idx].name);
+   map_idx++;
}
return 0;
  }
@@ -647,8 +661,6 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
data->d_buf,
data->d_size);
else if (strcmp(name, "maps") == 0) {
-   err = bpf_object__init_maps(obj, data->d_buf,
-   data->d_size);
obj->efile.maps_shndx = idx;
} else if (sh.sh_type == SHT_SYMTAB) {
if (obj->efile.symbols) {
@@ -698,8 +710,16 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
pr_warning("Corrupted ELF file: index of strtab invalid\n");
return LIBBPF_ERRNO__FORMAT;
}
-   if (obj->efile.

Re: [PATCH 2/2] perf_event_open.2: Document write_backward

2016-10-23 Thread Wangnan (F)



On 2016/10/22 18:05, Michael Kerrisk (man-pages) wrote:

On 10/21/2016 11:25 PM, Vince Weaver wrote:

On Fri, 21 Oct 2016, Wang Nan wrote:


context_switch :  1,  /* context switch data */
-
-  __reserved_1   : 37;
+  write_backward :  1,  /* Write ring buffer from end to beginning */
+  __reserved_1   : 36;

This removes a blank line, not sure if intentional or not.

Maybe it would be better to keep it. I don't feel too strongly about
this though.


+.IR "write_backward" " (since Linux 4.6)"

It didn't committed until Linux 4.7 from what I can tell?

Yes, that's my recollection too.


+This makes the resuling event use a backward ring-buffer, which

resulting


+writes samples from the end of the ring-buffer.
+
+It is not allowed to connect events with backward and forward
+ring-buffer settings together using
+.B PERF_EVENT_IOC_SET_OUTPUT.
+
+Backward ring-buffer is useful when the ring-buffer is overwritable
+(created by readonly
+.BR mmap (2)
+).

A ring buffer is over-writable when it is mmapped readonly?
Is this a hard requirement?


I'd like to explain the over-writable ring buffer in patch 1/1 like this:

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index fade28c..561331c 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -1687,6 +1687,15 @@ the
 .I data_tail
 value should be written by user space to reflect the last read data.
 In this case, the kernel will not overwrite unread data.
+
+When the mapping is read only (without
+.BR PROT_WRITE ),
+setting .I data_tail is not allowed.
+In this case, the kernel will overwrite data when samples come, unless
+the ring buffer is paused by a
+.BR PERF_EVENT_IOC_PAUSE_OUTPUT
+.BR ioctl (2)
+system call before reading.
 .TP
 .IR data_offset " (since Linux 4.1)"
 .\" commit e8c6deac69629c0cb97c3d3272f8631ef17f8f0f

The ring buffer becomes over-writable because there's no way to tell the
kernel the position of the last read data when mmapped read-only.


Can you set the read-backwards bit if not mapped readonly?


I don't understand why we need read-backwards.

Mapping with PROT_WRITE is the *default* setting. In this case a user
program like perf can tell the kernel its reading position by writing to
'data_tail', so the kernel won't overwrite unread data and keeps writing
forward.

Or do you think the naming is confusing? The name 'write_backward' is
kernel-centric; it describes kernel behavior. The kernel *writes* data, so I
call it 'write_backward'. The name 'read-backwards' would be user-centric,
because the user *reads* data.

Thank you.




Re: [PATCH 1/5] perf core: Introduce new ioctl options to pause and resume ring buffer

2016-10-21 Thread Wangnan (F)



On 2016/10/21 15:06, Michael Kerrisk (man-pages) wrote:

Hello Wangnan,

The patch below seems to have landed in Linux 4.7,
commit 86e7972f690c1017fd086cdfe53d8524e68c661c

Could you draft a man-pages patch for this interface
change, please? Or, failing that, a plain-text
description that we can integrate into the man-page.


I sent man-pages patches at March:

https://patchwork.kernel.org/patch/8678861/
http://www.spinics.net/lists/linux-man/msg10177.html

Let me resend them again.

Thank you.



Re: [PATCH 1/8] tools lib bpf: add error functions

2016-10-18 Thread Wangnan (F)



On 2016/10/19 6:52, Joe Stringer wrote:

On 16 October 2016 at 14:18, Eric Leblond  wrote:

The include of err.h is not explicitely needed in exported
functions and it was causing include conflict with some existing
code due to redefining some macros.

To fix this, let's have error handling functions provided by the
library. Furthermore this will allow user to have an homogeneous
API.

Signed-off-by: Eric Leblond 

Does it need to return the error like this or should we just fix up
the bpf_object__open() API to return errors in a simpler form?

There's already libbpf_set_print(...) for outputting errors, is it
reasonable to just change the library to return NULLs in error cases
instead?


Returning the error code to the caller lets the caller know what happened.
Other subsystems in perf also do this.

Perf hides libbpf's error output (it is silent unless -v is given),
so it needs a way to receive libbpf's error code.

I think this patch is good: it decouples libbpf.h from kernel headers.

Thank you.



Re: [PATCH 7/8] tools lib bpf: fix maps resolution

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

It is not correct to treat the ELF data of the maps section
as an array of map definitions. In fact the sizes differ. The
offset provided in the symbol section has to be used instead.

This patch fixes a bug causing an ELF with two maps not to load
correctly.


Could you please give an example so we can understand why
section 'maps' is not an array?

Thank you.



Re: [PATCH 6/8] tools lib bpf: improve warning

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

Signed-off-by: Eric Leblond 


Please add some commit messages.

Thank you.


---
  tools/lib/bpf/libbpf.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 7cd341e..1fe4532 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -802,7 +802,8 @@ bpf_object__create_maps(struct bpf_object *obj)
size_t j;
int err = *pfd;
  
-			pr_warning("failed to create map: %s\n",

+   pr_warning("failed to create map (name: '%s'): %s\n",
+  obj->maps[i].name,
   strerror(errno));
for (j = 0; j < i; j++)
zclose(obj->maps[j].fd);





Re: [PATCH 5/8] tools lib bpf: add missing functions

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

Some functions were missing from the library for the case where
userspace handles the maps in the kernel.

The patch also renames functions to follow a homogeneous naming
convention.

Signed-off-by: Eric Leblond 
---
  tools/lib/bpf/bpf.c| 35 ++-
  tools/lib/bpf/bpf.h|  2 --
  tools/lib/bpf/libbpf.h |  5 +
  3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 4212ed6..c0e07bd 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,7 @@
  #include 
  #include 
  #include "bpf.h"
+#include "libbpf.h"
  
  /*

   * When building perf, unistd.h is overrided. __NR_bpf is
@@ -97,7 +98,7 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn 
*insns,
return sys_bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
  }
  
-int bpf_map_update_elem(int fd, void *key, void *value,

+int bpf_map__update_elem(int fd, void *key, void *value,
u64 flags)


Please don't use the '__' style API here. It is easily confused with
bpf_map__*() in libbpf.h. They are APIs at different levels.

bpf_map__*() are APIs for 'struct bpf_map', an object introduced
by libbpf and defined in libbpf.h. bpf_map_*() APIs operate on fds,
which are objects defined by the kernel; bpf_map_*() APIs are declared
in bpf.h.

In libbpf, bpf.h directly operates on kernel objects (fds), and APIs in
it are named bpf_map_*(); libbpf.h operates on 'struct bpf_map' objects,
and APIs in it are named bpf_map__*(). libbpf.h and bpf.h are independent
of each other.


  {
union bpf_attr attr;
@@ -110,3 +111,35 @@ int bpf_map_update_elem(int fd, void *key, void *value,
  
  	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

  }
+
+int bpf_map__lookup_elem(int fd, void *key, void *value)
+{
+   union bpf_attr attr = {
+   .map_fd = fd,
+   .key = ptr_to_u64(key),
+   .value = ptr_to_u64(value),
+   };
+
+   return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map__delete_elem(int fd, void *key)
+{
+   union bpf_attr attr = {
+   .map_fd = fd,
+   .key = ptr_to_u64(key),
+   };
+
+   return sys_bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
+}
+
+int bpf_map__get_next_key(int fd, void *key, void *next_key)
+{
+   union bpf_attr attr = {
+   .map_fd = fd,
+   .key = ptr_to_u64(key),
+   .next_key = ptr_to_u64(next_key),
+   };
+
+   return sys_bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr));
+}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index e8ba540..5ca834a 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -33,6 +33,4 @@ int bpf_load_program(enum bpf_prog_type type, struct bpf_insn 
*insns,
 u32 kern_version, char *log_buf,
 size_t log_buf_sz);
  
-int bpf_map_update_elem(int fd, void *key, void *value,

-   u64 flags);
  #endif
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index a18783b..dfb46d0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,11 @@ bpf_map__next(struct bpf_map *map, struct bpf_object *obj);
  int bpf_map__fd(struct bpf_map *map);
  const struct bpf_map_def *bpf_map__def(struct bpf_map *map);
  const char *bpf_map__name(struct bpf_map *map);
+int bpf_map__update_elem(int fd, void *key, void *value,
+   uint64_t flags);
+int bpf_map__lookup_elem(int fd, void *key, void *value);
+int bpf_map__delete_elem(int fd, void *key);
+int bpf_map__get_next_key(int fd, void *key, void *next_key);


As we have discussed, the newly introduced functions should be added
in bpf.h.

Thank you.



Re: [PATCH 4/8] tools lib bpf: export function to set type

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

The current API did not allow the user to set a type like socket
filter. To avoid a setter function for each type, the patch simply
exports a set function that takes the type as a parameter.

Signed-off-by: Eric Leblond 
---
  tools/lib/bpf/libbpf.c | 19 +--
  tools/lib/bpf/libbpf.h |  3 +++
  2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 90932f1..7cd341e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1336,26 +1336,25 @@ int bpf_program__nth_fd(struct bpf_program *prog, int n)
return fd;
  }
  
-static void bpf_program__set_type(struct bpf_program *prog,

- enum bpf_prog_type type)
+int bpf_program__set_type(struct bpf_program *prog, unsigned int type)
  {
+   if (!prog)
+   return -EINVAL;
+   if (type >= __MAX_BPF_PROG_TYPE)
+   return -EINVAL;
+
prog->type = type;
+   return 0;
  }
  
  int bpf_program__set_tracepoint(struct bpf_program *prog)

  {
-   if (!prog)
-   return -EINVAL;
-   bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
-   return 0;
+   return bpf_program__set_type(prog, BPF_PROG_TYPE_TRACEPOINT);
  }
  
  int bpf_program__set_kprobe(struct bpf_program *prog)

  {
-   if (!prog)
-   return -EINVAL;
-   bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
-   return 0;
+   return bpf_program__set_type(prog, BPF_PROG_TYPE_KPROBE);
  }
  
  static bool bpf_program__is_type(struct bpf_program *prog,

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e40c8d3..a18783b 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -173,6 +173,9 @@ int bpf_program__set_kprobe(struct bpf_program *prog);
  bool bpf_program__is_tracepoint(struct bpf_program *prog);
  bool bpf_program__is_kprobe(struct bpf_program *prog);
  
+int bpf_program__set_type(struct bpf_program *prog,

+ unsigned int type);
+


Although you don't include uapi/linux/bpf.h in this patch, logically
you add this dependency.

Please continue adding bpf_program__set_socket_filter() and
bpf_program__is_socket_filter(), like what we do for tracepoint.
This way libbpf.h is independent from kernel headers.

We can use macro in both .h and .c.

Thank you.



Re: [PATCH 3/8] tools: Sync tools/include/uapi/linux/bpf.h with the kernel

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

Signed-off-by: Eric Leblond 


Commit message is required.

Thank you.


---
  tools/include/uapi/linux/bpf.h | 52 ++
  1 file changed, 52 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9e5fc16..570287f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -95,6 +95,8 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SCHED_ACT,
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
+   BPF_PROG_TYPE_PERF_EVENT,
+   __MAX_BPF_PROG_TYPE,
  };
  
  #define BPF_PSEUDO_MAP_FD	1

@@ -375,6 +377,56 @@ enum bpf_func_id {
 */
BPF_FUNC_probe_write_user,
  
+	/**

+* bpf_current_task_under_cgroup(map, index) - Check cgroup2 membership 
of current task
+* @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
+* @index: index of the cgroup in the bpf_map
+* Return:
+*   == 0 current failed the cgroup2 descendant test
+*   == 1 current succeeded the cgroup2 descendant test
+*< 0 error
+*/
+   BPF_FUNC_current_task_under_cgroup,
+
+   /**
+* bpf_skb_change_tail(skb, len, flags)
+* The helper will resize the skb to the given new size,
+* to be used f.e. with control messages.
+* @skb: pointer to skb
+* @len: new skb length
+* @flags: reserved
+* Return: 0 on success or negative error
+*/
+   BPF_FUNC_skb_change_tail,
+
+   /**
+* bpf_skb_pull_data(skb, len)
+* The helper will pull in non-linear data in case the
+* skb is non-linear and not all of len are part of the
+* linear section. Only needed for read/write with direct
+* packet access.
+* @skb: pointer to skb
+* @len: len to make read/writeable
+* Return: 0 on success or negative error
+*/
+   BPF_FUNC_skb_pull_data,
+
+   /**
+* bpf_csum_update(skb, csum)
+* Adds csum into skb->csum in case of CHECKSUM_COMPLETE.
+* @skb: pointer to skb
+* @csum: csum to add
+* Return: csum on success or negative error
+*/
+   BPF_FUNC_csum_update,
+
+   /**
+* bpf_set_hash_invalid(skb)
+* Invalidate current skb->hash.
+* @skb: pointer to skb
+*/
+   BPF_FUNC_set_hash_invalid,
+
__BPF_FUNC_MAX_ID,
  };
  





Re: [PATCH 1/8] tools lib bpf: add error functions

2016-10-16 Thread Wangnan (F)



On 2016/10/17 5:18, Eric Leblond wrote:

The include of err.h is not explicitly needed by exported
functions, and it was causing an include conflict with some existing
code due to redefined macros.

To fix this, let's have error handling functions provided by the
library. Furthermore, this will allow users to have a homogeneous
API.

Signed-off-by: Eric Leblond 
---
  tools/lib/bpf/libbpf.c | 11 +++
  tools/lib/bpf/libbpf.h |  4 +++-
  2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b699aea..90932f1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -31,6 +31,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  
@@ -1447,3 +1448,13 @@ bpf_object__find_map_by_name(struct bpf_object *obj, const char *name)

}
return NULL;
  }
+
+bool bpf__is_error(const void *ptr)


Please use libbpf_is_error(), like libbpf_set_print(). We use '__'
because we want to follow the OO concept, and this utility is not OO.


+{
+   return IS_ERR(ptr);
+}
+
+long bpf__get_error(const void *ptr)


Same, please call it libbpf_get_error().

Thank you.



Re: [PATCH 1/3] perf, tools: Handle events including .c and .o

2016-10-07 Thread Wangnan (F)



On 2016/10/7 4:18, Arnaldo Carvalho de Melo wrote:

Em Wed, Oct 05, 2016 at 12:47:10PM -0700, Andi Kleen escreveu:

From: Andi Kleen 

This is a generic bug fix, but it helps with Sukadev's JSON event tree
where such events can happen.

Any event including a .c/.o/.bpf currently triggers BPF compilation or loading
and then an error.  This can happen for some Intel JSON events, which cannot
be used.

Fix the scanner to only match for .o or .c or .bpf at the end.
This will prevent loading multiple BPF scripts separated with comma,
but I assume this is acceptable.

So, I tried it with the example provided in the thread for a previous
version of this patch (IIRC) and it still fails:


[acme@jouet linux]$ perf stat -e 
'{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}' -a -I 1000
ERROR: problems with path {unc_p_clockticks,unc_p_power_state_occupancy.c: No 
such file or directory
event syntax error: '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}'
  \___ Failed to load 
{unc_p_clockticks,unc_p_power_state_occupancy.c from source: Error when 
compiling BPF scriptlet

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf stat [] []

 -e, --eventevent selector. use 'perf list' to list available 
events
[acme@jouet linux]$

And with another event that for sure is available on this machine:



[acme@jouet linux]$ perf stat -e '{uops_executed.core_cycles_ge_2}' -I 1000 
usleep 10
ERROR: problems with path {uops_executed.c: No such file or directory
event syntax error: '{uops_executed.core_cycles_ge_2}'
  \___ Failed to load {uops_executed.c from source: Error 
when compiling BPF scriptlet

(add -v to see detail)
Run 'perf list' for a list of valid events

  Usage: perf stat [] []

 -e, --eventevent selector. use 'perf list' to list available 
events
[acme@jouet linux]$


I thought this was due to the Makefile not noticing the change in the .l files, 
but I made
sure I deleted the build dir and rebuilt from scratch, same problem.

- Arnaldo
  


Tested again, and thank you for giving us another chance to fix this :)

The key problem here is not the ending '$' but the leading '{'. Flex's
greedy matching policy causes this problem.

According to the design of parse-events.l, when the scanner sees something
like '...{...}...', it first matches a 'group' in the '' scope, then
rewinds to the INITIAL scope to match the events in the group. In the
INITIAL scope, when it sees a '{', flex consumes this char and goes back
to the '' scope to match the next event. This worked well until we
started matching BPF file paths with the unbounded '.*\.c', because '.*'
will match the leading '{' in the INITIAL scope without it being consumed
as a group opener.

The simplest fix for this problem is restricting the '.*' part: like
what we define for 'event', don't match ',', '{' and '}'. Unlike
'event', '/' must be allowed because this is a path.

Will post a patch for it. Please test it again.

Thank you.



Re: [PATCH v2 00/18] perf clang: Support compiling BPF script on the fly

2016-10-07 Thread Wangnan (F)



On 2016/10/6 7:20, Arnaldo Carvalho de Melo wrote:

Em Mon, Sep 26, 2016 at 07:26:54AM +, Wang Nan escreveu:

This patch add builtin clang, allow perf compile BPF scripts on the fly.
This is the first step to implement what I announced at LinuxCon 2016 NA:

Ok, so I refreshed this series to apply against my latest perf/core and
put it in a tmp.perf/builtin-clang, will continue testing it tomorrow
after checking if fedora24 has those llvm-dev and libclang-dev that can
be used with this, do you know if it works on that distro?


Sorry for the delay. I only tested on Ubuntu.

I can see a llvm-static package for fedora:

https://apps.fedoraproject.org/packages/llvm-static/

but for clang I can find only dynamic libraries:

https://apps.fedoraproject.org/packages/clang-libs/

I think we can make it work by changing 'libclang$(l).a'
to 'libclang$(l).so' in commit 1125e7f6cf29.

Thank you.



Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-26 Thread Wangnan (F)



On 2016/9/27 7:58, Arnaldo Carvalho de Melo wrote:


On Sep 26, 2016, 8:47 PM, "Alexei Starovoitov" <alexei.starovoi...@gmail.com> wrote:

>
> On Mon, Sep 26, 2016 at 09:49:30AM +0800, Wangnan (F) wrote:
> >
> >
> > On 2016/9/24 23:16, Alexei Starovoitov wrote:
> > >On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:
> > >>This patch set is the first step to implement features I announced
> > >>in LinuxCon NA 2016. See page 31 of:
> > >>
> > >> 
http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf

> > >>
> > >>This patch set links LLVM and Clang libraries to perf, so perf
> > >>is able to compile BPF script to BPF object on the fly.
> > >Nice!
> > >So single perf binary won't have llvm external dependency anymore
> > >or both ways will be maintained?
> > >The command line stays the same?
> >
> > Yes. This patch set doesn't change the interface. It compiles the BPF
> > script with the builtin clang and, if that fails, falls back to the
> > external clang.
> >
> > >If I understand the patches correctly, this set is establishing
> > >the basic functionality and more complex features coming?
> > >
> >
> > Yes. The following steps are:
> >
> >  1. Ease-of-use improvement: automatically include BPF function
> > declarations and macros.
>
> +1
>
> >  2. Perf hooks: compile part of the BPF script into native code and
> > run it in perf when something happens. Create a channel to coordinate
> > the BPF and native code using the bpf-output event.
>
> +1
>
> >  3. Define a new language to support common profiling tasks. I'm not
> > yet sure what the new language should be. It may look like Lua;
> > perf would convert it to C and then to LLVM IR with the builtin clang.
>
> Many tracing languages were invented in the past.
> At this point I'm not sure what exactly new language will solve.
> To make it easier to write bpf programs?
> I think it will be more fruitful to tweak clang/llvm to add
> good warnings/errors for cases where we know that C is not going
> be compiled into the code that the kernel verifier will accept.
> Like we can complain about loops, unitialized variables,
> non-inlined and unkown helper functions... all from clang/llvm.
> imo that would be the better path forward and will help
> both tracing and networking users that write in this restricted C.

++1
>



OK. Now let's focus on the first two goals. After that let's consider
how to help writing BPF program.

Thank you.




Re: [PATCH 00/14] perf clang: Support compiling BPF script use builtin clang

2016-09-25 Thread Wangnan (F)



On 2016/9/24 23:16, Alexei Starovoitov wrote:

On Fri, Sep 23, 2016 at 12:49:47PM +, Wang Nan wrote:

This patch set is the first step to implement features I announced
in LinuxCon NA 2016. See page 31 of:

  
http://events.linuxfoundation.org/sites/events/files/slides/Performance%20Monitoring%20and%20Analysis%20Using%20perf%20and%20BPF_1.pdf

This patch set links LLVM and Clang libraries to perf, so perf
is able to compile BPF script to BPF object on the fly.

Nice!
So single perf binary won't have llvm external dependency anymore
or both ways will be maintained?
The command line stays the same?


Yes. This patch set doesn't change the interface. It compiles the BPF
script with the builtin clang and, if that fails, falls back to the
external clang.


If I understand the patches correctly, this set is establishing
the basic functionality and more complex features coming?



Yes. The following steps are:

 1. Ease-of-use improvement: automatically include BPF function
declarations and macros.

 2. Perf hooks: compile part of the BPF script into native code and
run it in perf when something happens. Create a channel to coordinate
the BPF and native code using the bpf-output event.

 3. Define a new language to support common profiling tasks. I'm not
yet sure what the new language should be. It may look like Lua;
perf would convert it to C and then to LLVM IR with the builtin clang.

Thank you.



Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1

2016-09-23 Thread Wangnan (F)



On 2016/9/22 23:41, Arnaldo Carvalho de Melo wrote:

Em Wed, Sep 21, 2016 at 03:56:20AM +, Wang Nan escreveu:

Before this patch, perf panics if kptr_restrict is set to 1 and perf is
owned by root with the suid bit set:

  $ whoami
  wangnan
  $ ls -l ./perf
  -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
  $ cat /proc/sys/kernel/kptr_restrict
  1
  $ cat /proc/sys/kernel/perf_event_paranoid
  -1
  $ ./perf record -a
  Segmentation fault (core dumped)

Trying to reproduce this here, and failing, where am I making a mistake?

  


I tried again. It is not related to paranoid. If
/proc/sys/kernel/kptr_restrict is set to 1 and perf runs with
(euid == 0 && uid != 0), it will panic at:

int perf_event__synthesize_kernel_mmap(struct perf_tool *tool,
...
kmap = map__kmap(map);
size = snprintf(event->mmap.filename, sizeof(event->mmap.filename),
"%s%s", mmap_name, kmap->ref_reloc_sym->name) + 
1; <-- *kmap->ref_reloc_sym is NULL*

...
}

because the following call chain:

machine__create_kernel_maps
 machine__get_running_kernel_start

returns NULL, so kmap->ref_reloc_sym is never set.

A non-root user never reaches the crash point because in
perf_event__synthesize_kernel_mmap() symbol_conf.kptr_restrict is true,
so perf_event__synthesize_kernel_mmap() returns failure directly.

Have you tried the following 'cat' test?


The reason is that perf assumes it is allowed to read kernel pointers
from /proc/kallsyms when euid is root, but in fact the kernel doesn't
allow reading kernel pointers when euid and uid do not match:
  

  $ cp /bin/cat .
  $ sudo chown root:root ./cat
  $ sudo chmod u+s ./cat
  $ cat /proc/kallsyms | grep do_fork
   T _do_fork  <--- kptr is hidden even euid is root
  $ sudo cat /proc/kallsyms | grep do_fork
  81080230 T _do_fork

See lib/vsprintf.c for kernel side code.

This patch fixes this problem by checking both uid and euid.

Humm, can't we just do:

-   value = (geteuid() != 0) ?
+   value = (getuid() != 0) ?


No. Though not very useful, with chown and chmod u+s one can make
((uid == 0) && (euid != 0)):


 # chown wangnan:wangnan ./perf
 # chmod u+s ./perf

Then perf fails if run by root:
 # ./perf record -a
 Segmentation fault (core dumped)

But okay for normal user.

Thank you.


I did it here and it seems to work:

[acme@jouet linux]$ whoami
acme
[acme@jouet linux]$ ls -la ~/bin/perf
-rwsr-xr-x. 2 root root 15539952 Sep 22 12:38 /home/acme/bin/perf
[acme@jouet linux]$ cat /proc/sys/kernel/kptr_restrict
1
[acme@jouet linux]$ cat /proc/sys/kernel/perf_event_paranoid
-1
[acme@jouet linux]$ ~acme/bin/perf record -a
   Warning: File /home/acme/.perfconfig not owned by current user or root, 
ignoring it.
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.

Couldn't record kernel reference relocation symbol
Symbol resolution may be skewed if relocation was used (e.g. kexec).
Check /proc/kallsyms permission or run as root.
^C[ perf record: Woken up 17 times to write data ]
[ perf record: Captured and wrote 6.474 MB perf.data (90771 samples) ]

[acme@jouet linux]$ perf report -D 2>&1 | grep PERF_RECORD_SAMPLE | wc -l
90770
[acme@jouet linux]$ perf evlist
   Warning: File /home/acme/.perfconfig not owned by current user or root, 
ignoring it.
cycles:ppp
[acme@jouet linux]$ ls -la ~/.perfconfig
-rw-rw-r--. 1 acme acme 27 Aug 12 17:57 /home/acme/.perfconfig
[acme@jouet linux]$


- Arnaldo

  

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
---

Resend with a meanless blank like removed.

---
  tools/perf/util/symbol.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 19c9c55..c55e781 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1947,7 +1947,7 @@ static bool symbol__read_kptr_restrict(void)
char line[8];
  
  		if (fgets(line, sizeof(line), fp) != NULL)

-   value = (geteuid() != 0) ?
+   value = ((geteuid() != 0) || (getuid() != 0)) ?
(atoi(line) != 0) :
(atoi(line) == 2);
  
--

1.8.3.4





Re: [PATCH] perf record: Fix segfault when running with suid and kptr_restrict is 1

2016-09-20 Thread Wangnan (F)



On 2016/9/21 11:48, Wang Nan wrote:

Before this patch, perf panics if kptr_restrict is set to 1 and perf is
owned by root with the suid bit set:

  $ whoami
  wangnan
  $ ls -l ./perf
  -rwsr-xr-x 1 root root 19781908 Sep 21 19:29 /home/wangnan/perf
  $ cat /proc/sys/kernel/kptr_restrict
  1
  $ cat /proc/sys/kernel/perf_event_paranoid
  -1
  $ ./perf record -a
  Segmentation fault (core dumped)

The reason is that perf assumes it is allowed to read kernel pointers
from /proc/kallsyms when euid is root, but in fact the kernel doesn't
allow reading kernel pointers when euid and uid do not match:

  $ cp /bin/cat .
  $ sudo chown root:root ./cat
  $ sudo chmod u+s ./cat
  $ cat /proc/kallsyms | grep do_fork
   T _do_fork  <--- kptr is hidden even euid is root
  $ sudo cat /proc/kallsyms | grep do_fork
  81080230 T _do_fork

See lib/vsprintf.c for kernel side code.

This patch fixes this problem by checking both uid and euid.

Signed-off-by: Wang Nan 
Cc: Arnaldo Carvalho de Melo 
---
  tools/perf/util/symbol.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 19c9c55..9528702 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -1946,8 +1946,9 @@ static bool symbol__read_kptr_restrict(void)
if (fp != NULL) {
char line[8];
  
+



Sorry for this blank line. Will fix it.


if (fgets(line, sizeof(line), fp) != NULL)
-   value = (geteuid() != 0) ?
+   value = ((geteuid() != 0) || (getuid() != 0)) ?
(atoi(line) != 0) :
(atoi(line) == 2);
  





Re: [PATCH] perf, tools: Handle events including .c and .o

2016-09-18 Thread Wangnan (F)



On 2016/9/18 22:56, Andi Kleen wrote:

On Sun, Sep 18, 2016 at 06:20:04PM +0800, Wangnan (F) wrote:


On 2016/9/18 9:02, Andi Kleen wrote:

From: Andi Kleen 

This is a generic bug fix, but it helps with Sukadev's JSON event tree
where such events can happen.

Any event including a .c/.o/.bpf currently triggers BPF compilation or loading
and then an error.  This can happen for some Intel JSON events, which cannot
be used.

Fix the scanner to only match for .o or .c or .bpf at the end.
This will prevent loading multiple BPF scripts separated with comma,
but I assume this is acceptable.

Cc: wangn...@huawei.com
Cc: suka...@linux.vnet.ibm.com
Signed-off-by: Andi Kleen 

I tested '.c' in the middle of an event:

  # perf trace --event 'aaa.ccc'
  invalid or unsupported event: 'aaa.ccc'
  Run 'perf list' for a list of valid events
  ...

It is not recognized as a BPF source.

So could you please provide an example to show how
this potential bug breaks the parsing of new events?

This is with the upcoming JSON uncore events:

$ perf stat -e '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}' -a -I 
1000
ERROR: problems with path {unc_p_clockticks,unc_p_power_state_occupancy.c: No 
such file or directory
event syntax error: '{unc_p_clockticks,unc_p_power_state_occupancy.cores_c0}'
  \___ Failed to load 
{unc_p_clockticks,unc_p_power_state_occupancy.c from source: Error when 
compiling BPF scriptlet

(add -v to see detail)
Run 'perf list' for a list of valid events

-Andi


I see, and your patch solves problems like this.

Tested-by: Wang Nan 


