Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-11 Thread Alexei Starovoitov
On Tue, Dec 10, 2013 at 7:35 PM, Masami Hiramatsu wrote:
> (2013/12/11 11:32), Alexei Starovoitov wrote:
>> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar  wrote:
>>>
>>> * Alexei Starovoitov  wrote:
>>>
>>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>>> filters. But that must not replace what the current filters do
>>>>> now. That is, it can be an add on, but not a replacement.
>>>>
>>>> Of course. tracing filters via bpf is an additional tool for kernel
>>>> debugging. bpf by itself has use cases beyond tracing.
>>>
>>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>>> most people.
>>
>> there is a misunderstanding here.
>> I was saying 'of course' to 'not replace current filter infra'.
>>
>> bpf does not depend on debug info.
>> That's the key difference between 'perf probe' approach and bpf filters.
>>
>> Masami is right that what I was trying to achieve with bpf filters
>> is similar to 'perf probe': insert a dynamic probe anywhere
>> in the kernel, walk pointers, data structures, print interesting stuff.
>>
>> 'perf probe' does it via scanning vmlinux with debug info.
>> bpf filters don't need it.
>> tools/bpf/trace/*_orig.c examples only depend on linux headers
>> in /lib/modules/../build/include/
>> Today bpf compiler struct layout is the same as x86_64.
>>
>> Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
>> of the front-end. Similar to -m32/-m64 and -m*-endian flags.
>> Neat part is that I don't need to do any work, just enable it properly in
>> the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
>> architecture that compiler is emitting code for.
>> So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
>> field offset by looking at /lib/modules/.../include/skbuff.h
>> whereas for 'perf probe' 'skb->dev' means walk debug info.
>
> Right, the offset within the data structure can be obtained from the header etc.
>
> However, how would bpf get the register or stack assignment of
> skb itself? In the tracepoint macro, it will be able to get it from
> function parameters (it needs a trick, like jprobe does).
> I doubt you can do that on kprobes/uprobes without any debuginfo
> support. :(

the 4/5 diff actually shows how it's working ;)
For kprobes it works at function entry, since the arguments are still
in registers, and it walks the pointers further down from there.
It cannot do func+line_number as perf-probe does, of course.
For tracepoints it's the same trick: call a non-inlined func with the
traceprobe args and call the inlined crash_setup_regs() that stores the regs.

Of course, there are limitations. For example, the 7th function argument
goes onto the stack and requires more work to get out. If a struct is not
defined in a .h file, it would need to be redefined in filter.c.
Corner cases, as you said.
Today the user of a bpf filter needs to know that arg1 goes into %rdi
and so on; that is easy to clean up.
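
(To illustrate — a minimal sketch in the style of the tools/bpf/trace/*_orig.c
examples; it assumes the bpf_context regs layout from the 4/5 patch, with the
entry registers saved by the probe and arg1 available as ctx->regs.di:)

void filter(struct bpf_context *ctx)
{
	/* x86-64 calling convention: arg1 in %rdi, arg2 in %rsi, arg3 in %rdx;
	 * the kprobe saved them into ctx->regs before invoking the filter */
	long arg1 = ctx->regs.di;
	long arg2 = ctx->regs.si;

	if (arg2 == 0x100) {
		char fmt[] = "arg1 %lx\n";
		bpf_trace_printk(fmt, sizeof(fmt), arg1, 0, 0);
	}
}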

>> Another use case is to optimize fetch sequences of dynamic probes
>> as Masami suggested, but the backward compatibility requirement
>> would preserve two ways of doing it as well.
>
> The backward compatibility issue is only for the interface, but not
> for the implementation, I think. :) The fetch method and filter
> pred already parse the arguments into a syntax tree. IMHO, bpf
> can optimize that tree into a simple opcode stream.

ahh. yes. that's doable.

Thanks
Alexei


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-10 Thread Masami Hiramatsu
(2013/12/11 11:32), Alexei Starovoitov wrote:
> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar  wrote:
>>
>> * Alexei Starovoitov  wrote:
>>
>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>> filters. But that must not replace what the current filters do
>>>> now. That is, it can be an add on, but not a replacement.
>>>
>>> Of course. tracing filters via bpf is an additional tool for kernel
>>> debugging. bpf by itself has use cases beyond tracing.
>>
>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>> most people.
> 
> there is a misunderstanding here.
> I was saying 'of course' to 'not replace current filter infra'.
> 
> bpf does not depend on debug info.
> That's the key difference between 'perf probe' approach and bpf filters.
> 
> Masami is right that what I was trying to achieve with bpf filters
> is similar to 'perf probe': insert a dynamic probe anywhere
> in the kernel, walk pointers, data structures, print interesting stuff.
> 
> 'perf probe' does it via scanning vmlinux with debug info.
> bpf filters don't need it.
> tools/bpf/trace/*_orig.c examples only depend on linux headers
> in /lib/modules/../build/include/
> Today bpf compiler struct layout is the same as x86_64.
> 
> Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
> of the front-end. Similar to -m32/-m64 and -m*-endian flags.
> Neat part is that I don't need to do any work, just enable it properly in
> the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
> architecture that compiler is emitting code for.
> So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
> field offset by looking at /lib/modules/.../include/skbuff.h
> whereas for 'perf probe' 'skb->dev' means walk debug info.

Right, the offset within the data structure can be obtained from the header etc.

However, how would bpf get the register or stack assignment of
skb itself? In the tracepoint macro, it will be able to get it from
function parameters (it needs a trick, like jprobe does).
I doubt you can do that on kprobes/uprobes without any debuginfo
support. :(

And is it possible to trace a field in a data structure which is
defined locally in somewhere.c ? :) (maybe it's just a corner case)

> Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
> walks all data structures in the same way x86_64 does it.
> Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
> Note that all -m* flags will be in one compiler. It won't grow any bigger
> because of that. All of it is already supported by C front-ends.
> It may sound complex, but it's really very little code for the bpf backend.
> 
> I didn't look inside systemtap/ktap enough to say how much they're
> relying on presence of debug info to make a comparison.
> 
> I see two main use cases for bpf tracing filters: debugging live kernel
> and collecting stats. Same tricks that [sk]tap do with their maps.
> Or maybe some of the stats that 'perf record' collects in userspace
> can be collected by bpf filter in kernel and stored into generic bpf table?
> 
>> Would it be possible to make BPF filters recognize exposed details
>> like the current filters do, without depending on the vmlinux?
> 
> Well, if you say that the presence of linux headers is also too much to ask,
> I can hook bpf in after the probes have stored all the args.
> 
> This way the current simple filter syntax can move to userspace:
> 'arg1==x || arg2!=y' can be parsed by userspace, bpf code
> generated and fed into the kernel. It will be faster than walk_pred_tree(),
> but if we cannot remove 2k lines from trace_events_filter.c
> because of backward compatibility, extra performance becomes
> the only reason to have two different implementations.
> 
> Another use case is to optimize fetch sequences of dynamic probes
> as Masami suggested, but the backward compatibility requirement
> would preserve two ways of doing it as well.

The backward compatibility issue is only for the interface, but not
for the implementation, I think. :) The fetch method and filter
pred already parse the arguments into a syntax tree. IMHO, bpf
can optimize that tree into a simple opcode stream.
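
(A hand-wavy sketch of that flattening — the names below are made up for
illustration, not actual trace_events_filter.c types: the pred tree for
'arg1 == 0x100 && arg2 != 0x200' becomes a straight-line program:)

enum { OP_LOAD_ARG, OP_EQ, OP_NE, OP_AND, OP_RET };

struct filter_op {
	int op;		/* opcode */
	int idx;	/* argument index for OP_LOAD_ARG */
	long val;	/* immediate operand for OP_EQ/OP_NE */
};

/* each comparison yields a boolean; OP_AND combines the last two results */
static const struct filter_op prog[] = {
	{ OP_LOAD_ARG, 1, 0     },
	{ OP_EQ,       0, 0x100 },
	{ OP_LOAD_ARG, 2, 0     },
	{ OP_NE,       0, 0x200 },
	{ OP_AND,      0, 0     },
	{ OP_RET,      0, 0     },
};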

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-10 Thread Alexei Starovoitov
On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar  wrote:
>
> * Alexei Starovoitov  wrote:
>
>> > I'm fine if it becomes a requirement to have a vmlinux built with
>> > DEBUG_INFO to use BPF and have a tool like perf to translate the
>> > filters. But that must not replace what the current filters do
>> > now. That is, it can be an add on, but not a replacement.
>>
>> Of course. tracing filters via bpf is an additional tool for kernel
>> debugging. bpf by itself has use cases beyond tracing.
>
> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
> most people.

there is a misunderstanding here.
I was saying 'of course' to 'not replace current filter infra'.

bpf does not depend on debug info.
That's the key difference between 'perf probe' approach and bpf filters.

Masami is right that what I was trying to achieve with bpf filters
is similar to 'perf probe': insert a dynamic probe anywhere
in the kernel, walk pointers, data structures, print interesting stuff.

'perf probe' does it via scanning vmlinux with debug info.
bpf filters don't need it.
tools/bpf/trace/*_orig.c examples only depend on linux headers
in /lib/modules/../build/include/
Today bpf compiler struct layout is the same as x86_64.

Tomorrow bpf compiler will have flags to adjust endianness, pointer size, etc
of the front-end. Similar to -m32/-m64 and -m*-endian flags.
Neat part is that I don't need to do any work, just enable it properly in
the bpf backend. From gcc/llvm point of view, bpf is yet another 'hw'
architecture that compiler is emitting code for.
So when C code of filter_ex1_orig.c does 'skb->dev', compiler determines
field offset by looking at /lib/modules/.../include/skbuff.h
whereas for 'perf probe' 'skb->dev' means walk debug info.
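
(Roughly — a sketch, not the actual filter_ex1_orig.c contents; nothing below
needs debug info, since the compiler resolves the skb->dev offset from the
struct definition in the build headers, exactly as for ordinary kernel code:)

#include <linux/skbuff.h>	/* struct sk_buff layout from the build headers */

void filter(struct bpf_context *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
	char fmt[] = "dev %p\n";

	/* 'skb->dev' compiles to a load at the field offset taken from
	 * the header above, not from vmlinux debug info */
	if (skb->dev)
		bpf_trace_printk(fmt, sizeof(fmt), (long)skb->dev, 0, 0);
}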

Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
walks all data structures in the same way x86_64 does it.
Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
Note that all -m* flags will be in one compiler. It won't grow any bigger
because of that. All of it is already supported by C front-ends.
It may sound complex, but it's really very little code for the bpf backend.

I didn't look inside systemtap/ktap enough to say how much they're
relying on presence of debug info to make a comparison.

I see two main use cases for bpf tracing filters: debugging live kernel
and collecting stats. Same tricks that [sk]tap do with their maps.
Or maybe some of the stats that 'perf record' collects in userspace
can be collected by bpf filter in kernel and stored into generic bpf table?

> Would it be possible to make BPF filters recognize exposed details
> like the current filters do, without depending on the vmlinux?

Well, if you say that the presence of linux headers is also too much to ask,
I can hook bpf in after the probes have stored all the args.

This way the current simple filter syntax can move to userspace:
'arg1==x || arg2!=y' can be parsed by userspace, bpf code
generated and fed into the kernel. It will be faster than walk_pred_tree(),
but if we cannot remove 2k lines from trace_events_filter.c
because of backward compatibility, extra performance becomes
the only reason to have two different implementations.
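
(A sketch of that hooking point, with hypothetical names — assume the probe
layer hands bpf an array of already-fetched args instead of raw registers:)

/* hypothetical context: args pre-fetched by the probe, no headers needed */
struct bpf_probe_context {
	long args[6];
};

/* what userspace could generate from 'arg1 == 0x100 || arg2 != 0x200' */
void filter(struct bpf_probe_context *ctx)
{
	char fmt[] = "match %lx %lx\n";

	if (ctx->args[0] == 0x100 || ctx->args[1] != 0x200)
		bpf_trace_printk(fmt, sizeof(fmt), ctx->args[0], ctx->args[1], 0);
}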

Another use case is to optimize fetch sequences of dynamic probes
as Masami suggested, but the backward compatibility requirement
would preserve two ways of doing it as well.

imo the current hook of bpf into tracing is more compelling, but let me
think more about reusing data stored in the ring buffer.

Thanks
Alexei


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-10 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> > I'm fine if it becomes a requirement to have a vmlinux built with 
> > DEBUG_INFO to use BPF and have a tool like perf to translate the 
> > filters. But that must not replace what the current filters do
> > now. That is, it can be an add on, but not a replacement.
> 
> Of course. tracing filters via bpf is an additional tool for kernel 
> debugging. bpf by itself has use cases beyond tracing.

Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for 
most people.

Would it be possible to make BPF filters recognize exposed details
like the current filters do, without depending on the vmlinux?

Thanks,

Ingo


Re: binary blob no more! Was: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-09 Thread Steven Rostedt
On Sun, 8 Dec 2013 19:36:18 -0800
Alexei Starovoitov  wrote:

 
> Actually I think there are a few ways to include the source equivalent
> in the bpf image.
> 
> Approach #1
> include original C source code into bpf image:
>   bpf_image = bpf_insns + original_C
> this will imply that C code can have #include's of linux kernel headers only
> and it can only be C source.
> this way the user can do 'cat /sys/kernel/debug/bpf/filter', the kernel
> will print original_C, and these restrictions will guarantee that it
> will compile into similar bpf code whether the gcc or llvm compiler is
> used.
> 
> Approach #2
> include original llvm bitcode:
>   bpf_image = bpf_insns + llvm_bc
> The user can do 'cat .../filter' and use llvm-dis to see human readable 
> bitcode.
> It takes practice to read it, but it's high level enough to understand
> what filter is doing. llvm-llc can be used to generate bpf_insns
> again, or generate C from bitcode.
> Pro vs 1: bitcode is very compact
> Con: only the llvm compiler can be used to generate bpf instructions
> 
> Enforcement can be done by having a user space daemon that
> walks over all loaded filters and recompiles them from C or from bitcode.
> 
> Please let me know which approach you prefer.

I don't like either. And different compilers may produce different
results, so that daemon may not be able to verify that what is in the C
code is really what's in the bitcode.

> 
> I still think that bpf_image = bpf_insns + license_string is just as good,
> since bpf code can only call a tiny set of functions, so no matter what
> the code does its scope is very limited, and license enforcement
> guarantees that the original source has to be available,
> but I'm ok either way.

I like that approach much better. That is, all binary code must state
that it is under the GPL. That way, if you give a binary to someone,
you must also supply the source under the GPL license.

Having a disassembler in the kernel to see what code is loaded adds
the benefit that you can see what is there. We can have a
userspace tool to make even more sense out of the disassembled code.

I don't think the kernel should have anything more than a disassembler
though. Maybe that's even too much, but at least a human can inspect it
a little without needing extra tools.

> 
> Also please indicate whether gcc or llvm backend is preferred to
> be hosted in tools.

If we end up placing a compiler in tools, then that compiler should
also be able to be used to compile the entire kernel.

Maybe we will finally get our kcc ;-)

-- Steve


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-09 Thread Masami Hiramatsu
(2013/12/09 16:29), Namhyung Kim wrote:
> Hi Masami,
> 
> On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
>> (2013/12/04 3:26), Alexei Starovoitov wrote:
>>> the only inconvenience so far is to know how parameters are getting
>>> into registers.
>>> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
>>> after first step is done.
>>
>> Actually, that part is done by the perf-probe and ftrace dynamic events
>> (kernel/trace/trace_probe.c). I think this generic BPF is good for
>> re-implementing fetch methods. :)
> 
> For implementing the fetch method, it seems that it needs access to user
> memory, the stack, and/or current (task_struct - for utask or vma later) from
> the BPF VM as well.  Is that OK from a security perspective?

Do you mean security or safety?  :)
For safety, I think we can check that the BPF binary doesn't break anything.
Anyway, for the fetch method, I think we have to make a generic syntax tree
for the archs which don't support BPF, and the BPF bytecode will be generated
from the syntax tree. IOW, I'd like to use BPF just for optimizing
memory address calculation.
For security, since it is hard to check what is sensitive information in
the kernel, I think it should be restricted to root users for a while.
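
(To make "optimizing memory address calculation" concrete — a hedged sketch,
not actual trace_probe.c code: a kprobe-events fetch arg like '+8(+0(%di))'
is interpreted today through a chain of indirect fetch calls; compiled, it
collapses into straight-line loads:)

/* what '+8(+0(%di))' boils down to once compiled: two loads and no
 * per-node interpreter dispatch; a real implementation would use safe
 * probe reads and error checking, elided here (x86-64 pt_regs assumed) */
static long fetch_compiled(struct pt_regs *regs)
{
	void *p = *(void **)(regs->di + 0);	/* +0(%di) */
	return *(long *)((char *)p + 8);	/* +8(...) */
}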

> Anyway, I'll take a look at it later if I have time, but I want to get
> the existing/pending implementation merged first. :)

Yes, of course ! :)

Thank you,
-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-08 Thread Namhyung Kim
Hi Masami,

On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
> (2013/12/04 3:26), Alexei Starovoitov wrote:
>> the only inconvenience so far is to know how parameters are getting
>> into registers.
>> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
>> after first step is done.
>
> Actually, that part is done by the perf-probe and ftrace dynamic events
> (kernel/trace/trace_probe.c). I think this generic BPF is good for
> re-implementing fetch methods. :)

For implementing the fetch method, it seems that it needs access to user
memory, the stack, and/or current (task_struct - for utask or vma later) from
the BPF VM as well.  Is that OK from a security perspective?

Anyway, I'll take a look at it later if I have time, but I want to get
the existing/pending implementation merged first. :)

Thanks,
Namhyung


Re: Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-08 Thread Masami Hiramatsu
(2013/12/08 1:21), Jovi Zhangwei wrote:
> On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu wrote:
>> (2013/12/06 14:19), Jovi Zhangwei wrote:
>>> Hi Alexei,
>>>
>>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov  
>>> wrote:
>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote:
>>>>>
>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>> How much faster is it?
>>>>
>>>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>>>> trace skb:kfree_skb {
>>>>         if (arg2 == 0x100) {
>>>>                 printf("%x %x\n", arg1, arg2)
>>>>         }
>>>> }
>>>> 1M skb alloc/free 350315 (usecs)
>>>>
>>>> baseline without any tracing:
>>>> 1M skb alloc/free 145400 (usecs)
>>>>
>>>> then equivalent bpf test:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>         void *loc = (void *)ctx->regs.dx;
>>>>         if (loc == 0x100) {
>>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>                 char fmt[] = "skb %p loc %p\n";
>>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>         }
>>>> }
>>>> 1M skb alloc/free 183214 (usecs)
>>>>
>>>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>>>
>>>> obviously ktap is an interpreter, so it's not really fair.
>>>>
>>>> To make it really unfair I did:
>>>> trace skb:kfree_skb {
>>>>         if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>>>>             arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>>>>             arg2 == 0x900 || arg2 == 0x1000) {
>>>>                 printf("%x %x\n", arg1, arg2)
>>>>         }
>>>> }
>>>> 1M skb alloc/free 484280 (usecs)
>>>>
>>>> and corresponding bpf:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>         void *loc = (void *)ctx->regs.dx;
>>>>         if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>>>>             loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>>>>             loc == 0x900 || loc == 0x1000) {
>>>>                 struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>                 char fmt[] = "skb %p loc %p\n";
>>>>                 bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>         }
>>>> }
>>>> 1M skb alloc/free 185660 (usecs)
>>>>
>>>> the difference is bigger now: 484-145 vs 185-145

>>> There is a big difference between fetching arg2 (in ktap) and direct
>>> register access (ctx->regs.dx).
>>>
>>> The current argument fetching (arg2 in the above testcase) implementation in
>>> ktap is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
>>> The only way to speed it up is a kernel tracing code change that lets an
>>> external tracing module access event fields not through a list lookup.
>>> This work is not started yet. :)
>>
>> I'm not sure why you can't access it directly from the ftrace-event buffer.
>> There is just a packed data structure and it is exposed via debugfs.
>> You can decode it and get an offset/size by using libtraceevent.
>>
> Then it means the event field info needs to be passed back into the kernel,
> which looks strange because the kernel structure is the source of the event
> field info; it's like a loop-back, and it needs to engage with libtraceevent
> in userspace.

No, the static traceevents have their own kernel data structures, but
the dynamic events don't. They expose the data format (offset/type)
via debugfs, but do not define a new data structure.
So, I meant it is enough for the script to take an offset and cast
to the corresponding size.
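
(A hedged illustration — a hypothetical helper, not ktap or ftrace code: given
the offset/size a tool parsed out of the event's debugfs 'format' file,
fetching a field from a raw record is just a cast:)

/* offset/size come from
 * /sys/kernel/debug/tracing/events/<subsys>/<event>/format */
static unsigned long long fetch_field(const char *rec, int offset, int size)
{
	switch (size) {
	case 1: return *(const unsigned char *)(rec + offset);
	case 2: return *(const unsigned short *)(rec + offset);
	case 4: return *(const unsigned int *)(rec + offset);
	default: return *(const unsigned long long *)(rec + offset);
	}
}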

> (the side effect is that it will make compilation slow and consume more
> memory; sometimes it will process 20K events in one script, like 'trace
> probe:big_dso:*')

I doubt it, since you need to get formats only for the events that
the script is using.

> So "the only way" which I said is wrong, your approach indeed is another way.
> I just think maybe use array instead of list for event fields would be more
> efficient if list is not must needed. we can check it more in future.

Ah, perhaps I misunderstood the ktap implementation. Does it define dynamic
events right before loading a bytecode? In that case, I recommend changing
the loader to adjust the bytecode after defining the events, tuning the
offset information to fit the target event format.

e.g.
 1) compile a bytecode with dummy offsets
 2) define new additional dynamic events
 3) get the field offset information from the events
 4) modify the bytecode to replace the dummy offsets with the correct ones in memory
 5) load the bytecode
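
(Step 4 as a hedged sketch — it assumes the bytecode carries a relocation
table recording where the dummy offsets live; none of these names are real
ktap structures:)

struct bpf_insn { unsigned char code; unsigned char regs; short off; int imm; };	/* simplified encoding */

struct offset_reloc {
	unsigned int insn_idx;	/* which instruction carries a dummy offset */
	unsigned int field_id;	/* which event field it refers to */
};

/* patch the bytecode in memory with the offsets read back from the
 * just-defined event's format file (step 3) */
static void patch_offsets(struct bpf_insn *insns,
			  const struct offset_reloc *relocs, int nrelocs,
			  const int *field_offsets)
{
	int i;

	for (i = 0; i < nrelocs; i++)
		insns[relocs[i].insn_idx].off = field_offsets[relocs[i].field_id];
}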

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com



binary blob no more! Was: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-08 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 9:43 PM, Alexei Starovoitov  wrote:
> On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar  wrote:
>>
>>> Also I'm thinking to add 'license_string' section to bpf binary format
>>> and call license_is_gpl_compatible() on it during load.
>>> If false, then just reject it…. not even messing with taint flags...
>>> That would be way stronger indication of bpf licensing terms than what
>>> we have for .ko
>>
>> But will BPF tools generate such gpl-compatible license tags by
>> default? If yes then this might work, combined with the facility
>> below. If not then it's just a nuisance to users.
>
> yes. similar to existing .ko module_license() tag. see below.
>
>> My concern would be solved by adding a facility to always be able to
>> dump source code as well, i.e. trivially transform it to C or so, so
>> that people can review it - or just edit it on the fly, recompile and
>> reinsert? Most BPF scripts ought to be pretty simple.
>
> C code has '#include's in it, so without storing fully preprocessed code
> it will not be equivalent, but then the true source will be gigantic.
> It can be zipped, but that sounds like overkill.
> Also we might want other languages with their own dependent includes.
> Sure, we can have a section in the bpf binary that has the source, but it's
> not enforceable. The kernel cannot know that it's the actual source:
> gcc/llvm will produce different bpf code out of the same source, the
> source could be in C or in language X, etc.
> It doesn't seem that including some form of source will help
> with enforcing the license.
>
> imo requiring a module_license("gpl"); line in C code and an equivalent
> string in all other languages that want to translate to bpf would be a
> stronger indication of licensing terms.
> then the compiler would have to include that string in the 'license_string'
> section and the kernel can actually enforce it.

Actually I think there are a few ways to include the source equivalent
in the bpf image.

Approach #1
include original C source code into bpf image:
  bpf_image = bpf_insns + original_C
this will imply that C code can have #include's of linux kernel headers only
and it can only be C source.
this way the user can do 'cat /sys/kernel/debug/bpf/filter', the kernel
will print original_C, and these restrictions will guarantee that it
will compile into similar bpf code whether the gcc or llvm compiler is
used.

Approach #2
include original llvm bitcode:
  bpf_image = bpf_insns + llvm_bc
The user can do 'cat .../filter' and use llvm-dis to see human readable bitcode.
It takes practice to read it, but it's high level enough to understand
what filter is doing. llvm-llc can be used to generate bpf_insns
again, or generate C from bitcode.
Pro vs 1: bitcode is very compact
Con: only the llvm compiler can be used to generate bpf instructions

Enforcement can be done by having a user space daemon that
walks over all loaded filters and recompiles them from C or from bitcode.

Please let me know which approach you prefer.

I still think that bpf_image = bpf_insns + license_string is just as good,
since bpf code can only call a tiny set of functions, so no matter what
the code does its scope is very limited, and license enforcement
guarantees that the original source has to be available,
but I'm ok either way.
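
(For concreteness, a hypothetical image layout — the field names are invented
here, not taken from the patch set:)

/* bpf_image = bpf_insns + license_string, as one loadable blob */
struct bpf_image_hdr {
	unsigned int insn_cnt;		/* number of instructions that follow */
	unsigned int license_len;	/* bytes of NUL-terminated license text */
	/* followed by insn_cnt instructions, then license_len bytes of
	 * license string; the loader would run license_is_gpl_compatible()
	 * on it and reject the image if the check fails */
};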

Also please indicate whether gcc or llvm backend is preferred to
be hosted in tools.

Build of the gcc backend is slow (takes ~100 sec), since the front-end,
optimizer and backend are a single binary of ~13M.
It doesn't need any other files to compile filter.c into bpf_image.

Build of the llvm backend ('llc') takes ~10 sec, since it has to compile only
the bpf backend files. But it would need the clang package to translate C into
llvm bitcode and 'llc' (a single 8M binary) to compile the bitcode into
bpf_image.

Thanks
Alexei


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-07 Thread Jovi Zhangwei
On Sat, Dec 7, 2013 at 9:12 AM, Alexei Starovoitov  wrote:
> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen  wrote:
>> "H. Peter Anvin"  writes:
>>>
>>> Not to mention that in that case we might as well -- since we need a
>>> compiler anyway -- generate the machine code in user space; the JIT
>>> solution really only is useful if it can provide something that we can't
>>> do otherwise, e.g. enable it in secure boot environments.
>>
>> I can see there may be some setups which don't have a compiler
>> (e.g. I know some people don't use systemtap because of that)
>> But this needs a custom gcc install too as far as I understand.
>
> fyi custom gcc is a single 13M binary. It doesn't depend on any
> include files or any libraries.
> and can be easily packaged together with perf... even for an embedded
> environment.

Hmm, a 13M binary is big IMO; perf is just 5M after compilation on my system.
I'm not sure embedding a custom gcc into perf is a good idea. (And would we
need to compile that custom gcc every time we build perf?)

IMO gcc size is not all/main reason of why embedded system didn't
install it, I saw many many production embedded system, no one
install gcc, also gdb, etc. I would never expect Android will install
gcc in some day, I also will really surprise if telcom-vender deliver
Linux board with gcc installed to customers.

Another question is: does the custom gcc of bpf-filter need kernel
header file for compilation? if it need, then this issue is more bigger
than gcc size for embedded system.(same problem like Systemtap)

Thanks,

Jovi.


Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-07 Thread Jovi Zhangwei
On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu
 wrote:
> (2013/12/06 14:19), Jovi Zhangwei wrote:
>> Hi Alexei,
>>
>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov  
>> wrote:
>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
>>>>>
>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>> How much faster is it?
>>>
>>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier 
>>> email:
>>> trace skb:kfree_skb {
>>> if (arg2 == 0x100) {
>>> printf("%x %x\n", arg1, arg2)
>>> }
>>> }
>>> 1M skb alloc/free 350315 (usecs)
>>>
>>> baseline without any tracing:
>>> 1M skb alloc/free 145400 (usecs)
>>>
>>> then equivalent bpf test:
>>> void filter(struct bpf_context *ctx)
>>> {
>>> void *loc = (void *)ctx->regs.dx;
>>> if (loc == 0x100) {
>>> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>> char fmt[] = "skb %p loc %p\n";
>>> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>> }
>>> }
>>> 1M skb alloc/free 183214 (usecs)
>>>
>>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>>
>>> obviously ktap is an interpreter, so it's not really fair.
>>>
>>> To make it really unfair I did:
>>> trace skb:kfree_skb {
>>> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 
>>> 0x400 ||
>>> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 
>>> 0x800 ||
>>> arg2 == 0x900 || arg2 == 0x1000) {
>>> printf("%x %x\n", arg1, arg2)
>>> }
>>> }
>>> 1M skb alloc/free 484280 (usecs)
>>>
>>> and corresponding bpf:
>>> void filter(struct bpf_context *ctx)
>>> {
>>> void *loc = (void *)ctx->regs.dx;
>>> if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>>> loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>>> loc == 0x900 || loc == 0x1000) {
>>> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>> char fmt[] = "skb %p loc %p\n";
>>> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>> }
>>> }
>>> 1M skb alloc/free 185660 (usecs)
>>>
>>> the difference is bigger now: 484-145 vs 185-145
>>>
>> There is a big difference between comparing arg2 (in ktap) and direct
>> register access (ctx->regs.dx).
>>
>> The current argument fetching (arg2 in the above testcase) implementation
>> in ktap is very inefficient; see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
>> The only way to speed it up is a kernel tracing code change that lets an
>> external tracing module access event fields without a list lookup. This
>> work is not started yet. :)
>
> I'm not sure why you can't access it directly from ftrace-event buffer.
> There is just a packed data structure and it is exposed via debugfs.
> You can decode it and can get an offset/size by using libtraceevent.
>
Then it means the event field info needs to be passed into the kernel from
userspace, which looks strange because the kernel structure is the source of
the event field info; it's like a loop-back, and it needs to engage with
libtraceevent in userspace.
(The side effect is that it will make compilation slow and consume more
memory; sometimes one script will process 20K events, like
'trace probe:big_dso:*'.)

So "the only way" which I said is wrong; your approach is indeed another
way. I just think using an array instead of a list for the event fields
might be more efficient, if a list is not strictly needed. We can look
into it more in the future.

Thanks.

Jovi.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-06 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen  wrote:
> "H. Peter Anvin"  writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.

fyi custom gcc is a single 13M binary. It doesn't depend on any
include files or any libraries.
and can be easily packaged together with perf... even for embedded environment.


Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-06 Thread Alexei Starovoitov
On Fri, Dec 6, 2013 at 3:54 PM, Masami Hiramatsu
 wrote:
> (2013/12/06 14:16), Alexei Starovoitov wrote:
>> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen  wrote:
>>>> the difference is bigger now: 484-145 vs 185-145
>>>
>>> This is an obvious improvement, but imho not big enough to be extremely
>>> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
>>> that would justify a lot of code)
>>
>> hmm. we're comparing against ktap here…
>> which has 5x more kernel code and is 8x slower in this test...
>>
>>> Your code requires a compiler, so from my perspective it
>>> wouldn't be a lot easier or faster to use than just changing
>>> the code directly and recompile.
>>>
>>> The users want something simple too that shields them from
>>> having to learn all the internals. They don't want to recompile.
>>> As far as I can tell your code is a bit too low level for that,
>>> and the requirement for the compiler may also scare them.
>>>
>>> Where exactly does it fit?
>>
>> the goal is to have llvm compiler next to perf, wrapped in a user friendly 
>> way.
>>
>> compiling small filter vs recompiling full kernel…
>> inserting into live kernel vs rebooting …
>> not sure how you're saying it's equivalent.
>>
>> In my kernel debugging experience current tools (tracing, systemtap)
>> were rarely enough.
>> I always had to add my own printks through the code, recompile and reboot.
>> Often just to see that it's not the place where I want to print things
>> or it's too verbose.
>> Then I would adjust printks, recompile and reboot again.
>> That was slow and tedious, since I would be crashing things from time to time
>> just because skb doesn't always have a valid dev or I made a typo.
>> For debugging I do really need something quick and dirty that lets me
>> add my own printk
>> of whatever structs I want anywhere in the kernel without crashing it.
>> That's exactly what bpf tracing filters do.
>
> I recommend you to use perf-probe. That will give you an easy solution. :)

it is indeed very cool.
Thanks!


Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-06 Thread Masami Hiramatsu
(2013/12/06 14:19), Jovi Zhangwei wrote:
> Hi Alexei,
> 
> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov  wrote:
>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:

>>>> Can you do some performance comparison compared to e.g. ktap?
>>>> How much faster is it?
>>
>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>> trace skb:kfree_skb {
>> if (arg2 == 0x100) {
>> printf("%x %x\n", arg1, arg2)
>> }
>> }
>> 1M skb alloc/free 350315 (usecs)
>>
>> baseline without any tracing:
>> 1M skb alloc/free 145400 (usecs)
>>
>> then equivalent bpf test:
>> void filter(struct bpf_context *ctx)
>> {
>> void *loc = (void *)ctx->regs.dx;
>> if (loc == 0x100) {
>> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>> char fmt[] = "skb %p loc %p\n";
>> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>> }
>> }
>> 1M skb alloc/free 183214 (usecs)
>>
>> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>>
>> obviously ktap is an interpreter, so it's not really fair.
>>
>> To make it really unfair I did:
>> trace skb:kfree_skb {
>> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 
>> ||
>> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 
>> ||
>> arg2 == 0x900 || arg2 == 0x1000) {
>> printf("%x %x\n", arg1, arg2)
>> }
>> }
>> 1M skb alloc/free 484280 (usecs)
>>
>> and corresponding bpf:
>> void filter(struct bpf_context *ctx)
>> {
>> void *loc = (void *)ctx->regs.dx;
>> if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
>> loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
>> loc == 0x900 || loc == 0x1000) {
>> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>> char fmt[] = "skb %p loc %p\n";
>> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>> }
>> }
>> 1M skb alloc/free 185660 (usecs)
>>
>> the difference is bigger now: 484-145 vs 185-145
>>
> There is a big difference between comparing arg2 (in ktap) and direct
> register access (ctx->regs.dx).
> 
> The current argument fetching (arg2 in the above testcase) implementation
> in ktap is very inefficient; see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
> The only way to speed it up is a kernel tracing code change that lets an
> external tracing module access event fields without a list lookup. This
> work is not started yet. :)

I'm not sure why you can't access it directly from ftrace-event buffer.
There is just a packed data structure and it is exposed via debugfs.
You can decode it and can get an offset/size by using libtraceevent.
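
For example, a small userspace sketch along these lines can pull a field's
offset/size out of the format file (shown with today's libtraceevent tep_*
names; the 2013-era library spelled these pevent_*; error handling elided):

#include <stdio.h>
#include <event-parse.h>

int main(void)
{
	char buf[8192];
	FILE *f = fopen("/sys/kernel/debug/tracing/events/skb/kfree_skb/format", "r");
	size_t len = fread(buf, 1, sizeof(buf), f);	/* raw format text */
	struct tep_handle *tep = tep_alloc();
	struct tep_event *event = NULL;
	struct tep_format_field *field;

	fclose(f);
	/* parse the format file into an event description */
	tep_parse_format(tep, &event, buf, len, "skb");
	/* look up the 'location' field of skb:kfree_skb */
	field = tep_find_field(event, "location");
	printf("location: offset=%d size=%d\n", field->offset, field->size);
	tep_free(tep);
	return 0;
}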

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-06 Thread Masami Hiramatsu
(2013/12/06 14:16), Alexei Starovoitov wrote:
> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen  wrote:
>>> the difference is bigger now: 484-145 vs 185-145
>>
>> This is an obvious improvement, but imho not big enough to be extremely
>> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
>> that would justify a lot of code)
> 
> hmm. we're comparing against ktap here…
> which has 5x more kernel code and is 8x slower in this test...
> 
>> Your code requires a compiler, so from my perspective it
>> wouldn't be a lot easier or faster to use than just changing
>> the code directly and recompile.
>>
>> The users want something simple too that shields them from
>> having to learn all the internals. They don't want to recompile.
>> As far as I can tell your code is a bit too low level for that,
>> and the requirement for the compiler may also scare them.
>>
>> Where exactly does it fit?
> 
> the goal is to have llvm compiler next to perf, wrapped in a user friendly 
> way.
> 
> compiling small filter vs recompiling full kernel…
> inserting into live kernel vs rebooting …
> not sure how you're saying it's equivalent.
> 
> In my kernel debugging experience current tools (tracing, systemtap)
> were rarely enough.
> I always had to add my own printks through the code, recompile and reboot.
> Often just to see that it's not the place where I want to print things
> or it's too verbose.
> Then I would adjust printks, recompile and reboot again.
> That was slow and tedious, since I would be crashing things from time to time
> just because skb doesn't always have a valid dev or I made a typo.
> For debugging I do really need something quick and dirty that lets me
> add my own printk
> of whatever structs I want anywhere in the kernel without crashing it.
> That's exactly what bpf tracing filters do.

I recommend you to use perf-probe. That will give you an easy solution. :)
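
For example, the printk-style debugging described in the quoted mail can be
done without writing any C, along these lines (assuming a vmlinux built with
DEBUG_INFO so that perf probe can resolve the variables):

  # add a dynamic probe that fetches skb members via debuginfo
  perf probe --add 'kfree_skb skb->len skb->dev'
  perf record -e probe:kfree_skb -aR sleep 10
  perf script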


Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-06 Thread Frank Ch. Eigler

hpa wrote:

>> I can see there may be some setups which don't have a compiler
>> (e.g. I know some people don't use systemtap because of that)
>> But this needs a custom gcc install too as far as I understand.
>
> Yes... but no compiler and secure boot tend to go together, or at
> least will in the future.

(Maybe not: we're already experimenting with support for secureboot in
systemtap.)

- FChE


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov  wrote:
>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
>>>
>>> Can you do some performance comparison compared to e.g. ktap?
>>> How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 
> ||
> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 
> ||
> arg2 == 0x900 || arg2 == 0x1000) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 484280 (usecs)
>
I lost my mind for a while. :)

If bpf only focuses on filtering, then it's not fair to compare with ktap
like that, since ktap can easily make use of the current kernel filter;
you should use the script below:

trace skb:kfree_skb /location == 0x100 || location == 0x200 || .../ {
printf("%x %x\n", arg1, arg2)
}

As ktap is a user of the current simple kernel tracing filter, I fully
agree with Steven: "it can be an add on, but not a replacement."


Thanks,

Jovi


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
On Fri, Dec 6, 2013 at 9:20 AM, Andi Kleen  wrote:
> "H. Peter Anvin"  writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.
>
If it depends on gcc, then it looks like Systemtap. It is a big
inconvenience for embedded environments and many production systems
to install gcc.
(Not sure if it needs a kernel compilation environment as well.)

It seems the event filter is bound to a specific event, so it's not
possible to trace many events in a cooperating style; look at the
Systemtap and ktap samples, where many event handlers need to cooperate.
The simplest example is recording syscall execution time (duration of
exit - entry).

If this design is intentional, then I would think it targets speeding up
the current kernel tracing filter (but it needs an extra userspace filter
compiler).

And I guess the bpf filter still needs to keep userspace tracing in
mind :), if it wants to be a complete and integrated tracing solution
(using a separate userspace compiler or translator to resolve symbols).

Thanks

Jovi


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar  wrote:
>
>> Also I'm thinking to add 'license_string' section to bpf binary format
>> and call license_is_gpl_compatible() on it during load.
>> If false, then just reject it…. not even messing with taint flags...
>> That would be way stronger indication of bpf licensing terms than what
>> we have for .ko
>
> But will BPF tools generate such gpl-compatible license tags by
> default? If yes then this might work, combined with the facility
> below. If not then it's just a nuisance to users.

yes. similar to existing .ko module_license() tag. see below.

> My concern would be solved by adding a facility to always be able to
> dump source code as well, i.e. trivially transform it to C or so, so
> that people can review it - or just edit it on the fly, recompile and
> reinsert? Most BFP scripts ought to be pretty simple.

C code has '#include's in it, so without storing fully preprocessed code
it will not be equivalent, but then the true source would be gigantic.
It can be zipped, but that sounds like overkill.
Also we might want other languages with their own dependent includes.
Sure, we can have a section in the bpf binary that holds the source, but
it's not enforceable: the kernel cannot know that it's the actual source.
gcc/llvm will produce different bpf code out of the same source, the
source may be in C or in language X, etc.
It doesn't seem that including some form of source will help with
enforcing the license.

imo requiring a module_license("gpl"); line in C code, and an equivalent
string in all other languages that want to translate to bpf, would be a
stronger indication of licensing terms.
The compiler would then have to include that string in a 'license_string'
section, and the kernel could actually enforce it.
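
A kernel-side sketch of that enforcement; license_is_gpl_compatible()
already exists in include/linux/license.h, while the bpf_check_license()
wrapper and the parsing of the 'license_string' section are assumed:

#include <linux/errno.h>
#include <linux/license.h>

/* hypothetical: called at load time with the contents of the
 * image's 'license_string' section */
static int bpf_check_license(const char *license)
{
	/* reject outright instead of loading and tainting */
	if (!license || !license_is_gpl_compatible(license))
		return -EINVAL;
	return 0;
}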


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
Hi Alexei,

On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov  wrote:
>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
>>>
>>> Can you do some performance comparison compared to e.g. ktap?
>>> How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 
> ||
> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 
> ||
> arg2 == 0x900 || arg2 == 0x1000) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 484280 (usecs)
>
> and corresponding bpf:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
> loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
> loc == 0x900 || loc == 0x1000) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 185660 (usecs)
>
> the difference is bigger now: 484-145 vs 185-145
>
There is a big difference between comparing arg2 (in ktap) and direct
register access (ctx->regs.dx).

The current argument fetching (arg2 in the above testcase) implementation
in ktap is very inefficient; see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
The only way to speed it up is a kernel tracing code change that lets an
external tracing module access event fields without a list lookup. This
work is not started yet. :)

Of course, I'm not saying this argument fetching issue is the root cause
of the performance gap compared with bpf and Systemtap; bytecode execution
speed can't compare with raw machine code.
(There is a plan to use a JIT in the ktap core, like the luajit project,
but it needs some time to work on.)


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen  wrote:
>> the difference is bigger now: 484-145 vs 185-145
>
> This is an obvious improvement, but imho not big enough to be extremely
> compelling (< cost 1-2 cache misses, no orders of magnitude improvements
> that would justify a lot of code)

hmm. we're comparing against ktap here…
which has 5x more kernel code and is 8x slower in this test...

> Your code requires a compiler, so from my perspective it
> wouldn't be a lot easier or faster to use than just changing
> the code directly and recompile.
>
> The users want something simple too that shields them from
> having to learn all the internals. They don't want to recompile.
> As far as I can tell your code is a bit too low level for that,
> and the requirement for the compiler may also scare them.
>
> Where exactly does it fit?

the goal is to have llvm compiler next to perf, wrapped in a user friendly way.

compiling small filter vs recompiling full kernel…
inserting into live kernel vs rebooting …
not sure how you're saying it's equivalent.

In my kernel debugging experience current tools (tracing, systemtap)
were rarely enough.
I always had to add my own printks through the code, recompile and reboot.
Often just to see that it's not the place where I want to print things
or it's too verbose.
Then I would adjust printks, recompile and reboot again.
That was slow and tedious, since I would be crashing things from time to time
just because skb doesn't always have a valid dev or I made a typo.
For debugging I do really need something quick and dirty that lets me
add my own printk
of whatever structs I want anywhere in the kernel without crashing it.
That's exactly what bpf tracing filters do.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 3:37 PM, Steven Rostedt  wrote:
> On Thu, 5 Dec 2013 14:36:58 -0800
> Alexei Starovoitov  wrote:
>
>> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt  wrote:
>> >
>> > I know that it would be great to have the bpf filter run before
>> > recording of the tracepoint, but as that becomes quite awkward for a
>> > user interface, because it requires intimate knowledge of the kernel
>> > source, this speed up on the filter itself may be worth while to have
>> > it happen after the recording of the buffer. When it happens after the
>> > record, then the bpf has direct access to the event entry and its
>> > fields as described by the trace event format files.
>>
>> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
>> the kernel'? By accessing pt_regs structure? Something else ?
>> Can we try fixing the interface first before compromising on performance?
>
> Let me ask you this. If you do not have the source of the kernel on
> hand, can you use BPF to filter the sched_switch tracepoint on prev_pid?
>
> The current filter interface allows you to filter with just what the
> running kernel provides. No need for debug info from the vmlinux or
> anything else.

Understood and agreed. For the users that are satisfied with the amount of
info that a single trace_event provides (like sched_switch), there is
probably little reason to do complex filtering. Either they're fine with
all the events, or they will just filter based on pid.

> I'm fine if it becomes a requirement to have a vmlinux built with
> DEBUG_INFO to use BPF and have a tool like perf to translate the
> filters. But it must not replace what the current filters do now.
> That is, it can be an add on, but not a replacement.

Of course. tracing filters via bpf is an additional tool for kernel debugging.
bpf by itself has use cases beyond tracing.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread H. Peter Anvin
On 12/05/2013 05:20 PM, Andi Kleen wrote:
> "H. Peter Anvin"  writes:
>>
>> Not to mention that in that case we might as well -- since we need a
>> compiler anyway -- generate the machine code in user space; the JIT
>> solution really only is useful if it can provide something that we can't
>> do otherwise, e.g. enable it in secure boot environments.
> 
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.
> 

Yes... but no compiler and secure boot tend to go together, or at least
will in the future.

-hpa



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Andi Kleen
"H. Peter Anvin"  writes:
>
> Not to mention that in that case we might as well -- since we need a
> compiler anyway -- generate the machine code in user space; the JIT
> solution really only is useful if it can provide something that we can't
> do otherwise, e.g. enable it in secure boot environments.

I can see there may be some setups which don't have a compiler
(e.g. I know some people don't use systemtap because of that)
But this needs a custom gcc install too as far as I understand.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread H. Peter Anvin
On 12/05/2013 04:14 PM, Andi Kleen wrote:
> 
> In my experience there are roughly two groups of trace users:
> kernel hackers and users. The kernel hackers want something
> convenient and fast, but for anything complicated or performance
> critical they can always hack the kernel to include custom 
> instrumentation.
> 

Not to mention that in that case we might as well -- since we need a
compiler anyway -- generate the machine code in user space; the JIT
solution really only is useful if it can provide something that we can't
do otherwise, e.g. enable it in secure boot environments.

-hpa



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Andi Kleen
> 1M skb alloc/free 185660 (usecs)
> 
> the difference is bigger now: 484-145 vs 185-145

Thanks for the data. 

This is an obvious improvement, but imho not big enough to be extremely
compelling (< cost 1-2 cache misses, no orders of magnitude improvements
that would justify a lot of code)

One larger problem I have with your patchkit is where exactly it fits
with the user base.

In my experience there are roughly two groups of trace users:
kernel hackers and users. The kernel hackers want something
convenient and fast, but for anything complicated or performance
critical they can always hack the kernel to include custom 
instrumentation.

Your code requires a compiler, so from my perspective it 
wouldn't be a lot easier or faster to use than just changing 
the code directly and recompile.

The users want something simple too that shields them from
having to learn all the internals. They don't want to recompile.
As far as I can tell your code is a bit too low level for that,
and the requirement for the compiler may also scare them.

Where exactly does it fit?

-Andi



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Steven Rostedt
On Thu, 5 Dec 2013 14:36:58 -0800
Alexei Starovoitov  wrote:

> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt  wrote:
> >
> > I know that it would be great to have the bpf filter run before the
> > recording of the tracepoint, but as that becomes quite awkward for a
> > user interface, because it requires intimate knowledge of the kernel
> > source, this speedup of the filter itself may be worthwhile even if it
> > happens after the recording of the buffer. When it happens after the
> > record, the bpf has direct access to the event entry and its
> > fields as described by the trace event format files.
> 
> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
> the kernel'? By accessing pt_regs structure? Something else ?
> Can we try fixing the interface first before compromising on performance?

Let me ask you this. If you do not have the source of the kernel on
hand, can you use BPF to filter the sched_switch tracepoint on prev_pid?

The current filter interface allows you to filter with just what the
running kernel provides. No need for debug info from the vmlinux or
anything else.
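
For reference, that interface is driven entirely from the tracing debugfs,
e.g.:

  echo 'prev_pid == 1234' > /sys/kernel/debug/tracing/events/sched/sched_switch/filter
  echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable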

pt_regs is not that useful without having something to translate what
that means.

I'm fine if it becomes a requirement to have a vmlinux built with
DEBUG_INFO to use BPF and have a tool like perf to translate the
filters. But it that must not replace what the current filters do now.
That is, it can be an add on, but not a replacement.

 -- Steve


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt  wrote:
>
> I know that it would be great to have the bpf filter run before the
> recording of the tracepoint, but as that becomes quite awkward for a
> user interface, because it requires intimate knowledge of the kernel
> source, this speedup of the filter itself may be worthwhile even if it
> happens after the recording of the buffer. When it happens after the
> record, the bpf has direct access to the event entry and its
> fields as described by the trace event format files.

I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
the kernel'? By accessing pt_regs structure? Something else ?
Can we try fixing the interface first before compromising on performance?


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 8:11 AM, Frank Ch. Eigler  wrote:
>
> ast wrote:
>
>>>[...]
>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>> trace skb:kfree_skb {
>> if (arg2 == 0x100) {
>> printf("%x %x\n", arg1, arg2)
>> }
>> }
>> [...]
>
> For reference, you might try putting systemtap into the performance
> comparison matrix too:
>
> # stap -e 'probe kernel.trace("kfree_skb") {
>   if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
>  printf("%x %x\n", $skb, $location)
>   }
>}'

stap with one 'if': 1M skb alloc/free 200696 (usecs)
stap with 10 'if': 1M skb alloc/free 202135 (usecs)

so systemtap entry overhead is a bit higher than bpf and extra if-s
show the same progression as expected.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Frank Ch. Eigler
Andi Kleen  writes:

> [...]  While it sounds interesting, I would strongly advise to make
> this capability only available to root. Traditionally lots of
> complex byte code languages which were designed to be "safe" and
> verifiable weren't really. e.g. i managed to crash things with
> "safe" systemtap multiple times. [...]

Note that systemtap has never been a byte code language, that avenue
being considered lkml-futile at the time, but instead pure C.  Its
safety comes from a mix of compiled-in checks (which you can inspect
via "stap -p3") and script-to-C translation checks (which are
self-explanatory).  Its risks come from bugs in the checks (quite
rare), problems in the runtime library (rare), and problems in
underlying kernel facilities (rare or frequent - consider kprobes).


> So the likelyhood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Very true!


- FChE


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Frank Ch. Eigler

ast wrote:

>>[...]
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> [...]

For reference, you might try putting systemtap into the performance
comparison matrix too:

# stap -e 'probe kernel.trace("kfree_skb") { 
  if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
 printf("%x %x\n", $skb, $location)
  }
   }'


- FChE


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Steven Rostedt
On Thu, 5 Dec 2013 11:41:13 +0100
Ingo Molnar  wrote:

 
> > so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
> > 
> > obviously ktap is an interpreter, so it's not really fair.
> > 
> > To make it really unfair I did:
> > trace skb:kfree_skb {
> > if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 
> > 0x400 ||
> > arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 
> > 0x800 ||
> > arg2 == 0x900 || arg2 == 0x1000) {
> > printf("%x %x\n", arg1, arg2)
> > }
> > }
> > 1M skb alloc/free 484280 (usecs)
> 
> > Real life scripts, for example the ones related to network protocol
> > analysis, will often have such patterns in them, so I don't think this
> > measurement is particularly unfair.

I agree. As the size of the if statement grows, the filter logic gets
linearly more expensive, but the bpf filter does not.

I know that it would be great to have the bpf filter run before the
recording of the tracepoint, but as that becomes quite awkward for a
user interface, because it requires intimate knowledge of the kernel
source, this speedup of the filter itself may be worthwhile even if it
happens after the recording of the buffer. When it happens after the
record, the bpf has direct access to the event entry and its
fields as described by the trace event format files.

-- Steve


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> > On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
> >>
> >> Can you do some performance comparison compared to e.g. ktap?
> >> How much faster is it?
> 
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 350315 (usecs)
> 
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
> 
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 183214 (usecs)
> 
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
> 
> obviously ktap is an interpreter, so it's not really fair.
> 
> To make it really unfair I did:
> trace skb:kfree_skb {
> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 
> ||
> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 
> ||
> arg2 == 0x900 || arg2 == 0x1000) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 484280 (usecs)

Real life scripts, for example the ones related to network protocol
analysis, will often have such patterns in them, so I don't think this
measurement is particularly unfair.

Thanks,

Ingo


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> > I mean more than that, I mean the licensing of BPF filters a user 
> > can find on his own system's kernel should be very clear: by the 
> > act of loading a BPF script into the kernel the user doing the 
> > 'upload' gives permission for it to be redistributed on 
> > kernel-compatible license terms.
> >
> > The easiest way to achieve that is to make sure that all loaded 
> > BPF scripts are 'registered' and are dumpable, viewable and 
> > reusable. That's good for debugging and it's good for 
> > transparency.
> >
> > This means a minimal BPF decoder will have to be in the kernel as 
> > well, but that's OK, we actually have several x86 instruction 
> > decoders in the kernel already, so there's no complexity threshold.
> 
> sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in
> human readable format.
> I'll hook it up to trace_seq, so that "cat
> /sys/kernel/debug/.../filter" will dump it.
> 
> Also I'm thinking to add 'license_string' section to bpf binary format
> and call license_is_gpl_compatible() on it during load.
> If false, then just reject it…. not even messing with taint flags...
> That would be way stronger indication of bpf licensing terms than what
> we have for .ko

But will BPF tools generate such gpl-compatible license tags by 
default? If yes then this might work, combined with the facility 
below. If not then it's just a nuisance to users.

Also, 'tainting' is a non-issue here, as we don't want the kernel to 
load license-incompatible scripts at all. This should be made clear in 
the design of the facility and the tooling itself.

> >> wow. I guess if the whole thing takes off, we would need an 
> >> in-kernel directory to store upstreamed bpf filters as well :)
> >
> > I see no reason why not, but more importantly all currently loaded 
> > BPF scripts should be dumpable, displayable and reusable in a 
> > kernel license compatible fashion.
> 
> ok. will add global bpf list as well (was hesitating to do something 
> like this because of central lock)

A lock + list is no big issue here I think, we do such central lookup 
locks all the time. If it ever becomes measurable it can be made 
scalable via numerous techniques.

> and something in debugfs that dumps bodies of all currently loaded 
> filters.
> 
> Will that solve the concern?

My concern would be solved by adding a facility to always be able to 
dump source code as well, i.e. trivially transform it to C or so, so 
that people can review it - or just edit it on the fly, recompile and 
reinsert? Most BPF scripts ought to be pretty simple.

(For example the most common way to load OpenGL shaders is to load the 
GLSL source code and that source code can be queried after insertion 
as well, so this is not an unusual model for small plugin-alike 
scriptlets.)

Thanks,

Ingo


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Frank Ch. Eigler

ast wrote:

> [...]
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> [...]

For reference, you might try putting systemtap into the performance
comparison matrix too:

# stap -e 'probe kernel.trace(kfree_skb) {
      if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
          printf("%x %x\n", $skb, $location)
      }
  }'


- FChE


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Frank Ch. Eigler
Andi Kleen a...@firstfloor.org writes:

> [...]  While it sounds interesting, I would strongly advise to make
> this capability only available to root. Traditionally lots of
> complex byte code languages which were designed to be "safe" and
> verifiable weren't really. e.g. I managed to crash things with
> "safe" systemtap multiple times. [...]

Note that systemtap has never been a byte code language, that avenue
being considered lkml-futile at the time, but instead pure C.  Its
safety comes from a mix of compiled-in checks (which you can inspect
via stap -p3) and script-to-C translation checks (which are
self-explanatory).  Its risks come from bugs in the checks (quite
rare), problems in the runtime library (rare), and problems in
underlying kernel facilities (rare or frequent - consider kprobes).


> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Very true!


- FChE


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 8:11 AM, Frank Ch. Eigler f...@redhat.com wrote:
>
> ast wrote:
>
> [...]
> > Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> > trace skb:kfree_skb {
> > if (arg2 == 0x100) {
> > printf("%x %x\n", arg1, arg2)
> > }
> > }
> [...]
>
> For reference, you might try putting systemtap into the performance
> comparison matrix too:
>
> # stap -e 'probe kernel.trace(kfree_skb) {
>       if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
>           printf("%x %x\n", $skb, $location)
>       }
>   }'

stap with one 'if': 1M skb alloc/free 200696 (usecs)
stap with 10 'if': 1M skb alloc/free 202135 (usecs)

so systemtap entry overhead is a bit higher than bpf and extra if-s
show the same progression as expected.
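
For scale, subtracting the 145400 usec baseline and dividing by the 1M
events gives rough per-event filter overheads from the numbers in this
thread:

  one 'if':  bpf ~38 ns (183214-145400), stap ~55 ns (200696-145400),
             ktap ~205 ns (350315-145400)
  ten 'if's: bpf ~40 ns (185660-145400), stap ~57 ns (202135-145400),
             ktap ~339 ns (484280-145400)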


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote:
>
> I know that it would be great to have the bpf filter run before
> recording of the tracepoint, but as that becomes quite awkward for a
> user interface, because it requires intimate knowledge of the kernel
> source, this speed-up on the filter itself may be worthwhile to have
> it happen after the recording of the buffer. When it happens after the
> record, then the bpf has direct access to the event entry and its
> fields as described by the trace event format files.

I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
the kernel'? By accessing the pt_regs structure? Something else?
Can we try fixing the interface first before compromising on performance?


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Steven Rostedt
On Thu, 5 Dec 2013 14:36:58 -0800
Alexei Starovoitov a...@plumgrid.com wrote:

> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote:
> >
> > I know that it would be great to have the bpf filter run before
> > recording of the tracepoint, but as that becomes quite awkward for a
> > user interface, because it requires intimate knowledge of the kernel
> > source, this speed-up on the filter itself may be worthwhile to have
> > it happen after the recording of the buffer. When it happens after the
> > record, then the bpf has direct access to the event entry and its
> > fields as described by the trace event format files.
>
> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
> the kernel'? By accessing the pt_regs structure? Something else?
> Can we try fixing the interface first before compromising on performance?

Let me ask you this. If you do not have the source of the kernel on
hand, can you use BPF to filter the sched_switch tracepoint on prev pid?

The current filter interface allows you to filter with just what the
running kernel provides. No need for debug info from the vmlinux or
anything else.

pt_regs is not that useful without having something to translate what
that means.
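
For reference, the translation in question is just the x86-64 SysV calling
convention at a kprobe on function entry; a sketch of what a filter author
has to know today (register names follow the RFC's examples, the rest is
illustrative):

void filter(struct bpf_context *ctx)
{
	long arg1 = ctx->regs.di;	/* 1st argument of the probed function */
	long arg2 = ctx->regs.si;	/* 2nd */
	long arg3 = ctx->regs.dx;	/* 3rd; the 7th and up live on the stack */
	char fmt[] = "args %lx %lx %lx\n";

	bpf_trace_printk(fmt, sizeof(fmt), arg1, arg2, arg3);
}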

I'm fine if it becomes a requirement to have a vmlinux built with
DEBUG_INFO to use BPF and have a tool like perf to translate the
filters. But that must not replace what the current filters do now.
That is, it can be an add on, but not a replacement.

 -- Steve


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Andi Kleen
> 1M skb alloc/free 185660 (usecs)
>
> the difference is bigger now: 484-145 vs 185-145

Thanks for the data. 

This is an obvious improvement, but imho not big enough to be extremely
compelling (< cost of 1-2 cache misses, no orders of magnitude improvements
that would justify a lot of code)

One larger problem I have with your patchkit is where exactly it fits
with the user base.

In my experience there are roughly two groups of trace users:
kernel hackers and users. The kernel hackers want something
convenient and fast, but for anything complicated or performance
critical they can always hack the kernel to include custom 
instrumentation.

Your code requires a compiler, so from my perspective it
wouldn't be a lot easier or faster to use than just changing
the code directly and recompiling.

The users want something simple too that shields them from
having to learn all the internals. They don't want to recompile.
As far as I can tell your code is a bit too low level for that,
and the requirement for the compiler may also scare them.

Where exactly does it fit?

-Andi



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread H. Peter Anvin
On 12/05/2013 04:14 PM, Andi Kleen wrote:
>
> In my experience there are roughly two groups of trace users:
> kernel hackers and users. The kernel hackers want something
> convenient and fast, but for anything complicated or performance
> critical they can always hack the kernel to include custom
> instrumentation.
>

Not to mention that in that case we might as well -- since we need a
compiler anyway -- generate the machine code in user space; the JIT
solution really only is useful if it can provide something that we can't
do otherwise, e.g. enable it in secure boot environments.

-hpa



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Andi Kleen
H. Peter Anvin h...@zytor.com writes:

> Not to mention that in that case we might as well -- since we need a
> compiler anyway -- generate the machine code in user space; the JIT
> solution really only is useful if it can provide something that we can't
> do otherwise, e.g. enable it in secure boot environments.

I can see there may be some setups which don't have a compiler
(e.g. I know some people don't use systemtap because of that)
But this needs a custom gcc install too as far as I understand.

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread H. Peter Anvin
On 12/05/2013 05:20 PM, Andi Kleen wrote:
> H. Peter Anvin h...@zytor.com writes:
>
> > Not to mention that in that case we might as well -- since we need a
> > compiler anyway -- generate the machine code in user space; the JIT
> > solution really only is useful if it can provide something that we can't
> > do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.
>

Yes... but no compiler and secure boot tend to go together, or at least
will in the future.

-hpa



Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 3:37 PM, Steven Rostedt rost...@goodmis.org wrote:
> On Thu, 5 Dec 2013 14:36:58 -0800
> Alexei Starovoitov a...@plumgrid.com wrote:
>
> > On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote:
> > >
> > > I know that it would be great to have the bpf filter run before
> > > recording of the tracepoint, but as that becomes quite awkward for a
> > > user interface, because it requires intimate knowledge of the kernel
> > > source, this speed-up on the filter itself may be worthwhile to have
> > > it happen after the recording of the buffer. When it happens after the
> > > record, then the bpf has direct access to the event entry and its
> > > fields as described by the trace event format files.
> >
> > I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
> > the kernel'? By accessing the pt_regs structure? Something else?
> > Can we try fixing the interface first before compromising on performance?
>
> Let me ask you this. If you do not have the source of the kernel on
> hand, can you use BPF to filter the sched_switch tracepoint on prev pid?
>
> The current filter interface allows you to filter with just what the
> running kernel provides. No need for debug info from the vmlinux or
> anything else.

Understood and agreed. For users that are satisfied with the amount of info
that a single trace_event provides (like sched_switch) there is probably
little reason to do complex filtering. Either they're fine with all the
events or they will just filter based on pid only.

> I'm fine if it becomes a requirement to have a vmlinux built with
> DEBUG_INFO to use BPF and have a tool like perf to translate the
> filters. But that must not replace what the current filters do now.
> That is, it can be an add on, but not a replacement.

Of course. tracing filters via bpf is an additional tool for kernel debugging.
bpf by itself has use cases beyond tracing.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen a...@firstfloor.org wrote:
> > the difference is bigger now: 484-145 vs 185-145
>
> This is an obvious improvement, but imho not big enough to be extremely
> compelling (< cost of 1-2 cache misses, no orders of magnitude improvements
> that would justify a lot of code)

hmm. we're comparing against ktap here…
which has 5x more kernel code and is 8x slower in this test...

> Your code requires a compiler, so from my perspective it
> wouldn't be a lot easier or faster to use than just changing
> the code directly and recompiling.
>
> The users want something simple too that shields them from
> having to learn all the internals. They don't want to recompile.
> As far as I can tell your code is a bit too low level for that,
> and the requirement for the compiler may also scare them.
>
> Where exactly does it fit?

the goal is to have the llvm compiler next to perf, wrapped in a user
friendly way.

compiling a small filter vs recompiling the full kernel…
inserting into a live kernel vs rebooting…
not sure how you're saying it's equivalent.

In my kernel debugging experience the current tools (tracing, systemtap)
were rarely enough.
I always had to add my own printks through the code, recompile and reboot.
Often just to see that it's not the place where I want to print things,
or it's too verbose.
Then I would adjust printks, recompile and reboot again.
That was slow and tedious, since I would be crashing things from time to
time just because skb doesn't always have a valid dev or I made a typo.
For debugging I really do need something quick and dirty that lets me add
my own printk of whatever structs I want anywhere in the kernel without
crashing it.
That's exactly what bpf tracing filters do.
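
As a sketch of such a quick-and-dirty filter (bpf_load_pointer() here
stands in for the RFC's checked-load helpers, the bpf_load_xx family
mentioned elsewhere in the thread; treat the name as illustrative):

void filter(struct bpf_context *ctx)
{
	struct sk_buff *skb = (struct sk_buff *)ctx->regs.di;
	/* checked load: returns NULL instead of faulting on a bad pointer,
	 * which is what makes this printk-style debugging crash-proof */
	struct net_device *dev =
		(struct net_device *)bpf_load_pointer(&skb->dev);

	if (dev) {
		char fmt[] = "skb %p dev %p\n";
		bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)dev, 0);
	}
}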


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
Hi Alexei,

On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov a...@plumgrid.com wrote:
> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote:
> >
> > Can you do some performance comparison compared to e.g. ktap?
> > How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
> arg2 == 0x900 || arg2 == 0x1000) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 484280 (usecs)
>
> and corresponding bpf:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
> loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
> loc == 0x900 || loc == 0x1000) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 185660 (usecs)
>
> the difference is bigger now: 484-145 vs 185-145

There is a big difference between fetching arg2 (in ktap) and direct
register access (ctx->regs.dx).

The current argument fetching (arg2 in the above testcase) implementation
in ktap is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg.
The only way to speed it up is a kernel tracing code change, letting an
external tracing module access event fields not through a list lookup.
This work is not started yet. :)

Of course, I'm not saying this argument fetching issue is the performance
root cause compared with bpf and Systemtap; bytecode execution speed
wouldn't compare with raw machine code anyway.
(There is a plan to use a JIT in the ktap core, like the luajit project,
but it needs some time to work on.)


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Alexei Starovoitov
On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar mi...@kernel.org wrote:
>
> > Also I'm thinking to add 'license_string' section to bpf binary format
> > and call license_is_gpl_compatible() on it during load.
> > If false, then just reject it…. not even messing with taint flags...
> > That would be way stronger indication of bpf licensing terms than what
> > we have for .ko
>
> But will BPF tools generate such gpl-compatible license tags by
> default? If yes then this might work, combined with the facility
> below. If not then it's just a nuisance to users.

yes. similar to existing .ko module_license() tag. see below.

> My concern would be solved by adding a facility to always be able to
> dump source code as well, i.e. trivially transform it to C or so, so
> that people can review it - or just edit it on the fly, recompile and
> reinsert? Most BPF scripts ought to be pretty simple.

C code has '#include's in it, so without storing fully preprocessed code
it will not be equivalent, and the true source would be gigantic. It could
be zipped, but that sounds like overkill. Also we might want other
languages with their own dependent includes.
Sure, we can have a section in the bpf binary that holds the source, but
it's not enforceable: the kernel cannot know that it's the actual source.
gcc/llvm will produce different bpf code out of the same source, the
source may be in C or in language X, etc. Doesn't seem that including
some form of source will help with enforcing the license.

imo requiring a module_license("gpl"); line in C code, and an equivalent
string in all other languages that want to translate to bpf, would be a
stronger indication of licensing terms.
Then the compiler would have to include that string into the
'license_string' section and the kernel can actually enforce it.
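
A sketch of the load-time check being proposed (license_is_gpl_compatible()
exists in the kernel today; the bpf_image structure and the section-lookup
helper are made-up names for illustration):

#include <linux/license.h>

static int bpf_check_license(struct bpf_image *image)
{
	/* hypothetical: find the 'license_string' section in the image */
	const char *lic = bpf_image_find_section(image, "license_string");

	if (!lic || !license_is_gpl_compatible(lic))
		return -EINVAL;	/* reject outright -- no taint-flag games */
	return 0;
}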


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
On Fri, Dec 6, 2013 at 9:20 AM, Andi Kleen a...@firstfloor.org wrote:
> H. Peter Anvin h...@zytor.com writes:
>
> > Not to mention that in that case we might as well -- since we need a
> > compiler anyway -- generate the machine code in user space; the JIT
> > solution really only is useful if it can provide something that we can't
> > do otherwise, e.g. enable it in secure boot environments.
>
> I can see there may be some setups which don't have a compiler
> (e.g. I know some people don't use systemtap because of that)
> But this needs a custom gcc install too as far as I understand.

If it depends on gcc, then it looks like Systemtap. That is a big
inconvenience for embedded environments and many production systems,
where installing gcc is not an option.
(Not sure if it needs a kernel compilation environment as well.)

It seems the event filter is bound to a specific event, so it's not possible
to trace many events in a cooperating style; look at the Systemtap and ktap
samples, where many event handlers need to cooperate. The simplest
example is recording syscall execution time (duration of exit - entry).
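
A sketch of what such cooperation could look like if bpf filters grew
tables (every helper name below -- bpf_get_current_pid, bpf_ktime_ns,
bpf_table_update, bpf_table_lookup -- is hypothetical):

struct start_key { int pid; };
struct start_val { long ts_ns; };

void filter_sys_enter(struct bpf_context *ctx)
{
	struct start_key key = { .pid = bpf_get_current_pid() };
	struct start_val val = { .ts_ns = bpf_ktime_ns() };

	bpf_table_update(1 /* table id */, &key, &val);
}

void filter_sys_exit(struct bpf_context *ctx)
{
	struct start_key key = { .pid = bpf_get_current_pid() };
	struct start_val *val = bpf_table_lookup(1, &key);

	if (val) {
		char fmt[] = "pid %d took %ld ns\n";
		bpf_trace_printk(fmt, sizeof(fmt), key.pid,
				 bpf_ktime_ns() - val->ts_ns, 0);
	}
}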

If this design is intentional, then I would think it targets speeding up
the current kernel tracing filter (but needs an extra userspace filter
compiler).

And I guess the bpf filter still needs to keep userspace tracing in mind :),
if it wants to be a complete and integrated tracing solution.
(Use a separate userspace compiler or translator to resolve symbols.)

Thanks

Jovi


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-05 Thread Jovi Zhangwei
On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov a...@plumgrid.com wrote:
> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote:
> >
> > Can you do some performance comparison compared to e.g. ktap?
> > How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
> if (arg2 == 0x100) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
> void filter(struct bpf_context *ctx)
> {
> void *loc = (void *)ctx->regs.dx;
> if (loc == 0x100) {
> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
> char fmt[] = "skb %p loc %p\n";
> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
> }
> }
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
> trace skb:kfree_skb {
> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
> arg2 == 0x900 || arg2 == 0x1000) {
> printf("%x %x\n", arg1, arg2)
> }
> }
> 1M skb alloc/free 484280 (usecs)

I've lost my mind for a while. :)

If bpf only focuses on filtering, then it's not fair to compare with ktap
like that, since ktap can easily make use of the current kernel filter;
you should use the script below:

trace skb:kfree_skb /location == 0x100 || location == 0x200 || .../ {
printf("%x %x\n", arg1, arg2)
}

As ktap is a user of the current simple kernel tracing filter, I fully
agree with Steven: it can be an add on, but not a replacement.


Thanks,

Jovi


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-04 Thread Alexei Starovoitov
> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
>>
>> Can you do some performance comparison compared to e.g. ktap?
>> How much faster is it?

Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
trace skb:kfree_skb {
if (arg2 == 0x100) {
printf("%x %x\n", arg1, arg2)
}
}
1M skb alloc/free 350315 (usecs)

baseline without any tracing:
1M skb alloc/free 145400 (usecs)

then equivalent bpf test:
void filter(struct bpf_context *ctx)
{
void *loc = (void *)ctx->regs.dx;
if (loc == 0x100) {
struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
char fmt[] = "skb %p loc %p\n";
bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
}
}
1M skb alloc/free 183214 (usecs)

so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145

obviously ktap is an interpreter, so it's not really fair.

To make it really unfair I did:
trace skb:kfree_skb {
if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
arg2 == 0x900 || arg2 == 0x1000) {
printf("%x %x\n", arg1, arg2)
}
}
1M skb alloc/free 484280 (usecs)

and corresponding bpf:
void filter(struct bpf_context *ctx)
{
void *loc = (void *)ctx->regs.dx;
if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 ||
loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 ||
loc == 0x900 || loc == 0x1000) {
struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
char fmt[] = "skb %p loc %p\n";
bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
}
}
1M skb alloc/free 185660 (usecs)

the difference is bigger now: 484-145 vs 185-145

9 extra 'if' conditions for bpf is almost nothing, since they
translate into 18 new x86 instructions after JITing, but for
interpreter it's obviously costly.

Why 0x100 instead of 0x1? To make sure that the compiler doesn't optimize
them into < > range checks.
Otherwise it's really, really unfair.

ktap is a nice tool. Great job Jovi!
I noticed that it doesn't always clear created kprobes after run and I
see a bunch of .../tracing/events/ktap_kprobes_xxx, but that's a minor
thing.

Thanks
Alexei


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-04 Thread Alexei Starovoitov
On Wed, Dec 4, 2013 at 1:34 AM, Ingo Molnar  wrote:
>
> * Alexei Starovoitov  wrote:
>
>> On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar  wrote:
>> >
>> > Very cool! (Added various other folks who might be interested in
>> > this to the Cc: list.)
>> >
>> > I have one generic concern:
>> >
>> > It would be important to make it easy to extract loaded BPF code
>> > from the kernel in source code equivalent form, which compiles to
>> > the same BPF code.
>> >
>> > I.e. I think it would be fundamentally important to make sure that
>> > this is all within the kernel's license domain, to make it very
>> > clear there can be no 'binary only' BPF scripts.
>> >
>> > By up-loading BPF into a kernel the person loading it agrees to
>> > make that code available to all users of that system who can
>> > access it, under the same license as the kernel's code (or under a
>> > more permissive license).
>> >
>> > The last thing we want is people getting funny ideas and writing
>> > drivers in BPF and hiding the code or making license claims over
>> > it
>>
>> all makes sense. In case of kernel modules all export_symbols are
>> accessible and module has to have kernel compatible license. Same
>> licensing terms apply to anything else that interacts with kernel
>> functions. In case of BPF the list of accessible functions is tiny,
>> so it's much easier to enforce specific limited use case. For
>> tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even
>> if someone has funny ideas they cannot be brought to life, since
>> drivers need a lot more than this set of functions and BPF checker
>> will reject any attempts to call something outside of this tiny
>> list. imo the same applies to existing BPF as well. Meaning that
>> tcpdump filter string and seccomp filters, if distributed, has to
>> have their source code available.
>
> I mean more than that, I mean the licensing of BPF filters a user can
> find on his own system's kernel should be very clear: by the act of
> loading a BPF script into the kernel the user doing the 'upload' gives
> permission for it to be redistributed on kernel-compatible license
> terms.
>
> The easiest way to achieve that is to make sure that all loaded BPF
> scripts are 'registered' and are dumpable, viewable and reusable.
> That's good for debugging and it's good for transparency.
>
> This means a minimal BPF decoder will have to be in the kernel as
> well, but that's OK, we actually have several x86 instruction decoders
> in the kernel already, so there's no complexity threshold.

sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in
human readable format.
I'll hook it up to trace_seq, so that "cat
/sys/kernel/debug/.../filter" will dump it.

Also I'm thinking to add 'license_string' section to bpf binary format
and call license_is_gpl_compatible() on it during load.
If false, then just reject it…. not even messing with taint flags...
That would be way stronger indication of bpf licensing terms than what
we have for .ko

>> wow. I guess if the whole thing takes off, we would need an
>> in-kernel directory to store upstreamed bpf filters as well :)
>
> I see no reason why not, but more importantly all currently loaded BPF
> scripts should be dumpable, displayable and reusable in a kernel
> license compatible fashion.

ok. will add global bpf list as well (was hesitating to do something
like this because of central lock)
and something in debugfs that dumps bodies of all currently loaded filters.

Will that solve the concern?

Thanks
Alexei


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-04 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar  wrote:
> >
> > Very cool! (Added various other folks who might be interested in 
> > this to the Cc: list.)
> >
> > I have one generic concern:
> >
> > It would be important to make it easy to extract loaded BPF code 
> > from the kernel in source code equivalent form, which compiles to 
> > the same BPF code.
> >
> > I.e. I think it would be fundamentally important to make sure that 
> > this is all within the kernel's license domain, to make it very 
> > clear there can be no 'binary only' BPF scripts.
> >
> > By up-loading BPF into a kernel the person loading it agrees to 
> > make that code available to all users of that system who can 
> > access it, under the same license as the kernel's code (or under a 
> > more permissive license).
> >
> > The last thing we want is people getting funny ideas and writing 
> > drivers in BPF and hiding the code or making license claims over 
> > it
> 
> all makes sense. In case of kernel modules all export_symbols are 
> accessible and module has to have kernel compatible license. Same 
> licensing terms apply to anything else that interacts with kernel 
> functions. In case of BPF the list of accessible functions is tiny, 
> so it's much easier to enforce specific limited use case. For 
> tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even 
> if someone has funny ideas they cannot be brought to life, since 
> drivers need a lot more than this set of functions and BPF checker 
> will reject any attempts to call something outside of this tiny 
> list. imo the same applies to existing BPF as well. Meaning that 
> tcpdump filter string and seccomp filters, if distributed, has to 
> have their source code available.

I mean more than that, I mean the licensing of BPF filters a user can 
find on his own system's kernel should be very clear: by the act of 
loading a BPF script into the kernel the user doing the 'upload' gives 
permission for it to be redistributed on kernel-compatible license 
terms.

The easiest way to achieve that is to make sure that all loaded BPF 
scripts are 'registered' and are dumpable, viewable and reusable. 
That's good for debugging and it's good for transparency.

This means a minimal BPF decoder will have to be in the kernel as 
well, but that's OK, we actually have several x86 instruction decoders 
in the kernel already, so there's no complexity threshold.
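
For scale, a minimal dump loop over classic BPF's fixed-size insns is only
a few lines (a sketch; the RFC's extended insn format would decode
similarly):

#include <linux/filter.h>
#include <linux/printk.h>

/* print each classic BPF insn in a raw, human-readable form */
static void dump_bpf(const struct sock_filter *insns, int len)
{
	int i;

	for (i = 0; i < len; i++)
		pr_info("%d: code=0x%04x jt=%u jf=%u k=0x%08x\n",
			i, insns[i].code, insns[i].jt, insns[i].jf,
			insns[i].k);
}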

> > I.e. we want to allow flexible plugins technologically, but make 
> > sure people who run into such a plugin can modify and improve it 
> > under the same license as they can modify and improve the kernel 
> > itself!
> 
> wow. I guess if the whole thing takes off, we would need an 
> in-kernel directory to store upstreamed bpf filters as well :)

I see no reason why not, but more importantly all currently loaded BPF 
scripts should be dumpable, displayable and reusable in a kernel 
license compatible fashion.

Thanks,

Ingo


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Alexei Starovoitov
On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen  wrote:
> Alexei Starovoitov  writes:
>
> Can you do some performance comparison compared to e.g. ktap?
> How much faster is it?

imo the most interesting ktap scripts (like kmalloc-top.kp) need tables
and timers. Tables are almost ready for prime time, but timers I prefer
to keep out of the kernel.
I would like the bpf filter to fill tables with interesting data in the
kernel, up to a predefined limit, and to periodically read and clear the
tables from userspace. This way I will be able to do nettop.stp,
iotop.stp like programs. So I'm still thinking about what a clean
kernel/user interface for bpf-defined tables should be.
The format of keys and elements of a table is defined within the bpf
program. During load of the bpf program, the tables are allocated, and
the bpf program can then lookup/update entries in them. At the same time
the corresponding userspace program can read the tables of this
particular bpf program over netlink.
Creating its own debugfs file for every filter feels too slow and
feature limited, since files are an all-or-nothing interface. Netlink
access to bpf tables feels cleaner. Userspace will use libmnl to
access them. Other ideas?
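
A sketch of the kind of bpf-defined table meant here (the layout and the
bpf_table_lookup/bpf_table_update helpers are hypothetical):

struct key  { long location; };
struct elem { long hits; };

void filter(struct bpf_context *ctx)
{
	struct key k = { .location = (long)ctx->regs.dx };
	struct elem *e = bpf_table_lookup(1 /* table id */, &k);

	if (e) {
		e->hits++;			/* aggregate in the kernel ... */
	} else {
		struct elem first = { .hits = 1 };
		bpf_table_update(1, &k, &first);  /* ... up to the preset limit */
	}
}

Userspace would then read and clear table 1 periodically over netlink to
get a nettop/iotop-style view.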

In the meantime I'll do some simple
  trace probe:xx { print }
performance test…

> While it sounds interesting, I would strongly advise to make this
> capability only available to root. Traditionally lots of complex byte
> code languages which were designed to be "safe" and verifiable weren't
> really. e.g. I managed to crash things with "safe" systemtap multiple
> times. And we all know what happened to Java.
>
> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Tracing filters are for root only today and should stay this way.
As far as safety of bpf… hard to argue with the systemtap point ;)
Though existing bpf is generally accepted to be safe,
extended bpf needs time to prove itself.


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Masami Hiramatsu
(2013/12/04 3:26), Alexei Starovoitov wrote:
> On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt  wrote:
>> On Tue, 3 Dec 2013 10:16:55 +0100
>> Ingo Molnar  wrote:
>>
>>
>>> So, to do the math:
>>>
>>>    tracing 'all' overhead:               95 nsecs per event
>>>    tracing 'eth5 + old filter' overhead: 157 nsecs per event
>>>    tracing 'eth5 + BPF filter' overhead:  54 nsecs per event
>>>
>>> So via BPF and a fairly trivial filter, we are able to reduce tracing
>>> overhead for real - while old-style filters actually increase it.
>>
>> Yep, seems that BPF can do what I wasn't able to do with the normal
>> filters. Although, I haven't looked at the code yet, I'm assuming that
>> the BPF works on the parameters passed into the trace event. The normal
>> filters can only process the results of the trace (what's being
>> recorded) not the parameters of the trace event itself. To get what's
>> recorded, we need to write to the buffer first, and then we decided if
>> we want to keep the event or not and discard the event from the buffer
>> if we do not.
>>
>> That method does not reduce overhead at all, and only adds to it, as
>> Alexei's tests have shown. The purpose of the filter was not to reduce
>> overhead, but to reduce filling the buffer with needless data.
> 
> Precisely.
> Assumption is that filters will filter out majority of the events.
> So filter takes pt_regs as input, has to interpret them and call
> bpf_trace_printk
> if it really wants to store something for the human to see.
> We can extend bpf trace filters to return true/false to indicate
> whether TP_printk-format
> specified as part of the event should be printed as well, but imo
> that's unnecessary.
> When I was using bpf filters to debug networking bits I didn't need
> that printk format of the event. I only used event as an entry point,
> filtering out things and printing different fields vs initial event.
> More like what developers do when they sprinkle
> trace_printk/dump_stack through the code while debugging.
> 
> the only inconvenience so far is to know how parameters are getting
> into registers.
> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
> after first step is done.

Actually, that part is done by the perf-probe and ftrace dynamic events
(kernel/trace/trace_probe.c). I think this generic BPF is good for
re-implementing fetch methods. :)

Thank you,

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com




Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Andi Kleen
Alexei Starovoitov  writes:

Can you do some performance comparison compared to e.g. ktap?
How much faster is it?

While it sounds interesting, I would strongly advise to make this
capability only available to root. Traditionally lots of complex byte
code languages which were designed to be "safe" and verifiable weren't
really. e.g. I managed to crash things with "safe" systemtap multiple
times. And we all know what happened to Java.

So the likelihood of this having some hole somewhere (either in
the byte code or in some library function) is high.
 
-Andi 


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Alexei Starovoitov
On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt  wrote:
> On Tue, 3 Dec 2013 10:16:55 +0100
> Ingo Molnar  wrote:
>
>
>> So, to do the math:
>>
>>tracing   'all' overhead:   95 nsecs per event
>>tracing 'eth5 + old filter' overhead:  157 nsecs per event
>>tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
>>
>> So via BPF and a fairly trivial filter, we are able to reduce tracing
>> overhead for real - while old-style filters increase it.
>
> Yep, seems that BPF can do what I wasn't able to do with the normal
> filters. Although, I haven't looked at the code yet, I'm assuming that
> the BPF works on the parameters passed into the trace event. The normal
> filters can only process the results of the trace (what's being
> recorded) not the parameters of the trace event itself. To get what's
> recorded, we need to write to the buffer first, and then we decide if
> we want to keep the event or not and discard the event from the buffer
> if we do not.
>
> That method does not reduce overhead at all, and only adds to it, as
> Alexei's tests have shown. The purpose of the filter was not to reduce
> overhead, but to reduce filling the buffer with needless data.

Precisely.
Assumption is that filters will filter out the majority of the events.
So filter takes pt_regs as input, has to interpret them and call
bpf_trace_printk
if it really wants to store something for the human to see.
We can extend bpf trace filters to return true/false to indicate
whether the TP_printk format
specified as part of the event should be printed as well, but imo
that's unnecessary.
When I was using bpf filters to debug networking bits I didn't need
that printk format of the event. I only used event as an entry point,
filtering out things and printing different fields vs initial event.
More like what developers do when they sprinkle
trace_printk/dump_stack through the code while debugging.

the only inconvenience so far is to know how parameters are getting
into registers.
on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
after first step is done.
In the proposed patches bpf_context == pt_regs at the event entry point.
Would be cleaner to have struct {arg1,arg2,…} as bpf_context instead.
But that needed more code and I wanted to keep the first patch to the
minimum.
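
For illustration, a minimal sketch of what such a register-independent
context could look like (the field names are hypothetical, not from
these patches):

struct bpf_context {
        /* probed function's arguments, fetched by the kernel from the
         * architecture's calling convention before the filter runs */
        unsigned long arg1;     /* x86-64: rdi */
        unsigned long arg2;     /* x86-64: rsi */
        unsigned long arg3;     /* x86-64: rdx */
        unsigned long arg4;     /* x86-64: rcx */
        unsigned long arg5;     /* x86-64: r8 */
        unsigned long arg6;     /* x86-64: r9 */
};

A filter would then do dev = (struct net_device *)ctx->arg2; instead of
reading ctx->regs.si directly.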
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Alexei Starovoitov
On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar  wrote:
>
> Very cool! (Added various other folks who might be interested in this
> to the Cc: list.)
>
> I have one generic concern:
>
> It would be important to make it easy to extract loaded BPF code from
> the kernel in source code equivalent form, which compiles to the same
> BPF code.
>
> I.e. I think it would be fundamentally important to make sure that
> this is all within the kernel's license domain, to make it very clear
> there can be no 'binary only' BPF scripts.
>
> By up-loading BPF into a kernel the person loading it agrees to make
> that code available to all users of that system who can access it,
> under the same license as the kernel's code (or under a more
> permissive license).
>
> The last thing we want is people getting funny ideas and writing
> drivers in BPF and hiding the code or making license claims over it

all makes sense.
In case of kernel modules all exported symbols are accessible and a
module has to have a kernel-compatible license. The same licensing terms
apply to anything else that interacts with kernel functions.
In case of BPF the list of accessible functions is tiny, so it's much
easier to enforce a specific limited use case.
For tracing filters it's just bpf_load_xx/trace_printk/dump_stack.
Even if someone has funny ideas they cannot be brought to life, since
drivers need a lot more than this set of functions and the BPF checker
will reject any attempts to call something outside of this tiny list.
imo the same applies to existing BPF as well. Meaning that tcpdump
filter strings and seccomp filters, if distributed, have to have their
source code available.
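
As a rough sketch of how cheap that enforcement can be (hypothetical
code, the FUNC_* ids are invented for illustration; this is not the
checker from the patches):

static bool check_call(int func_id)
{
        switch (func_id) {
        case FUNC_bpf_load_byte:
        case FUNC_bpf_load_half:
        case FUNC_bpf_load_word:
        case FUNC_bpf_memcmp:
        case FUNC_bpf_trace_printk:
        case FUNC_bpf_dump_stack:
                return true;    /* whitelisted for tracing filters */
        default:
                return false;   /* reject the whole program at load time */
        }
}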

> I.e. we want to allow flexible plugins technologically, but make sure
> people who run into such a plugin can modify and improve it under the
> same license as they can modify and improve the kernel itself!

wow. I guess if the whole thing takes off, we would need an in-kernel
directory to store upstreamed bpf filters as well :)

>> opcode encoding is the same between old BPF and extended BPF.
>> Original BPF has two 32-bit registers.
>> Extended BPF has ten 64-bit registers.
>> That is the main difference.
>>
>> Old BPF was using jt/jf fields for jump-insn only.
>> New BPF combines them into generic 'off' field for jump and non-jump insns.
>> k==imm field has the same meaning.
>
> This only affects the internal JIT representation, not the BPF byte
> code, right?

that is the ebpf vs bpf code difference. The JIT doesn't keep another
representation; it just converts it to x86.

>>  32 files changed, 3332 insertions(+), 24 deletions(-)
>
> Impressive!
>
> I'm wondering, will the new nftable code in works make use of the BPF
> JIT as well, or is that a separate implementation?

nft is a much higher-level state machine customized for the nftable use case.
imo iptables/nftable rules can be compiled into extended bpf.
One needs to define bpf_context and a set of functions to do packet
lookup via bpf_callbacks...
but let's do it one step at a time.

Thanks
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Steven Rostedt
On Tue, 3 Dec 2013 10:16:55 +0100
Ingo Molnar  wrote:

 
> So, to do the math:
> 
>tracing   'all' overhead:   95 nsecs per event
>tracing 'eth5 + old filter' overhead:  157 nsecs per event
>tracing 'eth5 + BPF filter' overhead:   54 nsecs per event
> 
> So via BPF and a fairly trivial filter, we are able to reduce tracing 
> overhead for real - while old-style filters increase it.

Yep, seems that BPF can do what I wasn't able to do with the normal
filters. Although, I haven't looked at the code yet, I'm assuming that
the BPF works on the parameters passed into the trace event. The normal
filters can only process the results of the trace (what's being
recorded) not the parameters of the trace event itself. To get what's
recorded, we need to write to the buffer first, and then we decide if
we want to keep the event or not and discard the event from the buffer
if we do not.

That method does not reduce overhead at all, and only adds to it, as
Alexei's tests have shown. The purpose of the filter was not to reduce
overhead, but to reduce filling the buffer with needless data.
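
In pseudo-code the old-style flow looks roughly like this (a sketch of
the idea, not the actual ring-buffer code; fill_event() is a
placeholder):

event = ring_buffer_lock_reserve(buffer, event_len);
entry = ring_buffer_event_data(event);
fill_event(entry, args);                        /* record first... */
if (filter_match_preds(filter, entry))
        ring_buffer_unlock_commit(buffer, event);
else                                            /* ...cost already paid */
        ring_buffer_discard_commit(buffer, event);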

It looks as if the BPF filter works on the parameters of the trace
event and not what is written to the buffers (as they can be
different). I've been looking for a way to do just that, and if this
does accomplish it, I'll be very happy :-)

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Masami Hiramatsu
(2013/12/03 13:28), Alexei Starovoitov wrote:
> Hi All,
> 
> the following set of patches adds BPF support to trace filters.
> 
> Trace filters can be written in C and allow safe read-only access to any
> kernel data structure. Like systemtap but with safety guaranteed by kernel.
> 
> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.

Oh, thank you for this great work! :D

> 
> The filter program may look like:
> void filter(struct bpf_context *ctx)
> {
> char devname[4] = "eth5";
> struct net_device *dev;
> struct sk_buff *skb = 0;
> 
> dev = (struct net_device *)ctx->regs.si;
> if (bpf_memcmp(dev->name, devname, 4) == 0) {
> char fmt[] = "skb %p dev %p eth5\n";
> bpf_trace_printk(fmt, skb, dev, 0, 0);
> }
> }
> 
> The kernel will do static analysis of bpf program to make sure that it cannot
> crash the kernel (doesn't have loops, valid memory/register accesses, etc).
> Then kernel will map bpf instructions to x86 instructions and let it
> run in the place of trace filter.
> 
> To demonstrate performance I did a synthetic test:
> dev = init_net.loopback_dev;
> do_gettimeofday(&start_tv);
> for (i = 0; i < 1000000; i++) {
> struct sk_buff *skb;
> skb = netdev_alloc_skb(dev, 128);
> kfree_skb(skb);
> }
> do_gettimeofday(&end_tv);
> time = end_tv.tv_sec - start_tv.tv_sec;
> time *= USEC_PER_SEC;
> time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
> 
> printk("1M skb alloc/free %lld (usecs)\n", time);
> 
> no tracing
> [   33.450966] 1M skb alloc/free 145179 (usecs)
> 
> echo 1 > enable
> [   97.186379] 1M skb alloc/free 240419 (usecs)
> (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)
> 
> echo 'name==eth5' > filter
> [  139.644161] 1M skb alloc/free 302552 (usecs)
> (running filter_match_preds() for every skb and discarding
> event_buffer is even slower)
> 
> cat bpf_prog > filter
> [  171.150566] 1M skb alloc/free 199463 (usecs)
> (JITed bpf program is safely checking dev->name == eth5 and discarding)
> 
> echo 0 > enable
> [  258.073593] 1M skb alloc/free 144919 (usecs)
> (tracing is disabled, performance is back to original)
> 
> The C program compiled into BPF and then JITed into x86 is faster than
> filter_match_preds() approach (199-145 msec vs 302-145 msec)

Great! :)

> tracing+bpf is a tool for safe read-only access to variables without 
> recompiling
> the kernel and without affecting running programs.

Hmm, this feature and trace-event trigger actions can give us
powerful on-the-fly scripting functionality...

> BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c)
> or better compiled from restricted C via GCC or LLVM
> 
> Q: What is the difference between existing BPF and extended BPF?
> A:
> Existing BPF insn from uapi/linux/filter.h
> struct sock_filter {
> __u16   code;   /* Actual filter code */
> __u8    jt; /* Jump true */
> __u8    jf; /* Jump false */
> __u32   k;  /* Generic multiuse field */
> };
> 
> Extended BPF insn from linux/bpf.h
> struct bpf_insn {
> __u8    code;    /* opcode */
> __u8    a_reg:4; /* dest register */
> __u8    x_reg:4; /* source register */
> __s16   off; /* signed offset */
> __s32   imm; /* signed immediate constant */
> };
> 
> opcode encoding is the same between old BPF and extended BPF.
> Original BPF has two 32-bit registers.
> Extended BPF has ten 64-bit registers.
> That is the main difference.
> 
> Old BPF was using jt/jf fields for jump-insn only.
> New BPF combines them into generic 'off' field for jump and non-jump insns.
> k==imm field has the same meaning.

Looks very interesting. :)

Thank you!

-- 
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Ingo Molnar

* Alexei Starovoitov  wrote:

> Hi All,
> 
> the following set of patches adds BPF support to trace filters.
> 
> Trace filters can be written in C and allow safe read-only access to 
> any kernel data structure. Like systemtap but with safety guaranteed 
> by kernel.

Very cool! (Added various other folks who might be interested in this 
to the Cc: list.)

I have one generic concern:

It would be important to make it easy to extract loaded BPF code from 
the kernel in source code equivalent form, which compiles to the same 
BPF code.

I.e. I think it would be fundamentally important to make sure that 
this is all within the kernel's license domain, to make it very clear 
there can be no 'binary only' BPF scripts.

By up-loading BPF into a kernel the person loading it agrees to make 
that code available to all users of that system who can access it, 
under the same license as the kernel's code (or under a more 
permissive license).

The last thing we want is people getting funny ideas and writing 
drivers in BPF and hiding the code or making license claims over it 
...

I.e. we want to allow flexible plugins technologically, but make sure 
people who run into such a plugin can modify and improve it under the 
same license as they can modify and improve the kernel itself!

[ People can still 'hide' their sekrit plugins if they want to, by not
  distributing them to anyone who'd redistribute it widely. ]

> The user can do:
> cat bpf_program > /sys/kernel/debug/tracing/.../filter
> if tracing event is either static or dynamic via kprobe_events.
> 
> The filter program may look like:
> void filter(struct bpf_context *ctx)
> {
> char devname[4] = "eth5";
> struct net_device *dev;
> struct sk_buff *skb = 0;
> 
> dev = (struct net_device *)ctx->regs.si;
> if (bpf_memcmp(dev->name, devname, 4) == 0) {
> char fmt[] = "skb %p dev %p eth5\n";
> bpf_trace_printk(fmt, skb, dev, 0, 0);
> }
> }
> 
> The kernel will do static analysis of bpf program to make sure that 
> it cannot crash the kernel (doesn't have loops, valid 
> memory/register accesses, etc). Then kernel will map bpf 
> instructions to x86 instructions and let it run in the place of 
> trace filter.
> 
> To demonstrate performance I did a synthetic test:
> dev = init_net.loopback_dev;
> do_gettimeofday(&start_tv);
> for (i = 0; i < 1000000; i++) {
> struct sk_buff *skb;
> skb = netdev_alloc_skb(dev, 128);
> kfree_skb(skb);
> }
> do_gettimeofday(&end_tv);
> time = end_tv.tv_sec - start_tv.tv_sec;
> time *= USEC_PER_SEC;
> time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
> 
> printk("1M skb alloc/free %lld (usecs)\n", time);
> 
> no tracing
> [   33.450966] 1M skb alloc/free 145179 (usecs)
> 
> echo 1 > enable
> [   97.186379] 1M skb alloc/free 240419 (usecs)
> (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)
> 
> echo 'name==eth5' > filter
> [  139.644161] 1M skb alloc/free 302552 (usecs)
> (running filter_match_preds() for every skb and discarding
> event_buffer is even slower)
> 
> cat bpf_prog > filter
> [  171.150566] 1M skb alloc/free 199463 (usecs)
> (JITed bpf program is safely checking dev->name == eth5 and discarding)

So, to do the math ((traced total - untraced baseline) / 1M events):

   tracing   'all' overhead:   95 nsecs per event
   tracing 'eth5 + old filter' overhead:  157 nsecs per event
   tracing 'eth5 + BPF filter' overhead:   54 nsecs per event

So via BPF and a fairly trivial filter, we are able to reduce tracing 
overhead for real - while old-style filters increase it.

In addition to that we now also have arbitrary BPF scripts, full C 
programs (or written in any other language from which BPF bytecode can 
be generated) enabled.

Seems like a massive win-win scenario to me ;-)

> echo 0 > enable
> [  258.073593] 1M skb alloc/free 144919 (usecs)
> (tracing is disabled, performance is back to original)
> 
> The C program compiled into BPF and then JITed into x86 is faster 
> than filter_match_preds() approach (199-145 msec vs 302-145 msec)
> 
> tracing+bpf is a tool for safe read-only access to variables without 
> recompiling the kernel and without affecting running programs.
> 
> BPF filters can be written manually (see 
> tools/bpf/trace/filter_ex1.c) or better compiled from restricted C 
> via GCC or LLVM

> Q: What is the difference between existing BPF and extended BPF?
> A:
> Existing BPF insn from uapi/linux/filter.h
> struct sock_filter {
> __u16   code;   /* Actual filter code */
> __u8    jt; /* Jump true */
> __u8    jf; /* Jump false */
> __u32   k;  /* Generic multiuse field */
> };
> 
> Extended BPF insn from linux/bpf.h
> struct bpf_insn {
> __u8    code;    /* opcode */
> __u8    a_reg:4; /* dest register */
> __u8    x_reg:4; /* source register */
> __s16   off;     /* signed offset */
> __s32   imm;     /* signed immediate constant */
> };
> 

Re: [RFC PATCH tip 0/5] tracing filters with BPF

2013-12-03 Thread Alexei Starovoitov
On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote:
> Alexei Starovoitov a...@plumgrid.com writes:
>
> Can you do some performance comparison compared to e.g. ktap?
> How much faster is it?

imo the most interesting ktap scripts (like kmalloc-top.kp) need
tables and timers.
Tables are almost ready for prime time, but timers I prefer to keep
out of the kernel.
I would like the bpf filter to fill tables with interesting data in the
kernel, up to a predefined limit, and to periodically read and clear
the tables from userspace.
This way I will be able to do nettop.stp- and iotop.stp-like programs.
So I'm still thinking about what a clean kernel/user interface for
bpf-defined tables should be.
The format of the keys and elements of a table is defined within the
bpf program. When the bpf program is loaded, the tables are allocated,
and the program can then lookup/update entries in them. At the same
time a corresponding userspace program can read the tables of this
particular bpf program over netlink.
Creating separate debugfs files for every filter feels too slow and
feature-limited, since files are an all-or-nothing interface. Netlink
access to bpf tables feels cleaner. Userspace will use libmnl to
access them. Other ideas?
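
For example (a purely hypothetical interface, since the table API is
still being designed here; bpf_table_lookup() and table_id are invented
names), a kmalloc-top style filter could declare:

struct key { long call_site; };                 /* key format */
struct val { u64 total_bytes; u64 count; };     /* element format */

void filter(struct bpf_context *ctx)
{
        struct key k = { .call_site = ctx->regs.di };
        struct val *v = bpf_table_lookup(table_id, &k);

        if (v) {
                v->total_bytes += ctx->regs.si; /* allocation size */
                v->count++;
        }
}

and a userspace tool would periodically dump and clear the table over
netlink to print the top allocators.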

In the meantime I'll do some simple
  trace probe:xx { print }
performance test…

> While it sounds interesting, I would strongly advise to make this
> capability only available to root. Traditionally lots of complex byte
> code languages which were designed to be "safe" and verifiable weren't
> really. E.g. I managed to crash things with "safe" systemtap multiple
> times. And we all know what happened to Java.
>
> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Tracing filters are for root only today and should stay this way.
As far as safety of bpf… hard to argue with the systemtap point ;)
Though existing bpf is generally accepted to be safe,
extended bpf needs time to prove itself.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC PATCH tip 0/5] tracing filters with BPF

2013-12-02 Thread Alexei Starovoitov
Hi All,

the following set of patches adds BPF support to trace filters.

Trace filters can be written in C and allow safe read-only access to any
kernel data structure. Like systemtap but with safety guaranteed by kernel.

The user can do:
cat bpf_program > /sys/kernel/debug/tracing/.../filter
whether the tracing event is static or dynamic (via kprobe_events).

The filter program may look like:
void filter(struct bpf_context *ctx)
{
char devname[4] = "eth5";
struct net_device *dev;
struct sk_buff *skb = 0;

dev = (struct net_device *)ctx->regs.si;
if (bpf_memcmp(dev->name, devname, 4) == 0) {
char fmt[] = "skb %p dev %p eth5\n";
bpf_trace_printk(fmt, skb, dev, 0, 0);
}
}

The kernel will do static analysis of bpf program to make sure that it cannot
crash the kernel (doesn't have loops, valid memory/register accesses, etc).
Then kernel will map bpf instructions to x86 instructions and let it
run in the place of trace filter.
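
For instance, a filter like the sketch below would be rejected at load
time, since the back-edge makes its runtime unbounded (an illustrative
example, not from the patch set):

void filter(struct bpf_context *ctx)
{
        char fmt[] = "i %d\n";
        int i;

        for (i = 0; ; i++)      /* loop: the checker rejects back-edges */
                bpf_trace_printk(fmt, i, 0, 0, 0);
}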

To demonstrate performance I did a synthetic test:
dev = init_net.loopback_dev;
do_gettimeofday(&start_tv);
for (i = 0; i < 1000000; i++) {
struct sk_buff *skb;
skb = netdev_alloc_skb(dev, 128);
kfree_skb(skb);
}
do_gettimeofday(&end_tv);
time = end_tv.tv_sec - start_tv.tv_sec;
time *= USEC_PER_SEC;
time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);

printk("1M skb alloc/free %lld (usecs)\n", time);

no tracing
[   33.450966] 1M skb alloc/free 145179 (usecs)

echo 1 > enable
[   97.186379] 1M skb alloc/free 240419 (usecs)
(tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)

echo 'name==eth5' > filter
[  139.644161] 1M skb alloc/free 302552 (usecs)
(running filter_match_preds() for every skb and discarding
event_buffer is even slower)

cat bpf_prog > filter
[  171.150566] 1M skb alloc/free 199463 (usecs)
(JITed bpf program is safely checking dev->name == eth5 and discarding)

echo 0 > enable
[  258.073593] 1M skb alloc/free 144919 (usecs)
(tracing is disabled, performance is back to original)

The C program compiled into BPF and then JITed into x86 is faster than
filter_match_preds() approach (199-145 msec vs 302-145 msec)

tracing+bpf is a tool for safe read-only access to variables without recompiling
the kernel and without affecting running programs.

BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c)
or better compiled from restricted C via GCC or LLVM

Q: What is the difference between existing BPF and extended BPF?
A:
Existing BPF insn from uapi/linux/filter.h
struct sock_filter {
__u16   code;   /* Actual filter code */
__u8    jt; /* Jump true */
__u8    jf; /* Jump false */
__u32   k;  /* Generic multiuse field */
};

Extended BPF insn from linux/bpf.h
struct bpf_insn {
__u8    code;    /* opcode */
__u8    a_reg:4; /* dest register */
__u8    x_reg:4; /* source register */
__s16   off; /* signed offset */
__s32   imm; /* signed immediate constant */
};

opcode encoding is the same between old BPF and extended BPF.
Original BPF has two 32-bit registers.
Extended BPF has ten 64-bit registers.
That is the main difference.

Old BPF was using jt/jf fields for jump-insn only.
New BPF combines them into generic 'off' field for jump and non-jump insns.
k==imm field has the same meaning.
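
To make the encoding difference concrete, here is one conditional jump
written both ways (illustrative field values; BPF_JMP/BPF_JEQ/BPF_K are
the existing macros from uapi/linux/filter.h, the register and offset
choices are just an example):

/* old BPF: if (A == 5) skip 1 insn else skip 2 insns */
struct sock_filter old_jeq = { BPF_JMP|BPF_JEQ|BPF_K, 1, 2, 5 };

/* extended BPF: if (r1 == 5) jump +1 insn, otherwise fall through */
struct bpf_insn new_jeq = { BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 1, 5 };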

Thanks

Alexei Starovoitov (5):
  Extended BPF core framework
  Extended BPF JIT for x86-64
  Extended BPF (64-bit BPF) design document
  use BPF in tracing filters
  tracing filter examples in BPF

 Documentation/bpf_jit.txt|  204 +++
 arch/x86/Kconfig |1 +
 arch/x86/net/Makefile|1 +
 arch/x86/net/bpf64_jit_comp.c|  625 
 arch/x86/net/bpf_jit_comp.c  |   23 +-
 arch/x86/net/bpf_jit_comp.h  |   35 ++
 include/linux/bpf.h  |  149 +
 include/linux/bpf_jit.h  |  129 +
 include/linux/ftrace_event.h |3 +
 include/trace/bpf_trace.h|   27 +
 include/trace/ftrace.h   |   14 +
 kernel/Makefile  |1 +
 kernel/bpf_jit/Makefile  |3 +
 kernel/bpf_jit/bpf_check.c   | 1054 ++
 kernel/bpf_jit/bpf_run.c |  452 +++
 kernel/trace/Kconfig |1 +
 kernel/trace/Makefile|1 +
 kernel/trace/bpf_trace_callbacks.c   |  191 ++
 kernel/trace/trace.c |7 +
 kernel/trace/trace.h |   11 +-
 kernel/trace/trace_events.c  |9 +-
 kernel/trace/trace_events_filter.c   |   61 +-
 kernel/trace/trace_kprobe.c  |6 +
 lib/Kconfig.debug|   15 +
 tools/bpf/llvm/README.txt|6 +
 tools/bpf/trace/Makefile 
