Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 10, 2013 at 7:35 PM, Masami Hiramatsu wrote:
> (2013/12/11 11:32), Alexei Starovoitov wrote:
>> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar wrote:
>>>
>>> * Alexei Starovoitov wrote:
>>>
>>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>>> filters. But it must not replace what the current filters do
>>>>> now. That is, it can be an add-on, but not a replacement.
>>>>
>>>> Of course. tracing filters via bpf is an additional tool for kernel
>>>> debugging. bpf by itself has use cases beyond tracing.
>>>
>>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>>> most people.
>>
>> there is a misunderstanding here.
>> I was saying 'of course' to 'not replace current filter infra'.
>>
>> bpf does not depend on debug info.
>> That's the key difference between the 'perf probe' approach and bpf filters.
>>
>> Masami is right that what I was trying to achieve with bpf filters
>> is similar to 'perf probe': insert a dynamic probe anywhere
>> in the kernel, walk pointers and data structures, print interesting stuff.
>>
>> 'perf probe' does it via scanning vmlinux with debug info.
>> bpf filters don't need it.
>> tools/bpf/trace/*_orig.c examples only depend on linux headers
>> in /lib/modules/../build/include/
>> Today the bpf compiler's struct layout is the same as x86_64.
>>
>> Tomorrow the bpf compiler will have flags to adjust the endianness,
>> pointer size, etc. of the front-end, similar to the -m32/-m64 and
>> -m*-endian flags. The neat part is that I don't need to do any work,
>> just enable it properly in the bpf backend. From the gcc/llvm point of
>> view, bpf is yet another 'hw' architecture the compiler emits code for.
>> So when the C code of filter_ex1_orig.c does 'skb->dev', the compiler
>> determines the field offset by looking at /lib/modules/.../include/skbuff.h,
>> whereas for 'perf probe' 'skb->dev' means walking the debug info.
>
> Right, the offset of the data structure can be taken from the header etc.
> However, how would the bpf code get the register or stack assignment of
> skb itself? In the tracepoint macro, it will be able to get it from
> function parameters (it needs a trick, like jprobe does).
> I doubt you can do that on kprobes/uprobes without any debuginfo
> support. :(

the 4/5 diff actually shows how it's working ;)

for kprobes it works at the function entry, since the arguments are still
in registers, and it walks the pointers further down.
It cannot do func+line_number as perf-probe does, of course.

for tracepoints it's the same trick: call a non-inlined func with the
traceprobe args and call the inlined crash_setup_regs() that stores the regs.

Of course, there are limitations. For example, the 7th func argument goes
on the stack and requires more work to get out, and if a struct is not
defined in a .h, it would need to be redefined in filter.c. Corner cases,
as you said. Today the user of a bpf filter needs to know that arg1 goes
into %rdi and so on; that is easy to clean up.

>> Another use case is to optimize the fetch sequences of dynamic probes
>> as Masami suggested, but the backward compatibility requirement
>> would preserve two ways of doing it as well.
>
> The backward compatibility issue is only for the interface, but not
> for the implementation, I think. :) The fetch method and filter
> pred already parse the argument into a syntax tree. IMHO, bpf
> can optimize that tree to just a simple opcode stream.

ahh. yes. that's doable.

Thanks
Alexei
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
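Alexei's remark that "arg1 goes into %rdi and so on" is just the x86_64 SysV calling convention for integer arguments. A minimal userspace sketch of that mapping, assuming a hypothetical register snapshot taken at function entry (the struct and helper below are illustrative, not the kernel's pt_regs or any actual bpf-filter API):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the registers saved at a kprobe hit.
 * Field names mirror x86_64 register names, but this is an
 * illustration, not the kernel's actual struct layout. */
struct regs_snapshot {
    uint64_t di, si, dx, cx, r8, r9;  /* integer argument registers */
};

/* x86_64 SysV calling convention: the first six integer/pointer
 * arguments arrive in %rdi, %rsi, %rdx, %rcx, %r8, %r9.
 * A filter that wants "arg1" at function entry reads %rdi. */
static uint64_t fetch_arg(const struct regs_snapshot *regs, int n)
{
    switch (n) {
    case 1: return regs->di;
    case 2: return regs->si;
    case 3: return regs->dx;
    case 4: return regs->cx;
    case 5: return regs->r8;
    case 6: return regs->r9;
    default: return 0;  /* arg7+ lives on the stack: more work, as noted above */
    }
}
```

From there the filter can walk pointers further down, exactly as the mail describes for kprobes at function entry.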
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/11 11:32), Alexei Starovoitov wrote:
> On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar wrote:
>>
>> * Alexei Starovoitov wrote:
>>
>>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>>> filters. But it must not replace what the current filters do
>>>> now. That is, it can be an add-on, but not a replacement.
>>>
>>> Of course. tracing filters via bpf is an additional tool for kernel
>>> debugging. bpf by itself has use cases beyond tracing.
>>
>> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
>> most people.
>
> there is a misunderstanding here.
> I was saying 'of course' to 'not replace current filter infra'.
>
> bpf does not depend on debug info.
> That's the key difference between the 'perf probe' approach and bpf filters.
>
> Masami is right that what I was trying to achieve with bpf filters
> is similar to 'perf probe': insert a dynamic probe anywhere
> in the kernel, walk pointers and data structures, print interesting stuff.
>
> 'perf probe' does it via scanning vmlinux with debug info.
> bpf filters don't need it.
> tools/bpf/trace/*_orig.c examples only depend on linux headers
> in /lib/modules/../build/include/
> Today the bpf compiler's struct layout is the same as x86_64.
>
> Tomorrow the bpf compiler will have flags to adjust the endianness,
> pointer size, etc. of the front-end, similar to the -m32/-m64 and
> -m*-endian flags. The neat part is that I don't need to do any work,
> just enable it properly in the bpf backend. From the gcc/llvm point of
> view, bpf is yet another 'hw' architecture the compiler emits code for.
> So when the C code of filter_ex1_orig.c does 'skb->dev', the compiler
> determines the field offset by looking at /lib/modules/.../include/skbuff.h,
> whereas for 'perf probe' 'skb->dev' means walking the debug info.

Right, the offset of the data structure can be taken from the header etc.

However, how would the bpf code get the register or stack assignment of
skb itself?
In the tracepoint macro, it will be able to get it from function
parameters (it needs a trick, like jprobe does).
I doubt you can do that on kprobes/uprobes without any debuginfo
support. :(

And is it possible to trace a field in a data structure which is
defined locally in somewhere.c? :) (maybe it's just a corner case)

> Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
> walks all data structures in the same way x86_64 does.
> Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
> Note that all -m* flags will be in one compiler. It won't grow any bigger
> because of that. All of it is already supported by the C front-ends.
> It may sound complex, but it is really very little code for the bpf backend.
>
> I didn't look inside systemtap/ktap enough to say how much they're
> relying on the presence of debug info to make a comparison.
>
> I see two main use cases for bpf tracing filters: debugging a live kernel
> and collecting stats. The same tricks that [sk]tap do with their maps.
> Or maybe some of the stats that 'perf record' collects in userspace
> can be collected by a bpf filter in the kernel and stored into a generic
> bpf table?
>
>> Would it be possible to make BPF filters recognize exposed details
>> like the current filters do, without depending on the vmlinux?
>
> Well, if you say that the presence of linux headers is also too much to ask,
> I can hook bpf after the probes have stored all the args.
>
> This way the current simple filter syntax can move to userspace.
> 'arg1==x || arg2!=y' can be parsed by userspace, the bpf code
> generated and fed into the kernel. It will be faster than walk_pred_tree(),
> but if we cannot remove 2k lines from trace_events_filter.c
> because of backward compatibility, extra performance becomes
> the only reason to have two different implementations.
>
> Another use case is to optimize the fetch sequences of dynamic probes
> as Masami suggested, but the backward compatibility requirement
> would preserve two ways of doing it as well.
The backward compatibility issue is only for the interface, but not
for the implementation, I think. :) The fetch method and filter
pred already parse the argument into a syntax tree. IMHO, bpf
can optimize that tree to just a simple opcode stream.

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
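Masami's closing suggestion — flattening the already-parsed fetch syntax tree into a simple opcode stream — can be sketched as follows. The types and opcodes are hypothetical, loosely modeled on the idea behind the fetch methods in kernel/trace/trace_probe.c, not on its real interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fetch syntax tree: either a register read or a
 * memory dereference at (child + offset). */
enum node_kind { NODE_REG, NODE_DEREF };

struct fetch_node {
    enum node_kind kind;
    int reg;                      /* NODE_REG: register index   */
    long offset;                  /* NODE_DEREF: byte offset    */
    struct fetch_node *child;     /* NODE_DEREF: address source */
};

/* Flat opcode stream an interpreter (or a BPF program) can run
 * without re-walking the tree on every event. */
enum opcode { OP_LOAD_REG, OP_DEREF };

struct op { enum opcode code; long arg; };

/* Post-order flattening: innermost node first, so each opcode
 * consumes the value produced by the previous one.  Returns the
 * next free slot in 'out'. */
static int flatten(const struct fetch_node *n, struct op *out, int pos)
{
    if (n->kind == NODE_REG) {
        out[pos].code = OP_LOAD_REG;
        out[pos].arg = n->reg;
        return pos + 1;
    }
    pos = flatten(n->child, out, pos);
    out[pos].code = OP_DEREF;
    out[pos].arg = n->offset;
    return pos + 1;
}
```

For example, a fetch like "dereference at offset 16 from register 5" flattens to the two-opcode stream OP_LOAD_REG(5), OP_DEREF(16), which is the kind of linear program the mail has in mind.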
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 10, 2013 at 7:47 AM, Ingo Molnar wrote:
>
> * Alexei Starovoitov wrote:
>
>>> I'm fine if it becomes a requirement to have a vmlinux built with
>>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>>> filters. But it must not replace what the current filters do
>>> now. That is, it can be an add-on, but not a replacement.
>>
>> Of course. tracing filters via bpf is an additional tool for kernel
>> debugging. bpf by itself has use cases beyond tracing.
>
> Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
> most people.

there is a misunderstanding here.
I was saying 'of course' to 'not replace current filter infra'.

bpf does not depend on debug info.
That's the key difference between the 'perf probe' approach and bpf filters.

Masami is right that what I was trying to achieve with bpf filters
is similar to 'perf probe': insert a dynamic probe anywhere
in the kernel, walk pointers and data structures, print interesting stuff.

'perf probe' does it via scanning vmlinux with debug info.
bpf filters don't need it.
tools/bpf/trace/*_orig.c examples only depend on linux headers
in /lib/modules/../build/include/
Today the bpf compiler's struct layout is the same as x86_64.

Tomorrow the bpf compiler will have flags to adjust the endianness,
pointer size, etc. of the front-end, similar to the -m32/-m64 and
-m*-endian flags. The neat part is that I don't need to do any work,
just enable it properly in the bpf backend. From the gcc/llvm point of
view, bpf is yet another 'hw' architecture the compiler emits code for.
So when the C code of filter_ex1_orig.c does 'skb->dev', the compiler
determines the field offset by looking at /lib/modules/.../include/skbuff.h,
whereas for 'perf probe' 'skb->dev' means walking the debug info.

Something like: cc1 -mlayout_x86_64 filter.c will produce bpf code that
walks all data structures in the same way x86_64 does.
Even if the user makes a mistake and uses -mlayout_aarch64, it won't crash.
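The header-only layout argument above can be seen with plain offsetof: the compiler needs nothing but the struct definition from a header to turn a member access like 'skb->dev' into a load at a fixed offset, with no DWARF/debug info involved. A toy stand-in struct (not the real sk_buff) makes the point:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a kernel struct pulled from a header under
 * /lib/modules/.../include/ -- NOT the real sk_buff layout. */
struct toy_skb {
    unsigned long len;
    void *head;
    void *dev;    /* 'skb->dev' compiles to a load at a fixed offset */
};

/* The compiler turns 'skb->dev' into 'load at base + offsetof(...)'
 * using only the struct definition from the header. */
static size_t dev_offset(void)
{
    return offsetof(struct toy_skb, dev);
}
```

This is exactly why the front-end's -mlayout_* choice matters: the same member access compiles to a different constant offset if the target layout rules (pointer size, alignment) differ.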
Note that all -m* flags will be in one compiler. It won't grow any bigger
because of that. All of it is already supported by the C front-ends.
It may sound complex, but it is really very little code for the bpf backend.

I didn't look inside systemtap/ktap enough to say how much they're
relying on the presence of debug info to make a comparison.

I see two main use cases for bpf tracing filters: debugging a live kernel
and collecting stats. The same tricks that [sk]tap do with their maps.
Or maybe some of the stats that 'perf record' collects in userspace
can be collected by a bpf filter in the kernel and stored into a generic
bpf table?

> Would it be possible to make BPF filters recognize exposed details
> like the current filters do, without depending on the vmlinux?

Well, if you say that the presence of linux headers is also too much to ask,
I can hook bpf after the probes have stored all the args.

This way the current simple filter syntax can move to userspace.
'arg1==x || arg2!=y' can be parsed by userspace, the bpf code
generated and fed into the kernel. It will be faster than walk_pred_tree(),
but if we cannot remove 2k lines from trace_events_filter.c
because of backward compatibility, extra performance becomes
the only reason to have two different implementations.

Another use case is to optimize the fetch sequences of dynamic probes
as Masami suggested, but the backward compatibility requirement
would preserve two ways of doing it as well.

imo the current hook of bpf into tracing is more compelling, but let me
think more about reusing data stored in the ring buffer.

Thanks
Alexei
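The 'parse in userspace, feed simple code to the kernel' idea for a filter like 'arg1==x || arg2!=y' might compile down to something like the short-circuiting instruction stream below. The instruction set is made up purely for illustration; it is not the real BPF encoding:

```c
#include <assert.h>
#include <stdint.h>

/* A made-up, minimal instruction set for filter predicates like
 * 'arg1 == x || arg2 != y'; not the real BPF encoding. */
enum pred_op { P_EQ, P_NE, P_OR_SC_TRUE, P_RET };

struct pred_insn {
    enum pred_op op;
    int arg_idx;       /* which fetched probe argument to test */
    uint64_t imm;      /* constant to compare against          */
};

/* Run the stream over the already-fetched probe arguments.
 * P_OR_SC_TRUE short-circuits: if the running result is already
 * true, the whole '||' expression is true and we return early. */
static int eval_pred(const struct pred_insn *prog, const uint64_t *args)
{
    int result = 0;
    for (;; prog++) {
        switch (prog->op) {
        case P_EQ: result = (args[prog->arg_idx] == prog->imm); break;
        case P_NE: result = (args[prog->arg_idx] != prog->imm); break;
        case P_OR_SC_TRUE: if (result) return 1; break;
        case P_RET: return result;
        }
    }
}
```

A straight-line program like this has no tree to walk per event, which is the source of the speedup over walk_pred_tree() that the mail refers to.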
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov wrote:

>> I'm fine if it becomes a requirement to have a vmlinux built with
>> DEBUG_INFO to use BPF and have a tool like perf to translate the
>> filters. But it must not replace what the current filters do
>> now. That is, it can be an add-on, but not a replacement.
>
> Of course. tracing filters via bpf is an additional tool for kernel
> debugging. bpf by itself has use cases beyond tracing.

Well, Steve has a point: forcing DEBUG_INFO is a big showstopper for
most people.

Would it be possible to make BPF filters recognize exposed details
like the current filters do, without depending on the vmlinux?

Thanks,

Ingo
Re: binary blob no more! Was: [RFC PATCH tip 0/5] tracing filters with BPF
On Sun, 8 Dec 2013 19:36:18 -0800 Alexei Starovoitov wrote:

> Actually I think there are a few ways to include the source equivalent
> in the bpf image.
>
> Approach #1
> include the original C source code in the bpf image:
> bpf_image = bpf_insns + original_C
> This implies that the C code can have #include's of linux kernel
> headers only, and it can only be C source.
> This way the user can do 'cat /sys/kernel/debug/bpf/filter', the kernel
> will print original_C, and these restrictions will guarantee that it
> will compile into similar bpf code whether the gcc or llvm compiler is
> used.
>
> Approach #2
> include the original llvm bitcode:
> bpf_image = bpf_insns + llvm_bc
> The user can do 'cat .../filter' and use llvm-dis to see human-readable
> bitcode.
> It takes practice to read it, but it's high level enough to understand
> what the filter is doing. llvm-llc can be used to generate bpf_insns
> again, or to generate C from the bitcode.
> Pro vs #1: bitcode is very compact.
> Con: only the llvm compiler can be used to generate bpf instructions.
>
> Enforcement can be done by having a userspace daemon that
> walks over all loaded filters and recompiles them from C or from bitcode.
>
> Please let me know which approach you prefer.

I don't like either. And different compilers may produce different
results, so that daemon may not be able to verify that what is in the C
code is really what's in the bitcode.

> I still think that bpf_image = bpf_insns + license_string is just as good,
> since bpf code can only call a tiny set of functions, so no matter what
> the code does its scope is very limited, and license enforcement
> guarantees that the original source has to be available,
> but I'm ok whichever way.

I like that approach much better. That is, all binary code must state
that it is under the GPL. That way, if you give a binary to someone, you
must also supply the source under the GPL license.

Having a disassembler in the kernel to see what code is loaded adds the
added benefit that you can see what is there.
We can have a userspace tool to make even more sense out of the
disassembled code. I don't think the kernel should have anything more
than a disassembler though. Maybe that's even too much, but at least a
human can inspect it a little without needing extra tools.

> Also please indicate whether the gcc or llvm backend is preferred to
> be hosted in tools.

If we end up placing a compiler in tools, then that compiler should also
be able to be used to compile the entire kernel. Maybe we will finally
get our kcc ;-)

-- Steve
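The scheme Steve prefers, bpf_image = bpf_insns + license_string, amounts to a loader-side license check in the spirit of module license enforcement. A hedged sketch of what such a check could look like — the struct layout and field names here are invented for illustration, not a kernel ABI:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of a loaded filter image:
 * the instructions plus a NUL-terminated license string.
 * Illustrative only -- not an actual kernel structure. */
struct bpf_image {
    uint32_t insn_cnt;
    const uint64_t *insns;     /* opaque instruction words */
    const char *license;
};

/* Loader-side check: refuse images that do not declare GPL,
 * so distributing the binary obliges distributing the source. */
static int license_ok(const struct bpf_image *img)
{
    return img->license && strcmp(img->license, "GPL") == 0;
}
```

The check is trivial by design: the enforcement is legal (GPL obligations) rather than technical, which is why the thread treats it as lighter-weight than shipping source or bitcode in the image.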
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/09 16:29), Namhyung Kim wrote:
> Hi Masami,
>
> On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
>> (2013/12/04 3:26), Alexei Starovoitov wrote:
>>> the only inconvenience so far is to know how parameters are getting
>>> into registers.
>>> on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that
>>> after the first step is done.
>>
>> Actually, that part is done by perf-probe and the ftrace dynamic events
>> (kernel/trace/trace_probe.c). I think this generic BPF is good for
>> re-implementing the fetch methods. :)
>
> For implementing the fetch method, it seems that it needs access to user
> memory, stack and/or current (task_struct - for utask or vma later) from
> the BPF VM as well. Isn't that OK from the security perspective?

Would you mean security or safety? :)
For safety, I think we can check that the BPF binary doesn't break
anything.

Anyway, for the fetch method, I think we have to make a generic syntax
tree for the archs which don't support BPF, and the BPF bytecode will be
generated from the syntax tree. IOW, I'd like to use BPF just for
optimizing the memory address calculation.

For security, it is hard to check what is the sensitive information in
the kernel, so I think it should be restricted to the root user for a
while.

> Anyway, I'll take a look at it later if I have time, but I want to get
> the existing/pending implementation merged first. :)

Yes, of course! :)

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
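Masami's safety point — "we can check that the BPF binary doesn't break anything" — is a static verification pass over the program before it is accepted. A toy sketch of that idea, with an invented opcode set (real BPF verification is of course far more involved):

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcode set for a fetch/filter program; illustrative only. */
enum vop { V_LOAD = 0, V_DEREF = 1, V_RET = 2 };

struct vinsn { int op; long arg; };

/* Static check before accepting a program: every opcode must be
 * known, and the program must end with V_RET with nothing after
 * the terminator.  Returns 1 if the program is accepted. */
static int verify(const struct vinsn *prog, size_t len)
{
    size_t i;
    if (len == 0)
        return 0;
    for (i = 0; i < len; i++) {
        if (prog[i].op < V_LOAD || prog[i].op > V_RET)
            return 0;               /* unknown opcode */
        if (prog[i].op == V_RET)
            return i == len - 1;    /* nothing after the terminator */
    }
    return 0;                       /* never terminated */
}
```

Checks of this kind are what let the fetch-method bytecode run without the kernel trusting the userspace compiler that produced it.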
Re: binary blob no more! Was: [RFC PATCH tip 0/5] tracing filters with BPF
On Sun, 8 Dec 2013 19:36:18 -0800
Alexei Starovoitov <a...@plumgrid.com> wrote:

> Actually I think there are a few ways to include the source equivalent
> in the bpf image.
>
> Approach #1
> include the original C source code in the bpf image:
>     bpf_image = bpf_insns + original_C
> this implies that the C code can have #include's of linux kernel
> headers only, and that it can only be C source. This way the user can
> do 'cat /sys/kernel/debug/bpf/filter', the kernel will print
> original_C, and these restrictions guarantee that it will compile into
> similar bpf code whether the gcc or llvm compiler is used.
>
> Approach #2
> include the original llvm bitcode:
>     bpf_image = bpf_insns + llvm_bc
> The user can do 'cat .../filter' and use llvm-dis to see
> human-readable bitcode. It takes practice to read it, but it's high
> level enough to understand what the filter is doing. llvm-llc can be
> used to generate bpf_insns again, or to generate C from the bitcode.
> Pro vs #1: bitcode is very compact.
> Con: only the llvm compiler can be used to generate bpf instructions.
>
> Enforcement can be done by having a user space daemon that walks over
> all loaded filters and recompiles them from C or from bitcode.
>
> Please let me know which approach you prefer.

I don't like either. And different compilers may produce different results, so that daemon may not be able to verify that what is in the C code is really what's in the bitcode.

> I still think that
>     bpf_image = bpf_insns + license_string
> is just as good, since bpf code can only call a tiny set of functions,
> so no matter what the code does its scope is very limited, and license
> enforcement guarantees that the original source has to be available.
> But I'm ok whichever way.

I like that approach much better. That is, all binary code must state that it is under the GPL. That way, if you give a binary to someone, you must also supply the source under the GPL license.

Having a disassembler in the kernel to see what code is loaded adds the benefit that you can see what is there. We can have a userspace tool to make even more sense out of the disassembled code. I don't think the kernel should have anything more than a disassembler though. Maybe that's even too much, but at least a human can inspect it a little without needing extra tools.

> Also please indicate whether the gcc or llvm backend is preferred to
> be hosted in tools.

If we end up placing a compiler in tools, then that compiler should also be able to be used to compile the entire kernel. Maybe we will finally get our kcc ;-)

-- Steve
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Hi Masami,

On Wed, 04 Dec 2013 10:13:37 +0900, Masami Hiramatsu wrote:
> (2013/12/04 3:26), Alexei Starovoitov wrote:
>> the only inconvenience so far is to know how parameters are getting
>> into registers. on x86-64, arg1 is in rdi, arg2 is in rsi,... I want
>> to improve that after the first step is done.
>
> Actually, that part is done by perf-probe and the ftrace dynamic events
> (kernel/trace/trace_probe.c). I think this generic BPF is good for
> re-implementing the fetch methods. :)

For implementing the fetch method, it seems that it needs access to user memory, the stack, and/or current (task_struct - for utask or vma later) from the BPF VM as well. Is that OK from the security perspective?

Anyway, I'll take a look at it later if I have time, but I want to get the existing/pending implementation merged first. :)

Thanks,
Namhyung
Re: Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/08 1:21), Jovi Zhangwei wrote:
> On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu wrote:
>> (2013/12/06 14:19), Jovi Zhangwei wrote:
>>> Hi Alexei,
>>>
>>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov wrote:
>>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote:
>>>>>>
>>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>>> How much faster is it?
>>>>
>>>> Did a simple ktap test with the 1M alloc_skb/kfree_skb toy test from
>>>> an earlier email:
>>>> trace skb:kfree_skb {
>>>>     if (arg2 == 0x100) {
>>>>         printf("%x %x\n", arg1, arg2)
>>>>     }
>>>> }
>>>> 1M skb alloc/free 350315 (usecs)
>>>>
>>>> baseline without any tracing:
>>>> 1M skb alloc/free 145400 (usecs)
>>>>
>>>> then the equivalent bpf test:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>     void *loc = (void *)ctx->regs.dx;
>>>>     if (loc == 0x100) {
>>>>         struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>         char fmt[] = "skb %p loc %p\n";
>>>>         bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>     }
>>>> }
>>>> 1M skb alloc/free 183214 (usecs)
>>>>
>>>> so with one 'if' condition the difference ktap vs bpf is
>>>> 350-145 vs 183-145
>>>>
>>>> obviously ktap is an interpreter, so it's not really fair.
>>>> To make it really unfair I did:
>>>> trace skb:kfree_skb {
>>>>     if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 ||
>>>>         arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 ||
>>>>         arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 ||
>>>>         arg2 == 0x1000) {
>>>>         printf("%x %x\n", arg1, arg2)
>>>>     }
>>>> }
>>>> 1M skb alloc/free 484280 (usecs)
>>>>
>>>> and the corresponding bpf:
>>>> void filter(struct bpf_context *ctx)
>>>> {
>>>>     void *loc = (void *)ctx->regs.dx;
>>>>     if (loc == 0x100 || loc == 0x200 || loc == 0x300 ||
>>>>         loc == 0x400 || loc == 0x500 || loc == 0x600 ||
>>>>         loc == 0x700 || loc == 0x800 || loc == 0x900 ||
>>>>         loc == 0x1000) {
>>>>         struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>>         char fmt[] = "skb %p loc %p\n";
>>>>         bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>>     }
>>>> }
>>>> 1M skb alloc/free 185660 (usecs)
>>>>
>>>> the difference is bigger now: 484-145 vs 185-145
>>>
>>> There are big differences between comparing arg2 (in ktap) and direct
>>> register access (ctx->regs.dx).
>>>
>>> The current argument fetching (arg2 in the above testcase)
>>> implementation in ktap is very inefficient, see
>>> ktap/interpreter/lib_kdebug.c:kp_event_getarg. The only way to speed
>>> it up is a kernel tracing code change: let an external tracing module
>>> access event fields not through a list lookup. This work is not
>>> started yet. :)
>>
>> I'm not sure why you can't access it directly from the ftrace-event
>> buffer. There is just a packed data structure and it is exposed via
>> debugfs. You can decode it and get an offset/size by using
>> libtraceevent.
>
> Then it means there is a need to pass the event field info into the
> kernel, which looks strange because a kernel structure is the source of
> the event field info; it's like a loop-back, and it needs to engage
> with libtraceevent in userspace.

No, the static traceevents have their own kernel data structures, but the dynamic events don't. They expose the data format (offset/type) via debugfs, but do not define a new data structure. So I meant the script is enough to take an offset and a method for casting to the corresponding size.

> (the side effect is it will make compilation slow, and consume more
> memory; sometimes it will process 20K events in one script, like
> 'trace probe:big_dso:*')

I doubt it, since you just need to get the formats only for the events the script is using.

> So "the only way" which I said is wrong; your approach indeed is
> another way. I just think maybe using an array instead of a list for
> event fields would be more efficient, if a list is not a must. We can
> check it more in the future.

Ah, perhaps I misunderstood the ktap implementation. Does it define dynamic events right before loading a bytecode? In that case, I recommend you change the loader to adjust the bytecode after defining the event, to tune the offset information so that it fits the target event format. e.g.

1) compile a bytecode with dummy offsets
2) define the new additional dynamic events
3) get the field offset information from the events
4) modify the bytecode in memory to replace the dummy offsets with the
   correct ones
5) load the bytecode

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
binary blob no more! Was: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 9:43 PM, Alexei Starovoitov wrote:
> On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar wrote:
>>
>>> Also I'm thinking to add a 'license_string' section to the bpf
>>> binary format and call license_is_gpl_compatible() on it during
>>> load. If false, then just reject it…. not even messing with taint
>>> flags... That would be a way stronger indication of bpf licensing
>>> terms than what we have for .ko
>>
>> But will BPF tools generate such gpl-compatible license tags by
>> default? If yes then this might work, combined with the facility
>> below. If not then it's just a nuisance to users.
>
> yes. similar to the existing .ko module_license() tag. see below.
>
>> My concern would be solved by adding a facility to always be able to
>> dump source code as well, i.e. trivially transform it to C or so, so
>> that people can review it - or just edit it on the fly, recompile and
>> reinsert? Most BPF scripts ought to be pretty simple.
>
> C code has '#include' in it, so without storing fully preprocessed
> code it will not be equivalent, but then the true source will be
> gigantic. It can be zipped, but that sounds like overkill.
> Also we might want other languages with their own dependent includes.
> Sure, we can have a section in the bpf binary that has the source, but
> it's not enforceable. The kernel cannot know that it's the actual
> source: gcc/llvm will produce different bpf code out of the same
> source, the source may be in C or in language X, etc.
> It doesn't seem that including some form of source will help with
> enforcing the license.
>
> imo requiring a module_license("gpl"); line in the C code, and an
> equivalent string in all other languages that want to translate to
> bpf, would be a stronger indication of licensing terms. The compiler
> would then have to include that string in the 'license_string' section
> and the kernel can actually enforce it.

Actually I think there are a few ways to include the source equivalent in the bpf image.

Approach #1
include the original C source code in the bpf image:
    bpf_image = bpf_insns + original_C
This implies that the C code can have #include's of linux kernel headers only, and that it can only be C source. This way the user can do 'cat /sys/kernel/debug/bpf/filter', the kernel will print original_C, and these restrictions guarantee that it will compile into similar bpf code whether the gcc or llvm compiler is used.

Approach #2
include the original llvm bitcode:
    bpf_image = bpf_insns + llvm_bc
The user can do 'cat .../filter' and use llvm-dis to see human-readable bitcode. It takes practice to read it, but it's high level enough to understand what the filter is doing. llvm-llc can be used to generate bpf_insns again, or to generate C from the bitcode.
Pro vs #1: bitcode is very compact.
Con: only the llvm compiler can be used to generate bpf instructions.

Enforcement can be done by having a user space daemon that walks over all loaded filters and recompiles them from C or from bitcode.

Please let me know which approach you prefer.

I still think that
    bpf_image = bpf_insns + license_string
is just as good, since bpf code can only call a tiny set of functions, so no matter what the code does its scope is very limited, and license enforcement guarantees that the original source has to be available. But I'm ok whichever way.

Also please indicate whether the gcc or llvm backend is preferred to be hosted in tools.

Build of the gcc backend is slow (takes ~100 sec), since the front-end, optimizer and backend are a single binary of ~13M. It doesn't need any other files to compile filter.c into a bpf_image.

Build of the llvm backend ('llc') takes ~10 sec, since it has to compile only the bpf backend files. But it would need the clang package to translate C into llvm bitcode and 'llc' (a single 8M binary) to compile the bitcode into a bpf_image.

Thanks
Alexei
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Sat, Dec 7, 2013 at 9:12 AM, Alexei Starovoitov wrote:
> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen wrote:
>> "H. Peter Anvin" writes:
>>>
>>> Not to mention that in that case we might as well -- since we need a
>>> compiler anyway -- generate the machine code in user space; the JIT
>>> solution really only is useful if it can provide something that we
>>> can't do otherwise, e.g. enable it in secure boot environments.
>>
>> I can see there may be some setups which don't have a compiler
>> (e.g. I know some people don't use systemtap because of that)
>> But this needs a custom gcc install too as far as I understand.
>
> fyi the custom gcc is a single 13M binary. It doesn't depend on any
> include files or any libraries, and can be easily packaged together
> with perf... even for an embedded environment.

Hmm, a 13M binary is big IMO; perf is just 5M after compiling on my system. I'm not sure embedding a custom gcc into perf is a good idea. (And would we need to compile that custom gcc every time we build perf?)

IMO gcc's size is not the whole/main reason why embedded systems don't install it. I have seen many, many production embedded systems, and none of them install gcc, or gdb, etc. I would never expect Android to install gcc some day, and I would be really surprised if a telecom vendor delivered a Linux board with gcc installed to customers.

Another question: does the custom gcc for bpf-filters need kernel header files for compilation? If so, then this issue is bigger than gcc's size for embedded systems (the same problem as SystemTap).

Thanks,

Jovi.
Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Sat, Dec 7, 2013 at 7:58 AM, Masami Hiramatsu wrote:
> (2013/12/06 14:19), Jovi Zhangwei wrote:
>> Hi Alexei,
>>
>> On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov wrote:
>>>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote:
>>>>>
>>>>> Can you do some performance comparison compared to e.g. ktap?
>>>>> How much faster is it?
>>>
>>> Did a simple ktap test with the 1M alloc_skb/kfree_skb toy test from
>>> an earlier email:
>>> trace skb:kfree_skb {
>>>     if (arg2 == 0x100) {
>>>         printf("%x %x\n", arg1, arg2)
>>>     }
>>> }
>>> 1M skb alloc/free 350315 (usecs)
>>>
>>> baseline without any tracing:
>>> 1M skb alloc/free 145400 (usecs)
>>>
>>> then the equivalent bpf test:
>>> void filter(struct bpf_context *ctx)
>>> {
>>>     void *loc = (void *)ctx->regs.dx;
>>>     if (loc == 0x100) {
>>>         struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>         char fmt[] = "skb %p loc %p\n";
>>>         bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>     }
>>> }
>>> 1M skb alloc/free 183214 (usecs)
>>>
>>> so with one 'if' condition the difference ktap vs bpf is
>>> 350-145 vs 183-145
>>>
>>> obviously ktap is an interpreter, so it's not really fair.
>>>
>>> To make it really unfair I did:
>>> trace skb:kfree_skb {
>>>     if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 ||
>>>         arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 ||
>>>         arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 ||
>>>         arg2 == 0x1000) {
>>>         printf("%x %x\n", arg1, arg2)
>>>     }
>>> }
>>> 1M skb alloc/free 484280 (usecs)
>>>
>>> and the corresponding bpf:
>>> void filter(struct bpf_context *ctx)
>>> {
>>>     void *loc = (void *)ctx->regs.dx;
>>>     if (loc == 0x100 || loc == 0x200 || loc == 0x300 ||
>>>         loc == 0x400 || loc == 0x500 || loc == 0x600 ||
>>>         loc == 0x700 || loc == 0x800 || loc == 0x900 ||
>>>         loc == 0x1000) {
>>>         struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>>>         char fmt[] = "skb %p loc %p\n";
>>>         bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>>>     }
>>> }
>>> 1M skb alloc/free 185660 (usecs)
>>>
>>> the difference is bigger now: 484-145 vs 185-145
>>
>> There are big differences between comparing arg2 (in ktap) and direct
>> register access (ctx->regs.dx).
>>
>> The current argument fetching (arg2 in the above testcase)
>> implementation in ktap is very inefficient, see
>> ktap/interpreter/lib_kdebug.c:kp_event_getarg. The only way to speed
>> it up is a kernel tracing code change: let an external tracing module
>> access event fields not through a list lookup. This work is not
>> started yet. :)
>
> I'm not sure why you can't access it directly from the ftrace-event
> buffer. There is just a packed data structure and it is exposed via
> debugfs. You can decode it and get an offset/size by using
> libtraceevent.

Then it means there is a need to pass the event field info into the kernel, which looks strange because a kernel structure is the source of the event field info; it's like a loop-back, and it needs to engage with libtraceevent in userspace. (The side effect is that it will make compilation slow and consume more memory; sometimes it will process 20K events in one script, like 'trace probe:big_dso:*'.)

So "the only way" which I said is wrong; your approach indeed is another way. I just think maybe using an array instead of a list for event fields would be more efficient, if a list is not a must. We can check it more in the future.

Thanks.

Jovi.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen wrote: > "H. Peter Anvin" writes: >> >> Not to mention that in that case we might as well -- since we need a >> compiler anyway -- generate the machine code in user space; the JIT >> solution really only is useful if it can provide something that we can't >> do otherwise, e.g. enable it in secure boot environments. > > I can see there may be some setups which don't have a compiler > (e.g. I know some people don't use systemtap because of that) > But this needs a custom gcc install too as far as I understand. fyi custom gcc is a single 13M binary. It doesn't depend on any include files or any libraries. and can be easily packaged together with perf... even for embedded environment.
Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Fri, Dec 6, 2013 at 3:54 PM, Masami Hiramatsu wrote: > (2013/12/06 14:16), Alexei Starovoitov wrote: >> On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen wrote: the difference is bigger now: 484-145 vs 185-145 >>> >>> This is a obvious improvement, but imho not big enough to be extremely >>> compelling (< cost 1-2 cache misses, no orders of magnitude improvements >>> that would justify a lot of code) >> >> hmm. we're comparing against ktap here… >> which has 5x more kernel code and 8x slower in this test... >> >>> Your code requires a compiler, so from my perspective it >>> wouldn't be a lot easier or faster to use than just changing >>> the code directly and recompile. >>> >>> The users want something simple too that shields them from >>> having to learn all the internals. They don't want to recompile. >>> As far as I can tell your code is a bit too low level for that, >>> and the requirement for the compiler may also scare them. >>> >>> Where exactly does it fit? >> >> the goal is to have llvm compiler next to perf, wrapped in a user friendly >> way. >> >> compiling small filter vs recompiling full kernel… >> inserting into live kernel vs rebooting … >> not sure how you're saying it's equivalent. >> >> In my kernel debugging experience current tools (tracing, systemtap) >> were rarely enough. >> I always had to add my own printks through the code, recompile and reboot. >> Often just to see that it's not the place where I want to print things >> or it's too verbose. >> Then I would adjust printks, recompile and reboot again. >> That was slow and tedious, since I would be crashing things from time to time >> just because skb doesn't always have a valid dev or I made a typo. >> For debugging I do really need something quick and dirty that lets me >> add my own printk >> of whatever structs I want anywhere in the kernel without crashing it. >> That's exactly what bpf tracing filters do. > > I recommend you to use perf-probe. That will give you an easy solution. 
:) it is indeed very cool. Thanks!
Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/06 14:19), Jovi Zhangwei wrote: > Hi Alexei, > > On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov wrote: >>> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote: Can you do some performance comparison compared to e.g. ktap? How much faster is it? >> >> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: >> trace skb:kfree_skb { >> if (arg2 == 0x100) { >> printf("%x %x\n", arg1, arg2) >> } >> } >> 1M skb alloc/free 350315 (usecs) >> >> baseline without any tracing: >> 1M skb alloc/free 145400 (usecs) >> >> then equivalent bpf test: >> void filter(struct bpf_context *ctx) >> { >> void *loc = (void *)ctx->regs.dx; >> if (loc == 0x100) { >> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; >> char fmt[] = "skb %p loc %p\n"; >> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); >> } >> } >> 1M skb alloc/free 183214 (usecs) >> >> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 >> >> obviously ktap is an interpreter, so it's not really fair. 
>> >> To make it really unfair I did: >> trace skb:kfree_skb { >> if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 >> || >> arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 >> || >> arg2 == 0x900 || arg2 == 0x1000) { >> printf("%x %x\n", arg1, arg2) >> } >> } >> 1M skb alloc/free 484280 (usecs) >> >> and corresponding bpf: >> void filter(struct bpf_context *ctx) >> { >> void *loc = (void *)ctx->regs.dx; >> if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 || >> loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 || >> loc == 0x900 || loc == 0x1000) { >> struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; >> char fmt[] = "skb %p loc %p\n"; >> bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); >> } >> } >> 1M skb alloc/free 185660 (usecs) >> >> the difference is bigger now: 484-145 vs 185-145 >> > There have big differences for compare arg2(in ktap) with direct register > access(ctx->regs.dx). > > The current argument fetching(arg2 in above testcase) implementation in ktap > is very inefficient, see ktap/interpreter/lib_kdebug.c:kp_event_getarg. > The only way to speedup is kernel tracing code change, let external tracing > module access event field not through list lookup. This work is not > started yet. :) I'm not sure why you can't access it directly from ftrace-event buffer. There is just a packed data structure and it is exposed via debugfs. You can decode it and can get an offset/size by using libtraceevent. Thank you, -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu...@hitachi.com
Re: Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/06 14:16), Alexei Starovoitov wrote: > On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen wrote: >>> the difference is bigger now: 484-145 vs 185-145 >> >> This is a obvious improvement, but imho not big enough to be extremely >> compelling (< cost 1-2 cache misses, no orders of magnitude improvements >> that would justify a lot of code) > > hmm. we're comparing against ktap here… > which has 5x more kernel code and 8x slower in this test... > >> Your code requires a compiler, so from my perspective it >> wouldn't be a lot easier or faster to use than just changing >> the code directly and recompile. >> >> The users want something simple too that shields them from >> having to learn all the internals. They don't want to recompile. >> As far as I can tell your code is a bit too low level for that, >> and the requirement for the compiler may also scare them. >> >> Where exactly does it fit? > > the goal is to have llvm compiler next to perf, wrapped in a user friendly > way. > > compiling small filter vs recompiling full kernel… > inserting into live kernel vs rebooting … > not sure how you're saying it's equivalent. > > In my kernel debugging experience current tools (tracing, systemtap) > were rarely enough. > I always had to add my own printks through the code, recompile and reboot. > Often just to see that it's not the place where I want to print things > or it's too verbose. > Then I would adjust printks, recompile and reboot again. > That was slow and tedious, since I would be crashing things from time to time > just because skb doesn't always have a valid dev or I made a typo. > For debugging I do really need something quick and dirty that lets me > add my own printk > of whatever structs I want anywhere in the kernel without crashing it. > That's exactly what bpf tracing filters do. I recommend you to use perf-probe. That will give you an easy solution. :) Thank you, -- Masami HIRAMATSU IT Management Research Dept. 
Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu...@hitachi.com
Re: [RFC PATCH tip 0/5] tracing filters with BPF
hpa wrote: >> I can see there may be some setups which don't have a compiler >> (e.g. I know some people don't use systemtap because of that) >> But this needs a custom gcc install too as far as I understand. > > Yes... but no compiler and secure boot tend to go together, or at > least will in the future. (Maybe not: we're already experimenting with support for secureboot in systemtap.) - FChE
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov wrote: >> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote: >>> >>> Can you do some performance comparison compared to e.g. ktap? >>> How much faster is it? > > Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: > trace skb:kfree_skb { > if (arg2 == 0x100) { > printf("%x %x\n", arg1, arg2) > } > } > 1M skb alloc/free 350315 (usecs) > > baseline without any tracing: > 1M skb alloc/free 145400 (usecs) > > then equivalent bpf test: > void filter(struct bpf_context *ctx) > { > void *loc = (void *)ctx->regs.dx; > if (loc == 0x100) { > struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; > char fmt[] = "skb %p loc %p\n"; > bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); > } > } > 1M skb alloc/free 183214 (usecs) > > so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 > > obviously ktap is an interpreter, so it's not really fair. > > To make it really unfair I did: > trace skb:kfree_skb { > if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 > || > arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 > || > arg2 == 0x900 || arg2 == 0x1000) { > printf("%x %x\n", arg1, arg2) > } > } > 1M skb alloc/free 484280 (usecs) > I lost my mind for a while. :) If bpf only focuses on filtering, then it's not fair to compare with ktap like that, since ktap can easily make use of the current kernel filter; you should use the script below: trace skb:kfree_skb /location == 0x100 || location == 0x200 || .../ { printf("%x %x\n", arg1, arg2) } As ktap is a user of the current simple kernel tracing filter, I fully agree with Steven: "it can be an add on, but not a replacement." Thanks, Jovi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Fri, Dec 6, 2013 at 9:20 AM, Andi Kleen wrote: > "H. Peter Anvin" writes: >> >> Not to mention that in that case we might as well -- since we need a >> compiler anyway -- generate the machine code in user space; the JIT >> solution really only is useful if it can provide something that we can't >> do otherwise, e.g. enable it in secure boot environments. > > I can see there may be some setups which don't have a compiler > (e.g. I know some people don't use systemtap because of that) > But this needs a custom gcc install too as far as I understand. > If it depends on gcc, then it looks like Systemtap. Installing gcc is a big inconvenience for embedded environments and many production systems. (I'm not sure whether it needs a kernel compilation environment as well.) It seems the event filter is bound to a specific event, so it's not possible to trace many events in a cooperative style. Look at the Systemtap and ktap samples: many event handlers need to cooperate; the simplest example is recording syscall execution time (duration = exit - entry). If this design is intentional, then I would think it targets speeding up the current kernel tracing filter (but it needs an extra userspace filter compiler). And I guess bpf filters still need to keep userspace tracing in mind :), if they want to be a complete and integrated tracing solution. (Use a separate userspace compiler or translator to resolve symbols.) Thanks, Jovi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar wrote: > >> Also I'm thinking to add 'license_string' section to bpf binary format >> and call license_is_gpl_compatible() on it during load. >> If false, then just reject it…. not even messing with taint flags... >> That would be way stronger indication of bpf licensing terms than what >> we have for .ko > > But will BPF tools generate such gpl-compatible license tags by > default? If yes then this might work, combined with the facility > below. If not then it's just a nuisance to users. yes. similar to existing .ko module_license() tag. see below. > My concern would be solved by adding a facility to always be able to > dump source code as well, i.e. trivially transform it to C or so, so > that people can review it - or just edit it on the fly, recompile and > reinsert? Most BPF scripts ought to be pretty simple. C code has '#include' in them, so without storing fully preprocessed code it will not be equivalent. but then true source will be gigantic. Can be zipped, but that sounds like an overkill. Also we might want other languages with their own dependent includes. Sure, we can have a section in bpf binary that has the source, but it's not enforceable. Kernel cannot know that it's an actual source. gcc/llvm will produce different bpf code out of the same source. the source is in C or in language X, etc. Doesn't seem that including some form of source will help with enforcing the license. imo requiring module_license("gpl"); line in C code and equivalent string in all other languages that want to translate to bpf would be stronger indication of licensing terms. then compiler would have to include that string into 'license_string' section and kernel can actually enforce it.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Hi Alexei, On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov wrote: >> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote: >>> >>> Can you do some performance comparison compared to e.g. ktap? >>> How much faster is it? > > Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: > trace skb:kfree_skb { > if (arg2 == 0x100) { > printf("%x %x\n", arg1, arg2) > } > } > 1M skb alloc/free 350315 (usecs) > > baseline without any tracing: > 1M skb alloc/free 145400 (usecs) > > then equivalent bpf test: > void filter(struct bpf_context *ctx) > { > void *loc = (void *)ctx->regs.dx; > if (loc == 0x100) { > struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; > char fmt[] = "skb %p loc %p\n"; > bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); > } > } > 1M skb alloc/free 183214 (usecs) > > so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 > > obviously ktap is an interpreter, so it's not really fair. > > To make it really unfair I did: > trace skb:kfree_skb { > if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 > || > arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 > || > arg2 == 0x900 || arg2 == 0x1000) { > printf("%x %x\n", arg1, arg2) > } > } > 1M skb alloc/free 484280 (usecs) > > and corresponding bpf: > void filter(struct bpf_context *ctx) > { > void *loc = (void *)ctx->regs.dx; > if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 || > loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 || > loc == 0x900 || loc == 0x1000) { > struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; > char fmt[] = "skb %p loc %p\n"; > bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); > } > } > 1M skb alloc/free 185660 (usecs) > > the difference is bigger now: 484-145 vs 185-145 > There are big differences between comparing arg2 (in ktap) and direct register access (ctx->regs.dx).
The current argument fetching (arg2 in the above testcase) implementation in ktap is very inefficient; see ktap/interpreter/lib_kdebug.c:kp_event_getarg. The only way to speed it up is a kernel tracing code change that lets an external tracing module access event fields without a list lookup. This work is not started yet. :) Of course, I'm not saying this argument fetching issue is the root cause of the performance gap against bpf and Systemtap; bytecode execution speed won't compare with raw machine code. (There is a plan to use a JIT in the ktap core, like the luajit project, but it needs some time to work on.)
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen wrote: >> the difference is bigger now: 484-145 vs 185-145 > > This is a obvious improvement, but imho not big enough to be extremely > compelling (< cost 1-2 cache misses, no orders of magnitude improvements > that would justify a lot of code) hmm. we're comparing against ktap here… which has 5x more kernel code and 8x slower in this test... > Your code requires a compiler, so from my perspective it > wouldn't be a lot easier or faster to use than just changing > the code directly and recompile. > > The users want something simple too that shields them from > having to learn all the internals. They don't want to recompile. > As far as I can tell your code is a bit too low level for that, > and the requirement for the compiler may also scare them. > > Where exactly does it fit? the goal is to have llvm compiler next to perf, wrapped in a user friendly way. compiling small filter vs recompiling full kernel… inserting into live kernel vs rebooting … not sure how you're saying it's equivalent. In my kernel debugging experience current tools (tracing, systemtap) were rarely enough. I always had to add my own printks through the code, recompile and reboot. Often just to see that it's not the place where I want to print things or it's too verbose. Then I would adjust printks, recompile and reboot again. That was slow and tedious, since I would be crashing things from time to time just because skb doesn't always have a valid dev or I made a typo. For debugging I do really need something quick and dirty that lets me add my own printk of whatever structs I want anywhere in the kernel without crashing it. That's exactly what bpf tracing filters do.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 3:37 PM, Steven Rostedt wrote: > On Thu, 5 Dec 2013 14:36:58 -0800 > Alexei Starovoitov wrote: > >> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt wrote: >> > >> > I know that it would be great to have the bpf filter run before >> > recording of the tracepoint, but as that becomes quite awkward for a >> > user interface, because it requires intimate knowledge of the kernel >> > source, this speed up on the filter itself may be worth while to have >> > it happen after the recording of the buffer. When it happens after the >> > record, then the bpf has direct access to the event entry and its >> > fields as described by the trace event format files. >> >> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of >> the kernel'? By accessing pt_regs structure? Something else ? >> Can we try fixing the interface first before compromising on performance? > > Let me ask you this. If you do not have the source of the kernel on > hand, can you use BPF to filter the sched_switch tracepoint on prev pid? > > The current filter interface allows you to filter with just what the > running kernel provides. No need for debug info from the vmlinux or > anything else. Understood and agreed. For the users that are satisfied with amount of info that single trace_event provides (like sched_switch) there is probably little reason to do complex filtering. Either they're fine with all the events or will just filter based on pid only. > I'm fine if it becomes a requirement to have a vmlinux built with > DEBUG_INFO to use BPF and have a tool like perf to translate the > filters. But it that must not replace what the current filters do now. > That is, it can be an add on, but not a replacement. Of course. tracing filters via bpf is an additional tool for kernel debugging. bpf by itself has use cases beyond tracing. 
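The stock filter interface Steven refers to can be exercised with nothing but a running kernel: no BPF, no compiler, no DEBUG_INFO. A config fragment (assuming tracefs is mounted at the usual debugfs path; paths may differ on other setups):

```shell
# Filter sched_switch on prev_pid using the in-kernel filter language;
# usable field names come straight from the event's format file.
cd /sys/kernel/debug/tracing
cat events/sched/sched_switch/format          # lists prev_pid, next_pid, ...
echo 'prev_pid == 1234' > events/sched/sched_switch/filter
echo 1 > events/sched/sched_switch/enable
cat trace_pipe                                # only matching events appear
```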
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On 12/05/2013 05:20 PM, Andi Kleen wrote: > "H. Peter Anvin" writes: >> >> Not to mention that in that case we might as well -- since we need a >> compiler anyway -- generate the machine code in user space; the JIT >> solution really only is useful if it can provide something that we can't >> do otherwise, e.g. enable it in secure boot environments. > > I can see there may be some setups which don't have a compiler > (e.g. I know some people don't use systemtap because of that) > But this needs a custom gcc install too as far as I understand. > Yes... but no compiler and secure boot tend to go together, or at least will in the future. -hpa
Re: [RFC PATCH tip 0/5] tracing filters with BPF
"H. Peter Anvin" writes:
>
> Not to mention that in that case we might as well -- since we need a
> compiler anyway -- generate the machine code in user space; the JIT
> solution really only is useful if it can provide something that we can't
> do otherwise, e.g. enable it in secure boot environments.

I can see there may be some setups which don't have a compiler
(e.g. I know some people don't use systemtap because of that)
But this needs a custom gcc install too as far as I understand.

-Andi

--
a...@linux.intel.com -- Speaking for myself only
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On 12/05/2013 04:14 PM, Andi Kleen wrote:
>
> In my experience there are roughly two groups of trace users:
> kernel hackers and users. The kernel hackers want something
> convenient and fast, but for anything complicated or performance
> critical they can always hack the kernel to include custom
> instrumentation.

Not to mention that in that case we might as well -- since we need a
compiler anyway -- generate the machine code in user space; the JIT
solution really only is useful if it can provide something that we can't
do otherwise, e.g. enable it in secure boot environments.

	-hpa
Re: [RFC PATCH tip 0/5] tracing filters with BPF
> 1M skb alloc/free 185660 (usecs)
>
> the difference is bigger now: 484-145 vs 185-145

Thanks for the data. This is an obvious improvement, but imho not big
enough to be extremely compelling (< the cost of 1-2 cache misses, no
orders of magnitude improvements that would justify a lot of code).

One larger problem I have with your patchkit is where exactly it fits
with the user base.

In my experience there are roughly two groups of trace users: kernel
hackers and users. The kernel hackers want something convenient and
fast, but for anything complicated or performance critical they can
always hack the kernel to include custom instrumentation. Your code
requires a compiler, so from my perspective it wouldn't be a lot easier
or faster to use than just changing the code directly and recompiling.

The users want something simple too that shields them from having to
learn all the internals. They don't want to recompile. As far as I can
tell your code is a bit too low level for that, and the requirement for
the compiler may also scare them.

Where exactly does it fit?

-Andi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, 5 Dec 2013 14:36:58 -0800
Alexei Starovoitov wrote:

> On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt wrote:
> >
> > I know that it would be great to have the bpf filter run before
> > recording of the tracepoint, but as that becomes quite awkward for a
> > user interface, because it requires intimate knowledge of the kernel
> > source, this speed up on the filter itself may be worthwhile to have
> > it happen after the recording of the buffer. When it happens after the
> > record, then the bpf has direct access to the event entry and its
> > fields as described by the trace event format files.
>
> I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
> the kernel'? By accessing the pt_regs structure? Something else?
> Can we try fixing the interface first before compromising on performance?

Let me ask you this. If you do not have the source of the kernel on
hand, can you use BPF to filter the sched_switch tracepoint on prev pid?

The current filter interface allows you to filter with just what the
running kernel provides. No need for debug info from the vmlinux or
anything else. pt_regs is not that useful without having something to
translate what that means.

I'm fine if it becomes a requirement to have a vmlinux built with
DEBUG_INFO to use BPF and have a tool like perf to translate the
filters. But that must not replace what the current filters do now.
That is, it can be an add-on, but not a replacement.

-- Steve
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt wrote:
>
> I know that it would be great to have the bpf filter run before
> recording of the tracepoint, but as that becomes quite awkward for a
> user interface, because it requires intimate knowledge of the kernel
> source, this speed up on the filter itself may be worthwhile to have
> it happen after the recording of the buffer. When it happens after the
> record, then the bpf has direct access to the event entry and its
> fields as described by the trace event format files.

I don't understand that 'awkward' part yet. What do you mean by 'knowledge of
the kernel'? By accessing the pt_regs structure? Something else?
Can we try fixing the interface first before compromising on performance?
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 8:11 AM, Frank Ch. Eigler wrote:
>
> ast wrote:
>
>>> [...]
>> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>> trace skb:kfree_skb {
>>     if (arg2 == 0x100) {
>>         printf("%x %x\n", arg1, arg2)
>>     }
>> }
>> [...]
>
> For reference, you might try putting systemtap into the performance
> comparison matrix too:
>
> # stap -e 'probe kernel.trace("kfree_skb") {
>     if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
>         printf("%x %x\n", $skb, $location)
>     }
> }'

stap with one 'if': 1M skb alloc/free 200696 (usecs)
stap with 10 'if': 1M skb alloc/free 202135 (usecs)

so systemtap entry overhead is a bit higher than bpf and extra if-s show
the same progression as expected.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Andi Kleen writes:

> [...] While it sounds interesting, I would strongly advise to make
> this capability only available to root. Traditionally lots of
> complex byte code languages which were designed to be "safe" and
> verifiable weren't really. e.g. i managed to crash things with
> "safe" systemtap multiple times. [...]

Note that systemtap has never been a byte code language, that avenue
being considered lkml-futile at the time, but instead pure C. Its safety
comes from a mix of compiled-in checks (which you can inspect via
"stap -p3") and script-to-C translation checks (which are
self-explanatory). Its risks come from bugs in the checks (quite rare),
problems in the runtime library (rare), and problems in underlying
kernel facilities (rare or frequent - consider kprobes).

> So the likelihood of this having some hole somewhere (either in
> the byte code or in some library function) is high.

Very true!

- FChE
Re: [RFC PATCH tip 0/5] tracing filters with BPF
ast wrote:

>> [...]
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
> trace skb:kfree_skb {
>     if (arg2 == 0x100) {
>         printf("%x %x\n", arg1, arg2)
>     }
> }
> [...]

For reference, you might try putting systemtap into the performance
comparison matrix too:

# stap -e 'probe kernel.trace("kfree_skb") {
    if ($location == 0x100 /* || $location == 0x200 etc. */ ) {
        printf("%x %x\n", $skb, $location)
    }
}'

- FChE
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, 5 Dec 2013 11:41:13 +0100
Ingo Molnar wrote:

> > so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
> >
> > obviously ktap is an interpreter, so it's not really fair.
> >
> > To make it really unfair I did:
> > trace skb:kfree_skb {
> >     if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
> >         arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
> >         arg2 == 0x900 || arg2 == 0x1000) {
> >         printf("%x %x\n", arg1, arg2)
> >     }
> > }
> > 1M skb alloc/free 484280 (usecs)
>
> Real life scripts, for example the ones related to network protocol
> analysis, will often have such patterns in them, so I don't think this
> measurement is particularly unfair.

I agree. As the size of the if statement grows, the filter logic gets
linearly more expensive, but the bpf filter does not.

I know that it would be great to have the bpf filter run before
recording of the tracepoint, but as that becomes quite awkward for a
user interface, because it requires intimate knowledge of the kernel
source, this speed up on the filter itself may be worthwhile to have
it happen after the recording of the buffer. When it happens after the
record, then the bpf has direct access to the event entry and its
fields as described by the trace event format files.

-- Steve
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov wrote:

> > On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote:
> >>
> >> Can you do some performance comparison compared to e.g. ktap?
> >> How much faster is it?
>
> Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email:
>
>   trace skb:kfree_skb {
>       if (arg2 == 0x100) {
>           printf("%x %x\n", arg1, arg2)
>       }
>   }
>
> 1M skb alloc/free 350315 (usecs)
>
> baseline without any tracing:
> 1M skb alloc/free 145400 (usecs)
>
> then equivalent bpf test:
>
>   void filter(struct bpf_context *ctx)
>   {
>       void *loc = (void *)ctx->regs.dx;
>       if (loc == (void *)0x100) {
>           struct sk_buff *skb = (struct sk_buff *)ctx->regs.si;
>           char fmt[] = "skb %p loc %p\n";
>           bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0);
>       }
>   }
>
> 1M skb alloc/free 183214 (usecs)
>
> so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145
>
> obviously ktap is an interpreter, so it's not really fair.
>
> To make it really unfair I did:
>
>   trace skb:kfree_skb {
>       if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 ||
>           arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 ||
>           arg2 == 0x900 || arg2 == 0x1000) {
>           printf("%x %x\n", arg1, arg2)
>       }
>   }
>
> 1M skb alloc/free 484280 (usecs)

Real life scripts, for example the ones related to network protocol
analysis, will often have such patterns in them, so I don't think this
measurement is particularly unfair.

Thanks,

	Ingo
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov wrote:

> > I mean more than that, I mean the licensing of BPF filters a user
> > can find on his own system's kernel should be very clear: by the
> > act of loading a BPF script into the kernel the user doing the
> > 'upload' gives permission for it to be redistributed on
> > kernel-compatible license terms.
> >
> > The easiest way to achieve that is to make sure that all loaded
> > BPF scripts are 'registered' and are dumpable, viewable and
> > reusable. That's good for debugging and it's good for
> > transparency.
> >
> > This means a minimal BPF decoder will have to be in the kernel as
> > well, but that's OK, we actually have several x86 instruction
> > decoders in the kernel already, so there's no complexity threshold.
>
> sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in
> human readable format.
> I'll hook it up to trace_seq, so that "cat
> /sys/kernel/debug/.../filter" will dump it.
>
> Also I'm thinking to add a 'license_string' section to the bpf binary
> format and call license_is_gpl_compatible() on it during load.
> If false, then just reject it... not even messing with taint flags...
> That would be a way stronger indication of bpf licensing terms than
> what we have for .ko

But will BPF tools generate such gpl-compatible license tags by default?
If yes then this might work, combined with the facility below. If not
then it's just a nuisance to users.

Also, 'tainting' is a non-issue here, as we don't want the kernel to
load license-incompatible scripts at all. This should be made clear in
the design of the facility and the tooling itself.

> >> wow. I guess if the whole thing takes off, we would need an
> >> in-kernel directory to store upstreamed bpf filters as well :)
> >
> > I see no reason why not, but more importantly all currently loaded
> > BPF scripts should be dumpable, displayable and reusable in a
> > kernel license compatible fashion.
>
> ok. will add a global bpf list as well (was hesitating to do something
> like this because of the central lock)

A lock + list is no big issue here I think, we do such central lookup
locks all the time. If it ever becomes measurable it can be made
scalable via numerous techniques.

> and something in debugfs that dumps bodies of all currently loaded
> filters.
>
> Will that solve the concern?

My concern would be solved by adding a facility to always be able to
dump source code as well, i.e. trivially transform it to C or so, so
that people can review it - or just edit it on the fly, recompile and
reinsert? Most BPF scripts ought to be pretty simple.

(For example the most common way to load OpenGL shaders is to load the
GLSL source code and that source code can be queried after insertion as
well, so this is not an unusual model for small plugin-alike scriptlets.)

Thanks,

	Ingo
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov a...@plumgrid.com wrote: I mean more than that, I mean the licensing of BFP filters a user can find on his own system's kernel should be very clear: by the act of loading a BFP script into the kernel the user doing the 'upload' gives permission for it to be redistributed on kernel-compatible license terms. The easiest way to achieve that is to make sure that all loaded BFP scripts are 'registered' and are dumpable, viewable and reusable. That's good for debugging and it's good for transparency. This means a minimal BFP decoder will have to be in the kernel as well, but that's OK, we actually have several x86 instruction decoder in the kernel already, so there's no complexity threshold. sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in human readable format. I'll hook it up to trace_seq, so that cat /sys/kernel/debug/.../filter will dump it. Also I'm thinking to add 'license_string' section to bpf binary format and call license_is_gpl_compatible() on it during load. If false, then just reject it…. not even messing with taint flags... That would be way stronger indication of bpf licensing terms than what we have for .ko But will BFP tools generate such gpl-compatible license tags by default? If yes then this might work, combined with the facility below. If not then it's just a nuisance to users. Also, 'tainting' is a non-issue here, as we don't want the kernel to load license-incompatible scripts at all. This should be made clear in the design of the facility and the tooling itself. wow. I guess if the whole thing takes off, we would need an in-kernel directory to store upstreamed bpf filters as well :) I see no reason why not, but more importantly all currently loaded BFP scripts should be dumpable, displayable and reusable in a kernel license compatible fashion. ok. 
will add global bpf list as well (was hesitating to do something like this because of central lock) A lock + list is no big issue here I think, we do such central lookup locks all the time. If it ever becomes measurable it can be made scalable via numerous techniques. and something in debugfs that dumps bodies of all currently loaded filters. Will that solve the concern? My concern would be solved by adding a facility to always be able to dump source code as well, i.e. trivially transform it to C or so, so that people can review it - or just edit it on the fly, recompile and reinsert? Most BFP scripts ought to be pretty simple. (For example the most common way to load OpenGL shaders is to load the GLSL source code and that source code can be queried after insertion as well, so this is not an unusual model for small plugin-alike scriptlets.) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote: Can you do some performance comparison compared to e.g. ktap? How much faster is it? Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf(%x %x\n, arg1, arg2) } } 1M skb alloc/free 350315 (usecs) baseline without any tracing: 1M skb alloc/free 145400 (usecs) then equivalent bpf test: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx-regs.dx; if (loc == 0x100) { struct sk_buff *skb = (struct sk_buff *)ctx-regs.si; char fmt[] = skb %p loc %p\n; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 183214 (usecs) so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 obviously ktap is an interpreter, so it's not really fair. To make it really unfair I did: trace skb:kfree_skb { if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 || arg2 == 0x1000) { printf(%x %x\n, arg1, arg2) } } 1M skb alloc/free 484280 (usecs) Real life scripts, for examples the ones related to network protocol analysis will often have such patterns in them, so I don't think this measurement is particularly unfair. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, 5 Dec 2013 11:41:13 +0100 Ingo Molnar mi...@kernel.org wrote: so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 obviously ktap is an interpreter, so it's not really fair. To make it really unfair I did: trace skb:kfree_skb { if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 || arg2 == 0x1000) { printf(%x %x\n, arg1, arg2) } } 1M skb alloc/free 484280 (usecs) Real life scripts, for examples the ones related to network protocol analysis will often have such patterns in them, so I don't think this measurement is particularly unfair. I agree. As the size of the if statement grows, the filter logic gets lineally expensive, but the bpf filter does not. I know that it would be great to have the bpf filter run before recording of the tracepoint, but as that becomes quite awkward for a user interface, because it requires intimate knowledge of the kernel source, this speed up on the filter itself may be worth while to have it happen after the recording of the buffer. When it happens after the record, then the bpf has direct access to the event entry and its fields as described by the trace event format files. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
ast wrote: [...] Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf(%x %x\n, arg1, arg2) } } [...] For reference, you might try putting systemtap into the performance comparison matrix too: # stap -e 'probe kernel.trace(kfree_skb) { if ($location == 0x100 /* || $location == 0x200 etc. */ ) { printf(%x %x\n, $skb, $location) } }' - FChE -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Andi Kleen a...@firstfloor.org writes: [...] While it sounds interesting, I would strongly advise to make this capability only available to root. Traditionally lots of complex byte code languages which were designed to be safe and verifiable weren't really. e.g. i managed to crash things with safe systemtap multiple times. [...] Note that systemtap has never been a byte code language, that avenue being considered lkml-futile at the time, but instead pure C. Its safety comes from a mix of compiled-in checks (which you can inspect via stap -p3) and script-to-C translation checks (which are self-explanatory). Its risks come from bugs in the checks (quite rare), problems in the runtime library (rare), and problems in underlying kernel facilities (rare or frequent - consider kprobes). So the likelyhood of this having some hole somewhere (either in the byte code or in some library function) is high. Very true! - FChE -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 8:11 AM, Frank Ch. Eigler f...@redhat.com wrote: ast wrote: [...] Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf(%x %x\n, arg1, arg2) } } [...] For reference, you might try putting systemtap into the performance comparison matrix too: # stap -e 'probe kernel.trace(kfree_skb) { if ($location == 0x100 /* || $location == 0x200 etc. */ ) { printf(%x %x\n, $skb, $location) } }' stap with one 'if': 1M skb alloc/free 200696 (usecs) stap with 10 'if': 1M skb alloc/free 202135 (usecs) so systemtap entry overhead is a bit higher than bpf and extra if-s show the same progression as expected. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote: I know that it would be great to have the bpf filter run before recording of the tracepoint, but as that becomes quite awkward for a user interface, because it requires intimate knowledge of the kernel source, this speed up on the filter itself may be worth while to have it happen after the recording of the buffer. When it happens after the record, then the bpf has direct access to the event entry and its fields as described by the trace event format files. I don't understand that 'awkward' part yet. What do you mean by 'knowledge of the kernel'? By accessing pt_regs structure? Something else ? Can we try fixing the interface first before compromising on performance? -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, 5 Dec 2013 14:36:58 -0800 Alexei Starovoitov a...@plumgrid.com wrote: On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote: I know that it would be great to have the bpf filter run before recording of the tracepoint, but as that becomes quite awkward for a user interface, because it requires intimate knowledge of the kernel source, this speed up on the filter itself may be worth while to have it happen after the recording of the buffer. When it happens after the record, then the bpf has direct access to the event entry and its fields as described by the trace event format files. I don't understand that 'awkward' part yet. What do you mean by 'knowledge of the kernel'? By accessing pt_regs structure? Something else ? Can we try fixing the interface first before compromising on performance? Let me ask you this. If you do not have the source of the kernel on hand, can you use BPF to filter the sched_switch tracepoint on prev pid? The current filter interface allows you to filter with just what the running kernel provides. No need for debug info from the vmlinux or anything else. pt_regs is not that useful without having something to translate what that means. I'm fine if it becomes a requirement to have a vmlinux built with DEBUG_INFO to use BPF and have a tool like perf to translate the filters. But it that must not replace what the current filters do now. That is, it can be an add on, but not a replacement. -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
1M skb alloc/free 185660 (usecs) the difference is bigger now: 484-145 vs 185-145 Thanks for the data. This is a obvious improvement, but imho not big enough to be extremely compelling ( cost 1-2 cache misses, no orders of magnitude improvements that would justify a lot of code) One larger problem I have with your patchkit is where exactly it fits with the user base. In my experience there are roughly two groups of trace users: kernel hackers and users. The kernel hackers want something convenient and fast, but for anything complicated or performance critical they can always hack the kernel to include custom instrumentation. Your code requires a compiler, so from my perspective it wouldn't be a lot easier or faster to use than just changing the code directly and recompile. The users want something simple too that shields them from having to learn all the internals. They don't want to recompile. As far as I can tell your code is a bit too low level for that, and the requirement for the compiler may also scare them. Where exactly does it fit? -Andi -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On 12/05/2013 04:14 PM, Andi Kleen wrote: In my experience there are roughly two groups of trace users: kernel hackers and users. The kernel hackers want something convenient and fast, but for anything complicated or performance critical they can always hack the kernel to include custom instrumentation. Not to mention that in that case we might as well -- since we need a compiler anyway -- generate the machine code in user space; the JIT solution really only is useful if it can provide something that we can't do otherwise, e.g. enable it in secure boot environments. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
H. Peter Anvin h...@zytor.com writes: Not to mention that in that case we might as well -- since we need a compiler anyway -- generate the machine code in user space; the JIT solution really only is useful if it can provide something that we can't do otherwise, e.g. enable it in secure boot environments. I can see there may be some setups which don't have a compiler (e.g. I know some people don't use systemtap because of that) But this needs a custom gcc install too as far as I understand. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On 12/05/2013 05:20 PM, Andi Kleen wrote: H. Peter Anvin h...@zytor.com writes: Not to mention that in that case we might as well -- since we need a compiler anyway -- generate the machine code in user space; the JIT solution really only is useful if it can provide something that we can't do otherwise, e.g. enable it in secure boot environments. I can see there may be some setups which don't have a compiler (e.g. I know some people don't use systemtap because of that) But this needs a custom gcc install too as far as I understand. Yes... but no compiler and secure boot tend to go together, or at least will in the future. -hpa -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 3:37 PM, Steven Rostedt rost...@goodmis.org wrote: On Thu, 5 Dec 2013 14:36:58 -0800 Alexei Starovoitov a...@plumgrid.com wrote: On Thu, Dec 5, 2013 at 5:46 AM, Steven Rostedt rost...@goodmis.org wrote: I know that it would be great to have the bpf filter run before recording of the tracepoint, but as that becomes quite awkward for a user interface, because it requires intimate knowledge of the kernel source, this speed up on the filter itself may be worth while to have it happen after the recording of the buffer. When it happens after the record, then the bpf has direct access to the event entry and its fields as described by the trace event format files. I don't understand that 'awkward' part yet. What do you mean by 'knowledge of the kernel'? By accessing pt_regs structure? Something else ? Can we try fixing the interface first before compromising on performance? Let me ask you this. If you do not have the source of the kernel on hand, can you use BPF to filter the sched_switch tracepoint on prev pid? The current filter interface allows you to filter with just what the running kernel provides. No need for debug info from the vmlinux or anything else. Understood and agreed. For the users that are satisfied with amount of info that single trace_event provides (like sched_switch) there is probably little reason to do complex filtering. Either they're fine with all the events or will just filter based on pid only. I'm fine if it becomes a requirement to have a vmlinux built with DEBUG_INFO to use BPF and have a tool like perf to translate the filters. But it that must not replace what the current filters do now. That is, it can be an add on, but not a replacement. Of course. tracing filters via bpf is an additional tool for kernel debugging. bpf by itself has use cases beyond tracing. 
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 5:20 PM, Andi Kleen a...@firstfloor.org wrote: the difference is bigger now: 484-145 vs 185-145 This is an obvious improvement, but imho not big enough to be extremely compelling (cost 1-2 cache misses, no orders of magnitude improvements that would justify a lot of code) hmm. we're comparing against ktap here, which has 5x more kernel code and is 8x slower in this test... Your code requires a compiler, so from my perspective it wouldn't be a lot easier or faster to use than just changing the code directly and recompiling. The users want something simple too that shields them from having to learn all the internals. They don't want to recompile. As far as I can tell your code is a bit too low level for that, and the requirement for the compiler may also scare them. Where exactly does it fit? the goal is to have an llvm compiler next to perf, wrapped in a user friendly way. compiling a small filter vs recompiling the full kernel, inserting into a live kernel vs rebooting: not sure how you're saying it's equivalent. In my kernel debugging experience current tools (tracing, systemtap) were rarely enough. I always had to add my own printks through the code, recompile and reboot. Often just to see that it's not the place where I want to print things or it's too verbose. Then I would adjust printks, recompile and reboot again. That was slow and tedious, since I would be crashing things from time to time just because skb doesn't always have a valid dev or I made a typo. For debugging I do really need something quick and dirty that lets me add my own printk of whatever structs I want anywhere in the kernel without crashing it. That's exactly what bpf tracing filters do.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Hi Alexei, On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote: Can you do some performance comparison compared to e.g. ktap? How much faster is it? Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 350315 (usecs) baseline without any tracing: 1M skb alloc/free 145400 (usecs) then equivalent bpf test: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx->regs.dx; if (loc == 0x100) { struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; char fmt[] = "skb %p loc %p\n"; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 183214 (usecs) so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 obviously ktap is an interpreter, so it's not really fair. To make it really unfair I did: trace skb:kfree_skb { if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 || arg2 == 0x1000) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 484280 (usecs) and corresponding bpf: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx->regs.dx; if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 || loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 || loc == 0x900 || loc == 0x1000) { struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; char fmt[] = "skb %p loc %p\n"; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 185660 (usecs) the difference is bigger now: 484-145 vs 185-145 There is a big difference between fetching arg2 (in ktap) and direct register access (ctx->regs.dx). The current argument fetching (arg2 in the above testcase) implementation in ktap is very inefficient; see ktap/interpreter/lib_kdebug.c:kp_event_getarg. 
The only way to speed it up is a kernel tracing code change: let external tracing modules access event fields not through a list lookup. That work has not started yet. :) Of course, I'm not saying this argument fetching issue is the performance root cause compared with bpf and Systemtap; bytecode execution speed won't compare with raw machine code. (There is a plan to use a JIT in the ktap core, like the luajit project, but it needs some time to work on)
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 2:38 AM, Ingo Molnar mi...@kernel.org wrote: Also I'm thinking to add 'license_string' section to bpf binary format and call license_is_gpl_compatible() on it during load. If false, then just reject it... not even messing with taint flags... That would be a way stronger indication of bpf licensing terms than what we have for .ko But will BPF tools generate such gpl-compatible license tags by default? If yes then this might work, combined with the facility below. If not then it's just a nuisance to users. yes. similar to existing .ko module_license() tag. see below. My concern would be solved by adding a facility to always be able to dump source code as well, i.e. trivially transform it to C or so, so that people can review it - or just edit it on the fly, recompile and reinsert? Most BPF scripts ought to be pretty simple. C code has '#include' in them, so without storing fully preprocessed code it will not be equivalent. but then the true source will be gigantic. Can be zipped, but that sounds like overkill. Also we might want other languages with their own dependent includes. Sure, we can have a section in the bpf binary that has the source, but it's not enforceable. The kernel cannot know that it's the actual source. gcc/llvm will produce different bpf code out of the same source. the source is in C or in language X, etc. Doesn't seem that including some form of source will help with enforcing the license. imo requiring a module_license(gpl); line in C code, and an equivalent string in all other languages that want to translate to bpf, would be a stronger indication of licensing terms. then the compiler would have to include that string in the 'license_string' section and the kernel can actually enforce it.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Fri, Dec 6, 2013 at 9:20 AM, Andi Kleen a...@firstfloor.org wrote: H. Peter Anvin h...@zytor.com writes: Not to mention that in that case we might as well -- since we need a compiler anyway -- generate the machine code in user space; the JIT solution really only is useful if it can provide something that we can't do otherwise, e.g. enable it in secure boot environments. I can see there may be some setups which don't have a compiler (e.g. I know some people don't use systemtap because of that) But this needs a custom gcc install too as far as I understand. If it depends on gcc, then it looks like Systemtap. That is a big inconvenience for embedded environments and many production systems, where installing gcc is not an option. (not sure if it needs a kernel compilation environment as well) It seems the event filter is bound to a specific event, so it's not possible to trace many events in a cooperative style; look at the Systemtap and ktap samples, where many event handlers need to cooperate. The simplest example is recording syscall execution time (duration of exit - entry). If this design is intentional, then I would think it's targeted at speeding up the current kernel tracing filter (but needs an extra userspace filter compiler). And I guess bpf filters still need to take userspace tracing into account :), if they want to be a complete and integrated tracing solution. (use a separate userspace compiler or translator to resolve symbols) Thanks Jovi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Thu, Dec 5, 2013 at 12:40 PM, Alexei Starovoitov a...@plumgrid.com wrote: On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote: Can you do some performance comparison compared to e.g. ktap? How much faster is it? Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 350315 (usecs) baseline without any tracing: 1M skb alloc/free 145400 (usecs) then equivalent bpf test: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx->regs.dx; if (loc == 0x100) { struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; char fmt[] = "skb %p loc %p\n"; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 183214 (usecs) so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 obviously ktap is an interpreter, so it's not really fair. To make it really unfair I did: trace skb:kfree_skb { if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 || arg2 == 0x1000) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 484280 (usecs) I've lost my mind for a while. :) If bpf only focuses on filtering, then it's not good to compare with ktap like that, since ktap can easily make use of the current kernel filter; you should use the script below: trace skb:kfree_skb /location == 0x100 || location == 0x200 || .../ { printf("%x %x\n", arg1, arg2) } As ktap is a user of the current simple kernel tracing filter, I fully agree with Steven: it can be an add-on, but not a replacement. Thanks, Jovi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
> On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote: >> >> Can you do some performance comparison compared to e.g. ktap? >> How much faster is it? Did simple ktap test with 1M alloc_skb/kfree_skb toy test from earlier email: trace skb:kfree_skb { if (arg2 == 0x100) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 350315 (usecs) baseline without any tracing: 1M skb alloc/free 145400 (usecs) then equivalent bpf test: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx->regs.dx; if (loc == 0x100) { struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; char fmt[] = "skb %p loc %p\n"; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 183214 (usecs) so with one 'if' condition the difference ktap vs bpf is 350-145 vs 183-145 obviously ktap is an interpreter, so it's not really fair. To make it really unfair I did: trace skb:kfree_skb { if (arg2 == 0x100 || arg2 == 0x200 || arg2 == 0x300 || arg2 == 0x400 || arg2 == 0x500 || arg2 == 0x600 || arg2 == 0x700 || arg2 == 0x800 || arg2 == 0x900 || arg2 == 0x1000) { printf("%x %x\n", arg1, arg2) } } 1M skb alloc/free 484280 (usecs) and corresponding bpf: void filter(struct bpf_context *ctx) { void *loc = (void *)ctx->regs.dx; if (loc == 0x100 || loc == 0x200 || loc == 0x300 || loc == 0x400 || loc == 0x500 || loc == 0x600 || loc == 0x700 || loc == 0x800 || loc == 0x900 || loc == 0x1000) { struct sk_buff *skb = (struct sk_buff *)ctx->regs.si; char fmt[] = "skb %p loc %p\n"; bpf_trace_printk(fmt, sizeof(fmt), (long)skb, (long)loc, 0); } } 1M skb alloc/free 185660 (usecs) the difference is bigger now: 484-145 vs 185-145 9 extra 'if' conditions for bpf is almost nothing, since they translate into 18 new x86 instructions after JITing, but for interpreter it's obviously costly. Why 0x100 instead of 0x1? To make sure that compiler doesn't optimize them into < > Otherwise it's really really unfair. ktap is a nice tool. Great job Jovi! 
I noticed that it doesn't always clear created kprobes after run and I see a bunch of .../tracing/events/ktap_kprobes_xxx, but that's a minor thing. Thanks Alexei
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Wed, Dec 4, 2013 at 1:34 AM, Ingo Molnar wrote: > > * Alexei Starovoitov wrote: > >> On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar wrote: >> > >> > Very cool! (Added various other folks who might be interested in >> > this to the Cc: list.) >> > >> > I have one generic concern: >> > >> > It would be important to make it easy to extract loaded BPF code >> > from the kernel in source code equivalent form, which compiles to >> > the same BPF code. >> > >> > I.e. I think it would be fundamentally important to make sure that >> > this is all within the kernel's license domain, to make it very >> > clear there can be no 'binary only' BPF scripts. >> > >> > By up-loading BPF into a kernel the person loading it agrees to >> > make that code available to all users of that system who can >> > access it, under the same license as the kernel's code (or under a >> > more permissive license). >> > >> > The last thing we want is people getting funny ideas and writing >> > drivers in BPF and hiding the code or making license claims over >> > it >> >> all makes sense. In case of kernel modules all export_symbols are >> accessible and module has to have kernel compatible license. Same >> licensing terms apply to anything else that interacts with kernel >> functions. In case of BPF the list of accessible functions is tiny, >> so it's much easier to enforce specific limited use case. For >> tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even >> if someone has funny ideas they cannot be brought to life, since >> drivers need a lot more than this set of functions and BPF checker >> will reject any attempts to call something outside of this tiny >> list. imo the same applies to existing BPF as well. Meaning that >> tcpdump filter string and seccomp filters, if distributed, has to >> have their source code available. 
> > I mean more than that, I mean the licensing of BPF filters a user can > find on his own system's kernel should be very clear: by the act of > loading a BPF script into the kernel the user doing the 'upload' gives > permission for it to be redistributed on kernel-compatible license > terms. > > The easiest way to achieve that is to make sure that all loaded BPF > scripts are 'registered' and are dumpable, viewable and reusable. > That's good for debugging and it's good for transparency. > > This means a minimal BPF decoder will have to be in the kernel as > well, but that's OK, we actually have several x86 instruction decoders > in the kernel already, so there's no complexity threshold. sure. there is pr_info_bpf_insn() in bpf_run.c that dumps bpf insn in human readable format. I'll hook it up to trace_seq, so that "cat /sys/kernel/debug/.../filter" will dump it. Also I'm thinking to add 'license_string' section to bpf binary format and call license_is_gpl_compatible() on it during load. If false, then just reject it... not even messing with taint flags... That would be a way stronger indication of bpf licensing terms than what we have for .ko >> wow. I guess if the whole thing takes off, we would need an >> in-kernel directory to store upstreamed bpf filters as well :) > > I see no reason why not, but more importantly all currently loaded BPF > scripts should be dumpable, displayable and reusable in a kernel > license compatible fashion. ok. will add a global bpf list as well (was hesitating to do something like this because of the central lock) and something in debugfs that dumps bodies of all currently loaded filters. Will that solve the concern? Thanks Alexei
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov wrote: > On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar wrote: > > > > Very cool! (Added various other folks who might be interested in > > this to the Cc: list.) > > > > I have one generic concern: > > > > It would be important to make it easy to extract loaded BPF code > > from the kernel in source code equivalent form, which compiles to > > the same BPF code. > > > > I.e. I think it would be fundamentally important to make sure that > > this is all within the kernel's license domain, to make it very > > clear there can be no 'binary only' BPF scripts. > > > > By up-loading BPF into a kernel the person loading it agrees to > > make that code available to all users of that system who can > > access it, under the same license as the kernel's code (or under a > > more permissive license). > > > > The last thing we want is people getting funny ideas and writing > > drivers in BPF and hiding the code or making license claims over > > it > > all makes sense. In case of kernel modules all export_symbols are > accessible and module has to have kernel compatible license. Same > licensing terms apply to anything else that interacts with kernel > functions. In case of BPF the list of accessible functions is tiny, > so it's much easier to enforce specific limited use case. For > tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even > if someone has funny ideas they cannot be brought to life, since > drivers need a lot more than this set of functions and BPF checker > will reject any attempts to call something outside of this tiny > list. imo the same applies to existing BPF as well. Meaning that > tcpdump filter string and seccomp filters, if distributed, has to > have their source code available. 
I mean more than that, I mean the licensing of BPF filters a user can find on his own system's kernel should be very clear: by the act of loading a BPF script into the kernel the user doing the 'upload' gives permission for it to be redistributed on kernel-compatible license terms. The easiest way to achieve that is to make sure that all loaded BPF scripts are 'registered' and are dumpable, viewable and reusable. That's good for debugging and it's good for transparency. This means a minimal BPF decoder will have to be in the kernel as well, but that's OK, we actually have several x86 instruction decoders in the kernel already, so there's no complexity threshold. > > I.e. we want to allow flexible plugins technologically, but make > sure people who run into such a plugin can modify and improve it > under the same license as they can modify and improve the kernel > itself! > > wow. I guess if the whole thing takes off, we would need an > in-kernel directory to store upstreamed bpf filters as well :) I see no reason why not, but more importantly all currently loaded BPF scripts should be dumpable, displayable and reusable in a kernel license compatible fashion. Thanks, Ingo
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen wrote: > Alexei Starovoitov writes: > > Can you do some performance comparison compared to e.g. ktap? > How much faster is it? imo the most interesting ktap scripts (like kmalloc-top.kp) need tables and timers. tables are almost ready for prime time, but timers I prefer to keep out of kernel. I would like bpf filter to fill tables with interesting data in kernel up to predefined limit and periodically read and clear the tables from userspace. This way I will be able to do nettop.stp, iotop.stp like programs. So I'm still thinking what should be clean kernel/user interface for bpf-defined tables. Format of keys and elements of the table is defined within bpf program. During load of bpf program, the tables are allocated and bpf program can now lookup/update into them. At the same time corresponding userspace program can read tables of this particular bpf program over netlink. Creating its own debugfs files for every filter feels too slow and feature limited, since files are all or nothing interface. Netlink access to bpf tables feels cleaner. Userspace will use libmnl to access them. Other ideas? In the mean time I'll do some simple trace probe:xx { print } performance test… > While it sounds interesting, I would strongly advise to make this > capability only available to root. Traditionally lots of complex byte > code languages which were designed to be "safe" and verifiable weren't > really. e.g. i managed to crash things with "safe" systemtap multiple > times. And we all know what happened to Java. > > So the likelyhood of this having some hole somewhere (either in > the byte code or in some library function) is high. Tracing filters are for root only today and should stay this way. As far as safety of bpf… hard to argue systemtap point ;) Though existing bpf is generally accepted to be safe. extended bpf needs time to prove itself. 
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/04 3:26), Alexei Starovoitov wrote: > On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt wrote: >> On Tue, 3 Dec 2013 10:16:55 +0100 >> Ingo Molnar wrote: >> >> >>> So, to do the math: >>> >>>tracing 'all' overhead: 95 nsecs per event >>>tracing 'eth5 + old filter' overhead: 157 nsecs per event >>>tracing 'eth5 + BPF filter' overhead: 54 nsecs per event >>> >>> So via BPF and a fairly trivial filter, we are able to reduce tracing >>> overhead for real - while old-style filters. >> >> Yep, seems that BPF can do what I wasn't able to do with the normal >> filters. Although, I haven't looked at the code yet, I'm assuming that >> the BPF works on the parameters passed into the trace event. The normal >> filters can only process the results of the trace (what's being >> recorded) not the parameters of the trace event itself. To get what's >> recorded, we need to write to the buffer first, and then we decided if >> we want to keep the event or not and discard the event from the buffer >> if we do not. >> >> That method does not reduce overhead at all, and only adds to it, as >> Alexei's tests have shown. The purpose of the filter was not to reduce >> overhead, but to reduce filling the buffer with needless data. > > Precisely. > Assumption is that filters will filter out majority of the events. > So filter takes pt_regs as input, has to interpret them and call > bpf_trace_printk > if it really wants to store something for the human to see. > We can extend bpf trace filters to return true/false to indicate > whether TP_printk-format > specified as part of the event should be printed as well, but imo > that's unnecessary. > When I was using bpf filters to debug networking bits I didn't need > that printk format of the event. I only used event as an entry point, > filtering out things and printing different fields vs initial event. > More like what developers do when they sprinkle > trace_printk/dump_stack through the code while debugging. 
> > the only inconvenience so far is to know how parameters are getting > into registers. > on x86-64, arg1 is in rdi, arg2 is in rsi,... I want to improve that > after first step is done. Actually, that part is done by the perf-probe and ftrace dynamic events (kernel/trace/trace_probe.c). I think this generic BPF is good for re-implementing fetch methods. :) Thank you, -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu...@hitachi.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Alexei Starovoitov writes: Can you do some performance comparison compared to e.g. ktap? How much faster is it? While it sounds interesting, I would strongly advise to make this capability only available to root. Traditionally lots of complex byte code languages which were designed to be "safe" and verifiable weren't really. e.g. I managed to crash things with "safe" systemtap multiple times. And we all know what happened to Java. So the likelihood of this having some hole somewhere (either in the byte code or in some library function) is high. -Andi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt wrote: > On Tue, 3 Dec 2013 10:16:55 +0100 > Ingo Molnar wrote: > > >> So, to do the math: >> >> tracing 'all' overhead: 95 nsecs per event >> tracing 'eth5 + old filter' overhead: 157 nsecs per event >> tracing 'eth5 + BPF filter' overhead: 54 nsecs per event >> >> So via BPF and a fairly trivial filter, we are able to reduce tracing >> overhead for real - while old-style filters increase it. > > Yep, seems that BPF can do what I wasn't able to do with the normal > filters. Although I haven't looked at the code yet, I'm assuming that > the BPF works on the parameters passed into the trace event. The normal > filters can only process the results of the trace (what's being > recorded), not the parameters of the trace event itself. To get what's > recorded, we need to write to the buffer first, and then we decide > whether we want to keep the event or not and discard the event from the buffer > if we do not. > > That method does not reduce overhead at all, and only adds to it, as > Alexei's tests have shown. The purpose of the filter was not to reduce > overhead, but to reduce filling the buffer with needless data. Precisely. The assumption is that filters will filter out the majority of the events. So the filter takes pt_regs as input, has to interpret them and call bpf_trace_printk if it really wants to store something for the human to see. We can extend bpf trace filters to return true/false to indicate whether the TP_printk format specified as part of the event should be printed as well, but imo that's unnecessary. When I was using bpf filters to debug networking bits I didn't need that printk format of the event. I only used the event as an entry point, filtering out things and printing different fields vs the initial event. More like what developers do when they sprinkle trace_printk/dump_stack through the code while debugging. The only inconvenience so far is to know how parameters are getting into registers: on x86-64, arg1 is in rdi, arg2 is in rsi, ... 
I want to improve that after first step is done. In the proposed patches bpf_context == pt_regs at the event entry point. Would be cleaner to have struct {arg1,arg2,…} as bpf_context instead. But that needed more code and I wanted to keep the first patch to the minimum.
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar wrote: > > Very cool! (Added various other folks who might be interested in this > to the Cc: list.) > > I have one generic concern: > > It would be important to make it easy to extract loaded BPF code from > the kernel in source code equivalent form, which compiles to the same > BPF code. > > I.e. I think it would be fundamentally important to make sure that > this is all within the kernel's license domain, to make it very clear > there can be no 'binary only' BPF scripts. > > By up-loading BPF into a kernel the person loading it agrees to make > that code available to all users of that system who can access it, > under the same license as the kernel's code (or under a more > permissive license). > > The last thing we want is people getting funny ideas and writing > drivers in BPF and hiding the code or making license claims over it ... all makes sense. In case of kernel modules all exported symbols are accessible and the module has to have a kernel-compatible license. The same licensing terms apply to anything else that interacts with kernel functions. In case of BPF the list of accessible functions is tiny, so it's much easier to enforce a specific limited use case. For tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even if someone has funny ideas they cannot be brought to life, since drivers need a lot more than this set of functions and the BPF checker will reject any attempt to call something outside of this tiny list. imo the same applies to existing BPF as well. Meaning that tcpdump filter strings and seccomp filters, if distributed, have to have their source code available. > I.e. we want to allow flexible plugins technologically, but make sure > people who run into such a plugin can modify and improve it under the > same license as they can modify and improve the kernel itself! wow. 
I guess if the whole thing takes off, we would need an in-kernel directory to store upstreamed bpf filters as well :) >> opcode encoding is the same between old BPF and extended BPF. >> Original BPF has two 32-bit registers. >> Extended BPF has ten 64-bit registers. >> That is the main difference. >> >> Old BPF was using jt/jf fields for jump-insn only. >> New BPF combines them into generic 'off' field for jump and non-jump insns. >> k==imm field has the same meaning. > > This only affects the internal JIT representation, not the BPF byte > code, right? that is the ebpf vs bpf code difference. The JIT doesn't keep another representation; it just converts it to x86. >> 32 files changed, 3332 insertions(+), 24 deletions(-) > > Impressive! > > I'm wondering, will the new nftable code in works make use of the BPF > JIT as well, or is that a separate implementation? nft is a much higher level state machine customized for the specific nftable use case. imo iptables/nftable rules can be compiled into extended bpf. One needs to define bpf_context and a set of functions to do packet lookup via bpf_callbacks... but let's do it one step at a time. Thanks Alexei
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, 3 Dec 2013 10:16:55 +0100 Ingo Molnar wrote: > So, to do the math: > > tracing 'all' overhead: 95 nsecs per event > tracing 'eth5 + old filter' overhead: 157 nsecs per event > tracing 'eth5 + BPF filter' overhead: 54 nsecs per event > > So via BPF and a fairly trivial filter, we are able to reduce tracing > overhead for real - while old-style filters increase it. Yep, seems that BPF can do what I wasn't able to do with the normal filters. Although I haven't looked at the code yet, I'm assuming that the BPF works on the parameters passed into the trace event. The normal filters can only process the results of the trace (what's being recorded), not the parameters of the trace event itself. To get what's recorded, we need to write to the buffer first, and then we decide whether we want to keep the event or not and discard the event from the buffer if we do not. That method does not reduce overhead at all, and only adds to it, as Alexei's tests have shown. The purpose of the filter was not to reduce overhead, but to reduce filling the buffer with needless data. It looks as if the BPF filter works on the parameters of the trace event and not what is written to the buffers (as they can be different). I've been looking for a way to do just that, and if this does accomplish it, I'll be very happy :-) -- Steve
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/03 13:28), Alexei Starovoitov wrote: > Hi All, > > the following set of patches adds BPF support to trace filters. > > Trace filters can be written in C and allow safe read-only access to any > kernel data structure. Like systemtap but with safety guaranteed by kernel. > > The user can do: > cat bpf_program > /sys/kernel/debug/tracing/.../filter > if tracing event is either static or dynamic via kprobe_events. Oh, thank you for this great work! :D > > The filter program may look like: > void filter(struct bpf_context *ctx) > { > char devname[4] = "eth5"; > struct net_device *dev; > struct sk_buff *skb = 0; > > dev = (struct net_device *)ctx->regs.si; > if (bpf_memcmp(dev->name, devname, 4) == 0) { > char fmt[] = "skb %p dev %p eth5\n"; > bpf_trace_printk(fmt, skb, dev, 0, 0); > } > } > > The kernel will do static analysis of bpf program to make sure that it cannot > crash the kernel (doesn't have loops, valid memory/register accesses, etc). > Then kernel will map bpf instructions to x86 instructions and let it > run in the place of trace filter. 
> > To demonstrate performance I did a synthetic test: > dev = init_net.loopback_dev; > do_gettimeofday(&start_tv); > for (i = 0; i < 1000000; i++) { > struct sk_buff *skb; > skb = netdev_alloc_skb(dev, 128); > kfree_skb(skb); > } > do_gettimeofday(&end_tv); > time = end_tv.tv_sec - start_tv.tv_sec; > time *= USEC_PER_SEC; > time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec); > > printk("1M skb alloc/free %lld (usecs)\n", time); > > no tracing > [ 33.450966] 1M skb alloc/free 145179 (usecs) > > echo 1 > enable > [ 97.186379] 1M skb alloc/free 240419 (usecs) > (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit) > > echo 'name==eth5' > filter > [ 139.644161] 1M skb alloc/free 302552 (usecs) > (running filter_match_preds() for every skb and discarding the > event_buffer is even slower) > > cat bpf_prog > filter > [ 171.150566] 1M skb alloc/free 199463 (usecs) > (JITed bpf program is safely checking dev->name == eth5 and discarding) > > echo 0 > enable > [ 258.073593] 1M skb alloc/free 144919 (usecs) > (tracing is disabled, performance is back to original) > > The C program compiled into BPF and then JITed into x86 is faster than > the filter_match_preds() approach (199-145 msec vs 302-145 msec) Great! :) > tracing+bpf is a tool for safe read-only access to variables without recompiling > the kernel and without affecting running programs. Hmm, this feature and trace-event trigger actions can give us powerful on-the-fly scripting functionality... > BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c) > or better compiled from restricted C via GCC or LLVM > > Q: What is the difference between existing BPF and extended BPF? 
> A: > Existing BPF insn from uapi/linux/filter.h > struct sock_filter { > __u16 code; /* Actual filter code */ > __u8 jt; /* Jump true */ > __u8 jf; /* Jump false */ > __u32 k; /* Generic multiuse field */ > }; > > Extended BPF insn from linux/bpf.h > struct bpf_insn { > __u8 code; /* opcode */ > __u8 a_reg:4; /* dest register */ > __u8 x_reg:4; /* source register */ > __s16 off; /* signed offset */ > __s32 imm; /* signed immediate constant */ > }; > > opcode encoding is the same between old BPF and extended BPF. > Original BPF has two 32-bit registers. > Extended BPF has ten 64-bit registers. > That is the main difference. > > Old BPF was using jt/jf fields for jump-insn only. > New BPF combines them into generic 'off' field for jump and non-jump insns. > k==imm field has the same meaning. Looks very interesting. :) Thank you! -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu...@hitachi.com
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov wrote: > Hi All, > > the following set of patches adds BPF support to trace filters. > > Trace filters can be written in C and allow safe read-only access to > any kernel data structure. Like systemtap but with safety guaranteed > by kernel. Very cool! (Added various other folks who might be interested in this to the Cc: list.) I have one generic concern: It would be important to make it easy to extract loaded BPF code from the kernel in source code equivalent form, which compiles to the same BPF code. I.e. I think it would be fundamentally important to make sure that this is all within the kernel's license domain, to make it very clear there can be no 'binary only' BPF scripts. By up-loading BPF into a kernel the person loading it agrees to make that code available to all users of that system who can access it, under the same license as the kernel's code (or under a more permissive license). The last thing we want is people getting funny ideas and writing drivers in BPF and hiding the code or making license claims over it ... I.e. we want to allow flexible plugins technologically, but make sure people who run into such a plugin can modify and improve it under the same license as they can modify and improve the kernel itself! [ People can still 'hide' their sekrit plugins if they want to, by not distributing them to anyone who'd redistribute it widely. ] > The user can do: > cat bpf_program > /sys/kernel/debug/tracing/.../filter > if tracing event is either static or dynamic via kprobe_events. 
> > The filter program may look like: > void filter(struct bpf_context *ctx) > { > char devname[4] = "eth5"; > struct net_device *dev; > struct sk_buff *skb = 0; > > dev = (struct net_device *)ctx->regs.si; > if (bpf_memcmp(dev->name, devname, 4) == 0) { > char fmt[] = "skb %p dev %p eth5\n"; > bpf_trace_printk(fmt, skb, dev, 0, 0); > } > } > > The kernel will do static analysis of bpf program to make sure that > it cannot crash the kernel (doesn't have loops, valid > memory/register accesses, etc). Then kernel will map bpf > instructions to x86 instructions and let it run in the place of > trace filter. > > To demonstrate performance I did a synthetic test: > dev = init_net.loopback_dev; > do_gettimeofday(&start_tv); > for (i = 0; i < 1000000; i++) { > struct sk_buff *skb; > skb = netdev_alloc_skb(dev, 128); > kfree_skb(skb); > } > do_gettimeofday(&end_tv); > time = end_tv.tv_sec - start_tv.tv_sec; > time *= USEC_PER_SEC; > time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec); > > printk("1M skb alloc/free %lld (usecs)\n", time); > > no tracing > [ 33.450966] 1M skb alloc/free 145179 (usecs) > > echo 1 > enable > [ 97.186379] 1M skb alloc/free 240419 (usecs) > (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit) > > echo 'name==eth5' > filter > [ 139.644161] 1M skb alloc/free 302552 (usecs) > (running filter_match_preds() for every skb and discarding > event_buffer is even slower) > > cat bpf_prog > filter > [ 171.150566] 1M skb alloc/free 199463 (usecs) > (JITed bpf program is safely checking dev->name == eth5 and discarding) So, to do the math: tracing 'all' overhead: 95 nsecs per event tracing 'eth5 + old filter' overhead: 157 nsecs per event tracing 'eth5 + BPF filter' overhead: 54 nsecs per event So via BPF and a fairly trivial filter, we are able to reduce tracing overhead for real - while old-style filters increase it. 
In addition to that we now also have arbitrary BPF scripts, full C programs (or written in any other language from which BPF bytecode can be generated) enabled. Seems like a massive win-win scenario to me ;-) > echo 0 > enable > [ 258.073593] 1M skb alloc/free 144919 (usecs) > (tracing is disabled, performance is back to original) > > The C program compiled into BPF and then JITed into x86 is faster > than filter_match_preds() approach (199-145 msec vs 302-145 msec) > > tracing+bpf is a tool for safe read-only access to variables without > recompiling the kernel and without affecting running programs. > > BPF filters can be written manually (see > tools/bpf/trace/filter_ex1.c) or better compiled from restricted C > via GCC or LLVM > Q: What is the difference between existing BPF and extended BPF? > A: > Existing BPF insn from uapi/linux/filter.h > struct sock_filter { > __u16 code; /* Actual filter code */ > __u8 jt; /* Jump true */ > __u8 jf; /* Jump false */ > __u32 k; /* Generic multiuse field */ > }; > > Extended BPF insn from linux/bpf.h > struct bpf_insn { > __u8 code; /* opcode */ > __u8 a_reg:4; /* dest register */ > __u8 x_reg:4; /* source register */ > __s16 off; /* signed offset */ > __s32 imm; /* signed immediate constant */ > };
Re: [RFC PATCH tip 0/5] tracing filters with BPF
* Alexei Starovoitov a...@plumgrid.com wrote: Hi All, the following set of patches adds BPF support to trace filters. Trace filters can be written in C and allow safe read-only access to any kernel data structure. Like systemtap but with safety guaranteed by kernel. Very cool! (Added various other folks who might be interested in this to the Cc: list.) I have one generic concern: It would be important to make it easy to extract loaded BPF code from the kernel in source code equivalent form, which compiles to the same BPF code. I.e. I think it would be fundamentally important to make sure that this is all within the kernel's license domain, to make it very clear there can be no 'binary only' BPF scripts. By up-loading BPF into a kernel the person loading it agrees to make that code available to all users of that system who can access it, under the same license as the kernel's code (or under a more permissive license). The last thing we want is people getting funny ideas and writing drivers in BPF and hiding the code or making license claims over it ... I.e. we want to allow flexible plugins technologically, but make sure people who run into such a plugin can modify and improve it under the same license as they can modify and improve the kernel itself! [ People can still 'hide' their sekrit plugins if they want to, by not distributing them to anyone who'd redistribute it widely. ] The user can do: cat bpf_program /sys/kernel/debug/tracing/.../filter if tracing event is either static or dynamic via kprobe_events. The filter program may look like: void filter(struct bpf_context *ctx) { char devname[4] = eth5; struct net_device *dev; struct sk_buff *skb = 0; dev = (struct net_device *)ctx-regs.si; if (bpf_memcmp(dev-name, devname, 4) == 0) { char fmt[] = skb %p dev %p eth5\n; bpf_trace_printk(fmt, skb, dev, 0, 0); } } The kernel will do static analysis of bpf program to make sure that it cannot crash the kernel (doesn't have loops, valid memory/register accesses, etc). 
Then kernel will map bpf instructions to x86 instructions and let it run in the place of trace filter. To demonstrate performance I did a synthetic test: dev = init_net.loopback_dev; do_gettimeofday(start_tv); for (i = 0; i 100; i++) { struct sk_buff *skb; skb = netdev_alloc_skb(dev, 128); kfree_skb(skb); } do_gettimeofday(end_tv); time = end_tv.tv_sec - start_tv.tv_sec; time *= USEC_PER_SEC; time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec); printk(1M skb alloc/free %lld (usecs)\n, time); no tracing [ 33.450966] 1M skb alloc/free 145179 (usecs) echo 1 enable [ 97.186379] 1M skb alloc/free 240419 (usecs) (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit) echo 'name==eth5' filter [ 139.644161] 1M skb alloc/free 302552 (usecs) (running filter_match_preds() for every skb and discarding event_buffer is even slower) cat bpf_prog filter [ 171.150566] 1M skb alloc/free 199463 (usecs) (JITed bpf program is safely checking dev-name == eth5 and discarding) So, to do the math: tracing 'all' overhead: 95 nsecs per event tracing 'eth5 + old filter' overhead: 157 nsecs per event tracing 'eth5 + BPF filter' overhead: 54 nsecs per event So via BPF and a fairly trivial filter, we are able to reduce tracing overhead for real - while old-style filters. In addition to that we now also have arbitrary BPF scripts, full C programs (or written in any other language from which BPF bytecode can be generated) enabled. Seems like a massive win-win scenario to me ;-) echo 0 enable [ 258.073593] 1M skb alloc/free 144919 (usecs) (tracing is disabled, performance is back to original) The C program compiled into BPF and then JITed into x86 is faster than filter_match_preds() approach (199-145 msec vs 302-145 msec) tracing+bpf is a tool for safe read-only access to variables without recompiling the kernel and without affecting running programs. 
BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c) or better compiled from restricted C via GCC or LLVM Q: What is the difference between existing BPF and extended BPF? A: Existing BPF insn from uapi/linux/filter.h struct sock_filter { __u16 code; /* Actual filter code */ __u8jt; /* Jump true */ __u8jf; /* Jump false */ __u32 k; /* Generic multiuse field */ }; Extended BPF insn from linux/bpf.h struct bpf_insn { __u8code;/* opcode */ __u8a_reg:4; /* dest register*/ __u8x_reg:4; /* source register */ __s16 off; /* signed offset */ __s32 imm; /* signed immediate
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/03 13:28), Alexei Starovoitov wrote: Hi All, the following set of patches adds BPF support to trace filters. Trace filters can be written in C and allow safe read-only access to any kernel data structure. Like systemtap but with safety guaranteed by kernel. The user can do: cat bpf_program /sys/kernel/debug/tracing/.../filter if tracing event is either static or dynamic via kprobe_events. Oh, thank you for this great work! :D The filter program may look like: void filter(struct bpf_context *ctx) { char devname[4] = eth5; struct net_device *dev; struct sk_buff *skb = 0; dev = (struct net_device *)ctx-regs.si; if (bpf_memcmp(dev-name, devname, 4) == 0) { char fmt[] = skb %p dev %p eth5\n; bpf_trace_printk(fmt, skb, dev, 0, 0); } } The kernel will do static analysis of bpf program to make sure that it cannot crash the kernel (doesn't have loops, valid memory/register accesses, etc). Then kernel will map bpf instructions to x86 instructions and let it run in the place of trace filter. 
To demonstrate performance I did a synthetic test: dev = init_net.loopback_dev; do_gettimeofday(start_tv); for (i = 0; i 100; i++) { struct sk_buff *skb; skb = netdev_alloc_skb(dev, 128); kfree_skb(skb); } do_gettimeofday(end_tv); time = end_tv.tv_sec - start_tv.tv_sec; time *= USEC_PER_SEC; time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec); printk(1M skb alloc/free %lld (usecs)\n, time); no tracing [ 33.450966] 1M skb alloc/free 145179 (usecs) echo 1 enable [ 97.186379] 1M skb alloc/free 240419 (usecs) (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit) echo 'name==eth5' filter [ 139.644161] 1M skb alloc/free 302552 (usecs) (running filter_match_preds() for every skb and discarding event_buffer is even slower) cat bpf_prog filter [ 171.150566] 1M skb alloc/free 199463 (usecs) (JITed bpf program is safely checking dev-name == eth5 and discarding) echo 0 enable [ 258.073593] 1M skb alloc/free 144919 (usecs) (tracing is disabled, performance is back to original) The C program compiled into BPF and then JITed into x86 is faster than filter_match_preds() approach (199-145 msec vs 302-145 msec) Great! :) tracing+bpf is a tool for safe read-only access to variables without recompiling the kernel and without affecting running programs. Hmm, this feature and trace-event trigger actions can give us powerful on-the-fly scripting functionality... BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c) or better compiled from restricted C via GCC or LLVM Q: What is the difference between existing BPF and extended BPF? 
A: Existing BPF insn from uapi/linux/filter.h struct sock_filter { __u16 code; /* Actual filter code */ __u8jt; /* Jump true */ __u8jf; /* Jump false */ __u32 k; /* Generic multiuse field */ }; Extended BPF insn from linux/bpf.h struct bpf_insn { __u8code;/* opcode */ __u8a_reg:4; /* dest register*/ __u8x_reg:4; /* source register */ __s16 off; /* signed offset */ __s32 imm; /* signed immediate constant */ }; opcode encoding is the same between old BPF and extended BPF. Original BPF has two 32-bit registers. Extended BPF has ten 64-bit registers. That is the main difference. Old BPF was using jt/jf fields for jump-insn only. New BPF combines them into generic 'off' field for jump and non-jump insns. k==imm field has the same meaning. Looks very interesting. :) Thank you! -- Masami HIRAMATSU IT Management Research Dept. Linux Technology Center Hitachi, Ltd., Yokohama Research Laboratory E-mail: masami.hiramatsu...@hitachi.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, 3 Dec 2013 10:16:55 +0100 Ingo Molnar mi...@kernel.org wrote: So, to do the math: tracing 'all' overhead: 95 nsecs per event tracing 'eth5 + old filter' overhead: 157 nsecs per event tracing 'eth5 + BPF filter' overhead: 54 nsecs per event So via BPF and a fairly trivial filter, we are able to reduce tracing overhead for real - while old-style filters. Yep, seems that BPF can do what I wasn't able to do with the normal filters. Although, I haven't looked at the code yet, I'm assuming that the BPF works on the parameters passed into the trace event. The normal filters can only process the results of the trace (what's being recorded) not the parameters of the trace event itself. To get what's recorded, we need to write to the buffer first, and then we decided if we want to keep the event or not and discard the event from the buffer if we do not. That method does not reduce overhead at all, and only adds to it, as Alexei's tests have shown. The purpose of the filter was not to reduce overhead, but to reduce filling the buffer with needless data. It looks as if the BPF filter works on the parameters of the trace event and not what is written to the buffers (as they can be different). I've been looking for a way to do just that, and if this does accomplish it, I'll be very happy :-) -- Steve -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 1:16 AM, Ingo Molnar mi...@kernel.org wrote: Very cool! (Added various other folks who might be interested in this to the Cc: list.) I have one generic concern: It would be important to make it easy to extract loaded BPF code from the kernel in source code equivalent form, which compiles to the same BPF code. I.e. I think it would be fundamentally important to make sure that this is all within the kernel's license domain, to make it very clear there can be no 'binary only' BPF scripts. By up-loading BPF into a kernel the person loading it agrees to make that code available to all users of that system who can access it, under the same license as the kernel's code (or under a more permissive license). The last thing we want is people getting funny ideas and writing drivers in BPF and hiding the code or making license claims over it all makes sense. In case of kernel modules all export_symbols are accessible and module has to have kernel compatible license. Same licensing terms apply to anything else that interacts with kernel functions. In case of BPF the list of accessible functions is tiny, so it's much easier to enforce specific limited use case. For tracing filters it's just bpf_load_xx/trace_printk/dump_stack. Even if someone has funny ideas they cannot be brought to life, since drivers need a lot more than this set of functions and BPF checker will reject any attempts to call something outside of this tiny list. imo the same applies to existing BPF as well. Meaning that tcpdump filter string and seccomp filters, if distributed, has to have their source code available. I.e. we want to allow flexible plugins technologically, but make sure people who run into such a plugin can modify and improve it under the same license as they can modify and improve the kernel itself! wow. 
I guess if the whole thing takes off, we would need an in-kernel directory to store upstreamed bpf filters as well :) opcode encoding is the same between old BPF and extended BPF. Original BPF has two 32-bit registers. Extended BPF has ten 64-bit registers. That is the main difference. Old BPF was using jt/jf fields for jump-insn only. New BPF combines them into generic 'off' field for jump and non-jump insns. k==imm field has the same meaning. This only affects the internal JIT representation, not the BPF byte code, right? that is the ebpf vs bpf code difference. JIT doesn't keep another representation. Just converts it to x86 32 files changed, 3332 insertions(+), 24 deletions(-) Impressive! I'm wondering, will the new nftable code in works make use of the BPF JIT as well, or is that a separate implementation? nft is much higher level state machine customized for specific nftable use case. imo iptables/nftable rules can be compiled into extended bpf. One needs to define bpf_context and set of functions to do packet lookup via bpf_callbacks... but let's do it one step at a a time. Thanks Alexei -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 7:33 AM, Steven Rostedt rost...@goodmis.org wrote:
> On Tue, 3 Dec 2013 10:16:55 +0100 Ingo Molnar mi...@kernel.org wrote:
>> So, to do the math:
>>
>>   tracing 'all' overhead:                95 nsecs per event
>>   tracing 'eth5 + old filter' overhead: 157 nsecs per event
>>   tracing 'eth5 + BPF filter' overhead:  54 nsecs per event
>>
>> So via BPF and a fairly trivial filter, we are able to reduce tracing
>> overhead for real - while old-style filters add to it.
>
> Yep, seems that BPF can do what I wasn't able to do with the normal
> filters. Although I haven't looked at the code yet, I'm assuming that
> BPF works on the parameters passed into the trace event. The normal
> filters can only process the results of the trace (what's being
> recorded), not the parameters of the trace event itself. To get what's
> recorded, we need to write to the buffer first, and then we decide
> whether we want to keep the event, discarding it from the buffer if we
> do not. That method does not reduce overhead at all, and only adds to
> it, as Alexei's tests have shown. The purpose of the filter was not to
> reduce overhead, but to reduce filling the buffer with needless data.

Precisely. The assumption is that filters will filter out the majority of events.

The filter takes pt_regs as input, has to interpret them, and calls bpf_trace_printk() if it really wants to store something for a human to see. We could extend bpf trace filters to return true/false to indicate whether the TP_printk format specified as part of the event should be printed as well, but imo that's unnecessary. When I was using bpf filters to debug networking bits, I didn't need the printk format of the event. I only used the event as an entry point, filtering things out and printing different fields than the initial event. More like what developers do when they sprinkle trace_printk()/dump_stack() through the code while debugging.

The only inconvenience so far is having to know how parameters get into registers: on x86-64, arg1 is in rdi, arg2 is in rsi, ... I want to improve that after the first step is done. In the proposed patches bpf_context == pt_regs at the event entry point. It would be cleaner to have struct {arg1, arg2, ...} as bpf_context instead, but that needed more code and I wanted to keep the first patch to the minimum.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
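As an editorial aside (not part of the original patches), the register-to-argument mapping described above can be sketched in plain userspace C. The names below (mock_pt_regs, mock_bpf_context, regs_to_ctx) are hypothetical; only the x86-64 SysV mapping itself (arg1=rdi, arg2=rsi, arg3=rdx, arg4=rcx, arg5=r8, arg6=r9) is taken as given.

```c
#include <assert.h>

/* Userspace mock of the x86-64 integer-argument registers from pt_regs
 * (field names follow arch/x86/include/asm/ptrace.h). */
struct mock_pt_regs {
    unsigned long di, si, dx, cx, r8, r9;
};

/* The "cleaner" context proposed in the mail above: explicit args
 * instead of raw registers. Hypothetical layout. */
struct mock_bpf_context {
    unsigned long arg1, arg2, arg3, arg4, arg5, arg6;
};

/* Translate registers to arguments per the x86-64 SysV calling
 * convention: arg1=rdi, arg2=rsi, arg3=rdx, arg4=rcx, arg5=r8, arg6=r9. */
static struct mock_bpf_context regs_to_ctx(const struct mock_pt_regs *r)
{
    struct mock_bpf_context c = {
        .arg1 = r->di, .arg2 = r->si, .arg3 = r->dx,
        .arg4 = r->cx, .arg5 = r->r8, .arg6 = r->r9,
    };
    return c;
}
```

With such a translation done once at the probe entry point, filter authors would no longer need to remember which register holds which argument.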
Re: [RFC PATCH tip 0/5] tracing filters with BPF
Alexei Starovoitov a...@plumgrid.com writes:

Can you do some performance comparison compared to e.g. ktap? How much faster is it?

While it sounds interesting, I would strongly advise making this capability available only to root. Traditionally, lots of complex byte-code languages that were designed to be safe and verifiable weren't really; e.g. I managed to crash things with "safe" systemtap multiple times. And we all know what happened to Java. So the likelihood of this having some hole somewhere (either in the byte code or in some library function) is high.

-Andi
Re: [RFC PATCH tip 0/5] tracing filters with BPF
(2013/12/04 3:26), Alexei Starovoitov wrote:
[snip]
> The only inconvenience so far is having to know how parameters get into
> registers: on x86-64, arg1 is in rdi, arg2 is in rsi, ... I want to
> improve that after the first step is done.

Actually, that part is done by perf-probe and the ftrace dynamic events (kernel/trace/trace_probe.c). I think this generic BPF is good for re-implementing the fetch methods. :)

Thank you,

--
Masami HIRAMATSU
IT Management Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu...@hitachi.com
Re: [RFC PATCH tip 0/5] tracing filters with BPF
On Tue, Dec 3, 2013 at 4:01 PM, Andi Kleen a...@firstfloor.org wrote:
> Can you do some performance comparison compared to e.g. ktap? How much
> faster is it?

imo the most interesting ktap scripts (like kmalloc-top.kp) need tables and timers. Tables are almost ready for prime time, but timers I prefer to keep out of the kernel. I would like a bpf filter to fill tables with interesting data in the kernel, up to a predefined limit, and to periodically read and clear the tables from userspace. This way I will be able to do nettop.stp-, iotop.stp-like programs.

So I'm still thinking about what a clean kernel/user interface for bpf-defined tables should be. The format of the keys and elements of a table is defined within the bpf program. During load of the bpf program the tables are allocated, and the bpf program can then lookup/update them. At the same time the corresponding userspace program can read the tables of this particular bpf program over netlink. Creating its own debugfs files for every filter feels too slow and feature-limited, since files are an all-or-nothing interface. Netlink access to bpf tables feels cleaner. Userspace will use libmnl to access them. Other ideas?

In the meantime I'll do some simple trace probe:xx { print } performance test…

> While it sounds interesting, I would strongly advise making this
> capability available only to root. Traditionally, lots of complex
> byte-code languages that were designed to be safe and verifiable
> weren't really; e.g. I managed to crash things with "safe" systemtap
> multiple times. And we all know what happened to Java. So the
> likelihood of this having some hole somewhere (either in the byte code
> or in some library function) is high.

Tracing filters are root-only today and should stay this way. As far as the safety of bpf… hard to argue with the systemtap point ;) Though existing bpf is generally accepted to be safe, extended bpf needs time to prove itself.
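The bounded in-kernel table proposed above can be sketched in userspace C. Everything here is hypothetical (the actual patches define no table API yet): a fixed-capacity open-addressed key→counter table that never allocates after load, with a userspace-style "read and clear" that resets counters in place, as a nettop/iotop-like tool would do periodically over netlink.

```c
#include <assert.h>
#include <string.h>

/* Sketch, not kernel code: a bounded key->counter table like the one
 * proposed for bpf filters. Fixed capacity, no allocation after load. */
#define TBL_CAP 64

struct tbl_entry { unsigned long key; unsigned long count; int used; };
struct bpf_table { struct tbl_entry slots[TBL_CAP]; };

/* Increment the counter for 'key'; return -1 when the table is full
 * (updates beyond the predefined limit are dropped, never grown). */
static int tbl_update(struct bpf_table *t, unsigned long key)
{
    unsigned int i, idx = (unsigned int)(key % TBL_CAP);
    for (i = 0; i < TBL_CAP; i++) {
        struct tbl_entry *e = &t->slots[(idx + i) % TBL_CAP];
        if (e->used && e->key == key) { e->count++; return 0; }
        if (!e->used) { e->used = 1; e->key = key; e->count = 1; return 0; }
    }
    return -1; /* full: enforce the limit instead of allocating */
}

/* "Read and clear" for one key: return the accumulated count and reset
 * it to zero (the slot stays claimed, so probing is unaffected). */
static unsigned long tbl_read_and_clear(struct bpf_table *t, unsigned long key)
{
    unsigned int i, idx = (unsigned int)(key % TBL_CAP);
    for (i = 0; i < TBL_CAP; i++) {
        struct tbl_entry *e = &t->slots[(idx + i) % TBL_CAP];
        if (e->used && e->key == key) {
            unsigned long c = e->count;
            e->count = 0;
            return c;
        }
        if (!e->used)
            break;
    }
    return 0;
}
```

In the proposed split, tbl_update() would run on the kernel side inside the filter, while tbl_read_and_clear() models what the userspace consumer does after fetching the table contents over netlink.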
[RFC PATCH tip 0/5] tracing filters with BPF
Hi All,

the following set of patches adds BPF support to trace filters. Trace filters can be written in C and allow safe read-only access to any kernel data structure. Like systemtap, but with safety guaranteed by the kernel.

The user can do:
  cat bpf_program > /sys/kernel/debug/tracing/.../filter
if the tracing event is either static or dynamic via kprobe_events.

The filter program may look like:

  void filter(struct bpf_context *ctx)
  {
          char devname[4] = "eth5";
          struct net_device *dev;
          struct sk_buff *skb = 0;

          dev = (struct net_device *)ctx->regs.si;
          if (bpf_memcmp(dev->name, devname, 4) == 0) {
                  char fmt[] = "skb %p dev %p eth5\n";
                  bpf_trace_printk(fmt, skb, dev, 0, 0);
          }
  }

The kernel will do static analysis of the bpf program to make sure that it cannot crash the kernel (no loops, valid memory/register accesses, etc). Then the kernel will map bpf instructions to x86 instructions and let it run in place of the trace filter.

To demonstrate performance I did a synthetic test:

  dev = init_net.loopback_dev;
  do_gettimeofday(&start_tv);
  for (i = 0; i < 1000000; i++) {
          struct sk_buff *skb;
          skb = netdev_alloc_skb(dev, 128);
          kfree_skb(skb);
  }
  do_gettimeofday(&end_tv);
  time = end_tv.tv_sec - start_tv.tv_sec;
  time *= USEC_PER_SEC;
  time += (long long)((long)end_tv.tv_usec - (long)start_tv.tv_usec);
  printk("1M skb alloc/free %lld (usecs)\n", time);

no tracing
  [   33.450966] 1M skb alloc/free 145179 (usecs)

echo 1 > enable
  [   97.186379] 1M skb alloc/free 240419 (usecs)
  (tracing slows down kfree_skb() due to event_buffer_lock/buffer_unlock_commit)

echo 'name==eth5' > filter
  [  139.644161] 1M skb alloc/free 302552 (usecs)
  (running filter_match_preds() for every skb and discarding the event_buffer is even slower)

cat bpf_prog > filter
  [  171.150566] 1M skb alloc/free 199463 (usecs)
  (the JITed bpf program is safely checking dev->name == eth5 and discarding)

echo 0 > enable
  [  258.073593] 1M skb alloc/free 144919 (usecs)
  (tracing is disabled, performance is back to original)

The C program compiled into BPF and then JITed into x86 is faster than the filter_match_preds() approach (199-145 msec vs 302-145 msec).

tracing+bpf is a tool for safe read-only access to variables without recompiling the kernel and without affecting running programs. BPF filters can be written manually (see tools/bpf/trace/filter_ex1.c) or, better, compiled from restricted C via GCC or LLVM.

Q: What is the difference between existing BPF and extended BPF?
A: Existing BPF insn from uapi/linux/filter.h:

  struct sock_filter {
          __u16  code;  /* Actual filter code */
          __u8   jt;    /* Jump true */
          __u8   jf;    /* Jump false */
          __u32  k;     /* Generic multiuse field */
  };

Extended BPF insn from linux/bpf.h:

  struct bpf_insn {
          __u8   code;      /* opcode */
          __u8   a_reg:4;   /* dest register */
          __u8   x_reg:4;   /* source register */
          __s16  off;       /* signed offset */
          __s32  imm;       /* signed immediate constant */
  };

Opcode encoding is the same between old BPF and extended BPF. Original BPF has two 32-bit registers; extended BPF has ten 64-bit registers. That is the main difference. Old BPF used the jt/jf fields for jump insns only; new BPF combines them into a generic 'off' field for jump and non-jump insns. The k field has the same meaning as imm.
Thanks

Alexei Starovoitov (5):
  Extended BPF core framework
  Extended BPF JIT for x86-64
  Extended BPF (64-bit BPF) design document
  use BPF in tracing filters
  tracing filter examples in BPF

 Documentation/bpf_jit.txt          |  204 +++
 arch/x86/Kconfig                   |    1 +
 arch/x86/net/Makefile              |    1 +
 arch/x86/net/bpf64_jit_comp.c      |  625
 arch/x86/net/bpf_jit_comp.c        |   23 +-
 arch/x86/net/bpf_jit_comp.h        |   35 ++
 include/linux/bpf.h                |  149 +
 include/linux/bpf_jit.h            |  129 +
 include/linux/ftrace_event.h       |    3 +
 include/trace/bpf_trace.h          |   27 +
 include/trace/ftrace.h             |   14 +
 kernel/Makefile                    |    1 +
 kernel/bpf_jit/Makefile            |    3 +
 kernel/bpf_jit/bpf_check.c         | 1054 ++
 kernel/bpf_jit/bpf_run.c           |  452 +++
 kernel/trace/Kconfig               |    1 +
 kernel/trace/Makefile              |    1 +
 kernel/trace/bpf_trace_callbacks.c |  191 ++
 kernel/trace/trace.c               |    7 +
 kernel/trace/trace.h               |   11 +-
 kernel/trace/trace_events.c        |    9 +-
 kernel/trace/trace_events_filter.c |   61 +-
 kernel/trace/trace_kprobe.c        |    6 +
 lib/Kconfig.debug                  |   15 +
 tools/bpf/llvm/README.txt          |    6 +
 tools/bpf/trace/Makefile
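For reference, the per-event overheads Ingo quotes elsewhere in the thread (95/157/54 nsecs) follow directly from the benchmark numbers in the cover letter: subtract the no-tracing baseline from each total and divide by the 1M iterations. A trivial check (the helper name is hypothetical):

```c
#include <assert.h>

/* Per-event overhead in nanoseconds from a 1M-iteration benchmark:
 * (total_usec - base_usec) usec over 1e6 events, times 1000 ns/usec,
 * reduces to a division by 1000. */
static long overhead_ns_per_event(long total_usec, long base_usec)
{
    return (total_usec - base_usec) / 1000;
}
```

Plugging in the cover letter's totals (baseline 145179 usecs): 240419 gives ~95 ns/event for bare tracing, 302552 gives ~157 ns/event with the old filter, and 199463 gives ~54 ns/event with the BPF filter.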