Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
On 03/04/2014 06:53 PM, Alexei Starovoitov wrote:
> On Tue, Mar 4, 2014 at 6:28 AM, Hagen Paul Pfeifer wrote:
>> If all issues raised by Daniel are addressed:
>>
>> Acked-by: Hagen Paul Pfeifer
>
> Thanks!
>
>> But ...
>>
>>> Future work:
>>>
>>> 0. seccomp
>>>
>>> 1. add extended BPF JIT for x86_64
>>>
>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>    programs can be loaded through the old sk_attach_filter() and
>>>    sk_unattached_filter_create() interfaces
>>>
>>> 3. tracing filters systemtap-like with extended BPF
>>>
>>> 4. OVS with extended BPF
>>>
>>> 5. nftables with extended BPF
>>
>> ... this is shit (not your fault). (JITed) BPF evolved in a direction
>> which is just not the right way to do it. You try to fix things, work
>> around architectural shortcomings of BPF, perf issues, and so on.
>>
>> The right direction is to write a new general purpose in-kernel
>> interpreter from scratch. Capability layers should provide a
>> compatible API for BPF and

I think ebpf would have the potential to be *the* general purpose
in-kernel interpreter actually (if we undertake all this effort of
migration), as it's already designed to work in a more generic context
than the traditional interpreter, which is restricted to skb (or NULL).

>> seccomp. You have the knowledge to do exactly this, you nearly
>> already did this - you should start this undertaking!
>
> this insn set evolved over a few years. Initially we had an nft-like
> high level state machine, but it wasn't fast; then a kprobe-like pure
> x86_64 insn set, which was fast but very hard to analyze from a safety
> point of view; then a reduced x86-64 insn set, and finally ebpf.
> I think any brand new instruction set will have a steep learning
> curve, just because it's all new. ebpf tries to reuse as much as
> possible: opcode encoding is the same, instruction size is fixed at
> 8 bytes, and so on. Yeah, these restrictions make a few things not
> 100% optimal, but imo a common look and feel is more important.
>
> What ebpf has already should be enough to do all of the above 'future
> work'. Built-in JIT-ability of ebpf is the key to performance. The
> ability to call some kernel functions from ebpf makes it ultimately
> extensible. socket filters and seccomp don't use this feature yet,
> but tracing filters will.
>
> Regards,
> Alexei
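For reference, the "same opcode encoding, fixed 8-byte instruction
size" point maps to two layouts of equal size. struct sock_filter is
the existing uapi definition; the extended layout below is a sketch
reconstructed from the thread's description (field names follow the
patch posting, but treat them as assumptions, since the relevant hunk
is truncated in this archive):

  /* classic BPF insn (uapi/linux/filter.h) -- 8 bytes */
  struct sock_filter {
          __u16   code;   /* opcode */
          __u8    jt;     /* jump offset if condition is true */
          __u8    jf;     /* jump offset if condition is false */
          __u32   k;      /* generic immediate/constant field */
  };

  /* extended BPF insn -- also 8 bytes; explicit registers, one jump
   * target plus fall-through instead of jt/jf */
  struct sock_filter_ext {
          __u8    code;           /* opcode, encoding shared with old BPF */
          __u8    a_reg:4;        /* destination register */
          __u8    x_reg:4;        /* source register */
          __s16   off;            /* signed jump/memory offset */
          __s32   imm;            /* signed immediate constant */
  };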
Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
On 03/04/2014 06:09 PM, Alexei Starovoitov wrote:
> On Tue, Mar 4, 2014 at 1:59 AM, Daniel Borkmann wrote:
> ...
>> Hmm, so the case statement is about BPF_RET | BPF_A and
>> BPF_RET | BPF_K, but BPF_RET | BPF_X is not mentioned. However, in
>> the BPF_SRC(fp->code) selection you fall back to BPF_X if it doesn't
>> equal BPF_K? Is that correct? And, you probably also need to handle
>> BPF_RET | BPF_X ? :)
>
> that design choice of original BPF always puzzled me. The BPF_A macro
> is used in only one insn: BPF_RET + BPF_A; all other insns use BPF_K
> and BPF_X. And though the comment in uapi/filter.h says "ret - BPF_K
> and BPF_X also apply", this is not true, since sk_chk_filter() only
> allows ret+a and ret+k.
> libpcap is equally confused. It never generates ret+x, but it has a
> few places in the code that can recognize it. I guess that's an
> artifact of the distant past.

Good point, ret+a and ret+k are the main users anyway, though we could
fix that limitation actually. ;)

> ebpf has only one RET insn that takes register R0 and returns it.
> That is similar to a real CPU 'ret' insn and is done to make ebpf
> easier to generate from a gcc/llvm point of view. The ebpf jit
> converts 'ret' into 'leave; ret' on x86_64, so original ret+k and
> ret+a are converted into 'mov r0, [a or k]; ret r0'.
>
> btw, if there is interest I can put the ebpf testsuite into tools/net/

Yes, please. Would be great if you can place the test suite under:

  tools/testing/selftests/net/bpf/

I believe some stuff there could get its own folder, e.g. "packet" for
PF_PACKET test cases etc., so that we can easily arrange them.
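A minimal sketch of the ret remap just described -- classic ret+k /
ret+a becoming 'mov r0, ...; ret' -- assuming the sock_filter_ext
layout above. The emit() helper, the register numbers and the BPF_MOV
opcode spelling are illustrative, not the patch's actual code;
BPF_CLASS()/BPF_RVAL() are the existing classic-BPF macros:

  static void emit(struct sock_filter_ext *insn, __u8 code,
                   __u8 dst, __u8 src, __s16 off, __s32 imm)
  {
          insn->code  = code;
          insn->a_reg = dst;
          insn->x_reg = src;
          insn->off   = off;
          insn->imm   = imm;
  }

  /* R6 stands in for wherever the converter keeps the old accumulator A */
  static int convert_ret(const struct sock_filter *fp,
                         struct sock_filter_ext *to)
  {
          if (BPF_CLASS(fp->code) != BPF_RET)
                  return -EINVAL;

          if (BPF_RVAL(fp->code) == BPF_K)
                  emit(to, BPF_ALU | BPF_MOV | BPF_K, 0, 0, 0, fp->k); /* mov r0, K */
          else    /* BPF_RET | BPF_A; sk_chk_filter() rejects ret+x */
                  emit(to, BPF_ALU | BPF_MOV | BPF_X, 0, 6, 0, 0);     /* mov r0, r6 */

          emit(to + 1, BPF_RET, 0, 0, 0, 0);      /* single ebpf ret: return r0 */
          return 0;
  }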
Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
On Tue, Mar 4, 2014 at 6:28 AM, Hagen Paul Pfeifer wrote:
> If all issues raised by Daniel are addressed:
>
> Acked-by: Hagen Paul Pfeifer

Thanks!

> But ...
>
>> Future work:
>>
>> 0. seccomp
>>
>> 1. add extended BPF JIT for x86_64
>>
>> 2. add inband old/new demux and extended BPF verifier, so that new
>>    programs can be loaded through the old sk_attach_filter() and
>>    sk_unattached_filter_create() interfaces
>>
>> 3. tracing filters systemtap-like with extended BPF
>>
>> 4. OVS with extended BPF
>>
>> 5. nftables with extended BPF
>
> ... this is shit (not your fault). (JITed) BPF evolved in a direction
> which is just not the right way to do it. You try to fix things, work
> around architectural shortcomings of BPF, perf issues, and so on.
>
> The right direction is to write a new general purpose in-kernel
> interpreter from scratch. Capability layers should provide a
> compatible API for BPF and seccomp. You have the knowledge to do
> exactly this, you nearly already did this - you should start this
> undertaking!

this insn set evolved over a few years. Initially we had an nft-like
high level state machine, but it wasn't fast; then a kprobe-like pure
x86_64 insn set, which was fast but very hard to analyze from a safety
point of view; then a reduced x86-64 insn set, and finally ebpf.

I think any brand new instruction set will have a steep learning curve,
just because it's all new. ebpf tries to reuse as much as possible:
opcode encoding is the same, instruction size is fixed at 8 bytes, and
so on. Yeah, these restrictions make a few things not 100% optimal,
but imo a common look and feel is more important.

What ebpf has already should be enough to do all of the above 'future
work'. Built-in JIT-ability of ebpf is the key to performance. The
ability to call some kernel functions from ebpf makes it ultimately
extensible. socket filters and seccomp don't use this feature yet, but
tracing filters will.

Regards,
Alexei

> --
> Hagen Paul Pfeifer
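The "call some kernel functions" point relies on a fixed
register-passing convention. The split below is an assumption based on
the series' documentation patch (it is not stated in this mail), and
the opcode spelling is likewise illustrative:

  /*
   * Assumed ebpf register convention for zero-overhead kernel calls:
   *   R0      - return value of the program and of called functions
   *   R1-R5   - function call arguments, scratch across calls
   *   R6-R9   - callee-saved registers
   *   R10     - read-only frame pointer
   */
  struct sock_filter_ext call_insn = {
          .code = BPF_JMP | BPF_CALL,     /* hypothetical call opcode */
          .imm  = 1,                      /* hypothetical kernel function id */
  };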
Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
On Tue, Mar 4, 2014 at 1:59 AM, Daniel Borkmann wrote:
> On 03/04/2014 06:18 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF extends old BPF in the following ways:
>> - from 2 to 10 registers
>>   Original BPF has two registers (A and X) and a hidden frame pointer.
>>   Extended BPF has ten registers and a read-only frame pointer.
>> - from 32-bit registers to 64-bit registers
>>   semantics of old 32-bit ALU operations are preserved via 32-bit
>>   subregisters
>> - if (cond) jump_true; else jump_false;
>>   old BPF insns are replaced with:
>>   if (cond) jump_true; /* else fallthrough */
>> - adds signed > and >= insns
>> - 16 4-byte stack slots for register spill-fill replaced with
>>   up to 512 bytes of multi-use stack space
>> - introduces bpf_call insn and register passing convention for zero
>>   overhead calls from/to other kernel functions (not part of this patch)
>> - adds arithmetic right shift insn
>> - adds swab32/swab64 insns
>> - adds atomic_add insn
>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>
>> Extended BPF is designed to be JITed with one-to-one mapping, which
>> allows GCC/LLVM backends to generate optimized BPF code that performs
>> almost as fast as natively compiled code.
>>
>> sk_convert_filter() remaps old style insns into extended:
>> 'sock_filter' instructions are remapped on the fly to
>> 'sock_filter_ext' extended instructions when
>> sysctl net.core.bpf_ext_enable=1
>>
>> An old filter comes through sk_attach_filter() or
>> sk_unattached_filter_create():
>>   if (bpf_ext_enable) {
>>       convert to new
>>       sk_chk_filter()         - check old bpf
>>       use sk_run_filter_ext() - new interpreter
>>   } else {
>>       sk_chk_filter()         - check old bpf
>>       if (bpf_jit_enable)
>>           use old jit
>>       else
>>           use sk_run_filter() - old interpreter
>>   }
>>
>> The sk_run_filter_ext() interpreter is noticeably faster than
>> sk_run_filter() for two reasons:
>>
>> 1. fall-through jumps
>>    Old BPF jump instructions are forced to go either to the 'true' or
>>    the 'false' branch, which causes a branch-miss penalty. Extended
>>    BPF jump instructions have one branch and fall-through, which fits
>>    CPU branch predictor logic better. 'perf stat' shows a drastic
>>    difference in branch-misses.
>>
>> 2. jump-threaded implementation of the interpreter vs switch statement
>>    Instead of a single tablejump at the top of the 'switch' statement,
>>    GCC will generate multiple tablejump instructions, which helps the
>>    CPU branch predictor.
>>
>> Performance of two BPF filters generated by libpcap was measured
>> on x86_64, i386 and arm32.
>>
>> fprog #1 is taken from Documentation/networking/filter.txt:
>>   tcpdump -i eth0 port 22 -dd
>>
>> fprog #2 is taken from 'man tcpdump':
>>   tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>     ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>
>> Other libpcap programs have similar performance differences.
>>
>> Raw performance data from BPF micro-benchmark:
>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss);
>> time in nsec per call, smaller is better
>>
>> --x86_64--
>>               fprog #1   fprog #1    fprog #2   fprog #2
>>               cache-hit  cache-miss  cache-hit  cache-miss
>> old BPF           90        101         192        202
>> ext BPF           31         71          47         97
>> old BPF jit       12         34          17         44
>> ext BPF jit            TBD
>>
>> --i386--
>>               fprog #1   fprog #1    fprog #2   fprog #2
>>               cache-hit  cache-miss  cache-hit  cache-miss
>> old BPF          107        136         227        252
>> ext BPF           40        119          69        172
>>
>> --arm32--
>>               fprog #1   fprog #1    fprog #2   fprog #2
>>               cache-hit  cache-miss  cache-hit  cache-miss
>> old BPF          202        300         475        540
>> ext BPF          139        270         296        470
>> old BPF jit       26        182          37        202
>> new BPF jit            TBD
>>
>> Tested with trinify BPF fuzzer
>>
>> Future work:
>>
>> 0. seccomp
>>
>> 1. add extended BPF JIT for x86_64
>>
>> 2. add inband old/new demux and extended BPF verifier, so that new
>>    programs can be loaded through the old sk_attach_filter() and
>>    sk_unattached_filter_create() interfaces
>>
>> 3. tracing filters systemtap-like with extended BPF
>>
>> 4. OVS with extended BPF
>>
>> 5. nftables with extended BPF
>>
>> Signed-off-by: Alexei Starovoitov
>
> Looks great, imho, some comments/questions inline:
>
> Nit: subject line of your patches should be, e.g.
>
>   "filter: add Extended BPF interpreter and converter"
>   "doc: filter: add Extended BPF documentation"
>   ...
>
> so first "<subsystem>: <summary phrase>".

sure. Will send v5 :)

>> ---
>>  include/linux/filter.h      |    8 +-
>>  include/linux/netdevice.h   |    1 +
>>  include/uapi/linux/filter.h |   34 +-
>>  net/core/filter.c           |  802 ++-
>>  net/core/sysctl_net_core.c  |    7 +
>>  5 files changed, 830 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/filter.h
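To make the fall-through point in the changelog above concrete, compare
a conditional jump in both encodings (the register number, offsets and
field values are made up for the example):

  /* classic BPF: every jump must name both targets */
  struct sock_filter old_jeq = {
          .code = BPF_JMP | BPF_JEQ | BPF_K,
          .jt = 1, .jf = 4,       /* branches in both outcomes */
          .k = 22,                /* e.g. port 22 (0x16) */
  };

  /* extended BPF: one target; the not-taken case falls through to the
   * next insn, which fits CPU branch predictors much better */
  struct sock_filter_ext new_jeq = {
          .code = BPF_JMP | BPF_JEQ | BPF_K,
          .a_reg = 1,             /* compare r1 against imm */
          .off = 3,               /* taken: skip the next 3 insns */
          .imm = 22,
  };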
Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
If all issues raised by Daniel are addressed:

Acked-by: Hagen Paul Pfeifer

But ...

> Future work:
>
> 0. seccomp
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new
>    programs can be loaded through the old sk_attach_filter() and
>    sk_unattached_filter_create() interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF

... this is shit (not your fault). (JITed) BPF evolved in a direction
which is just not the right way to do it. You try to fix things, work
around architectural shortcomings of BPF, perf issues, and so on.

The right direction is to write a new general purpose in-kernel
interpreter from scratch. Capability layers should provide a compatible
API for BPF and seccomp. You have the knowledge to do exactly this, you
nearly already did this - you should start this undertaking!

--
Hagen Paul Pfeifer
Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter
On 03/04/2014 06:18 AM, Alexei Starovoitov wrote:
> Extended BPF extends old BPF in the following ways:
> - from 2 to 10 registers
>   Original BPF has two registers (A and X) and a hidden frame pointer.
>   Extended BPF has ten registers and a read-only frame pointer.
> - from 32-bit registers to 64-bit registers
>   semantics of old 32-bit ALU operations are preserved via 32-bit
>   subregisters
> - if (cond) jump_true; else jump_false;
>   old BPF insns are replaced with:
>   if (cond) jump_true; /* else fallthrough */
> - adds signed > and >= insns
> - 16 4-byte stack slots for register spill-fill replaced with
>   up to 512 bytes of multi-use stack space
> - introduces bpf_call insn and register passing convention for zero
>   overhead calls from/to other kernel functions (not part of this patch)
> - adds arithmetic right shift insn
> - adds swab32/swab64 insns
> - adds atomic_add insn
> - old tax/txa insns are replaced with 'mov dst,src' insn
>
> Extended BPF is designed to be JITed with one-to-one mapping, which
> allows GCC/LLVM backends to generate optimized BPF code that performs
> almost as fast as natively compiled code.
>
> sk_convert_filter() remaps old style insns into extended:
> 'sock_filter' instructions are remapped on the fly to
> 'sock_filter_ext' extended instructions when
> sysctl net.core.bpf_ext_enable=1
>
> An old filter comes through sk_attach_filter() or
> sk_unattached_filter_create():
>   if (bpf_ext_enable) {
>       convert to new
>       sk_chk_filter()         - check old bpf
>       use sk_run_filter_ext() - new interpreter
>   } else {
>       sk_chk_filter()         - check old bpf
>       if (bpf_jit_enable)
>           use old jit
>       else
>           use sk_run_filter() - old interpreter
>   }
>
> The sk_run_filter_ext() interpreter is noticeably faster than
> sk_run_filter() for two reasons:
>
> 1. fall-through jumps
>    Old BPF jump instructions are forced to go either to the 'true' or
>    the 'false' branch, which causes a branch-miss penalty. Extended
>    BPF jump instructions have one branch and fall-through, which fits
>    CPU branch predictor logic better. 'perf stat' shows a drastic
>    difference in branch-misses.
>
> 2. jump-threaded implementation of the interpreter vs switch statement
>    Instead of a single tablejump at the top of the 'switch' statement,
>    GCC will generate multiple tablejump instructions, which helps the
>    CPU branch predictor.
>
> Performance of two BPF filters generated by libpcap was measured
> on x86_64, i386 and arm32.
>
> fprog #1 is taken from Documentation/networking/filter.txt:
>   tcpdump -i eth0 port 22 -dd
>
> fprog #2 is taken from 'man tcpdump':
>   tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>     ((tcp[12]&0xf0)>>2)) != 0)' -dd
>
> Other libpcap programs have similar performance differences.
>
> Raw performance data from BPF micro-benchmark:
> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss);
> time in nsec per call, smaller is better
>
> --x86_64--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF           90        101         192        202
> ext BPF           31         71          47         97
> old BPF jit       12         34          17         44
> ext BPF jit            TBD
>
> --i386--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF          107        136         227        252
> ext BPF           40        119          69        172
>
> --arm32--
>               fprog #1   fprog #1    fprog #2   fprog #2
>               cache-hit  cache-miss  cache-hit  cache-miss
> old BPF          202        300         475        540
> ext BPF          139        270         296        470
> old BPF jit       26        182          37        202
> new BPF jit            TBD
>
> Tested with trinify BPF fuzzer
>
> Future work:
>
> 0. seccomp
>
> 1. add extended BPF JIT for x86_64
>
> 2. add inband old/new demux and extended BPF verifier, so that new
>    programs can be loaded through the old sk_attach_filter() and
>    sk_unattached_filter_create() interfaces
>
> 3. tracing filters systemtap-like with extended BPF
>
> 4. OVS with extended BPF
>
> 5. nftables with extended BPF
>
> Signed-off-by: Alexei Starovoitov

Looks great, imho, some comments/questions inline:

Nit: subject line of your patches should be, e.g.

  "filter: add Extended BPF interpreter and converter"
  "doc: filter: add Extended BPF documentation"
  ...

so first "<subsystem>: <summary phrase>".

> ---
>  include/linux/filter.h      |    8 +-
>  include/linux/netdevice.h   |    1 +
>  include/uapi/linux/filter.h |   34 +-
>  net/core/filter.c           |  802 ++-
>  net/core/sysctl_net_core.c  |    7 +
>  5 files changed, 830 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index e568c8ef896b..0e84ff6e991b 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -52,7 +52,13 @@ extern int sk_detach_filter(struct sock *sk);
>  extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
>  extern int sk_get_filter(struct sock *sk,
>  			 struct sock_filter __user *filter, unsigned len);
>  extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
> +/* function remaps
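A hypothetical caller of the sk_convert_filter() prototype visible in
the hunk above. The two-pass pattern (NULL output buffer first, to size
the result) is an assumption, since the function body is not part of
the quoted diff:

  static struct sock_filter_ext *convert(struct sock_filter *old, int len)
  {
          struct sock_filter_ext *new_prog;
          int new_len, err;

          /* pass 1: compute the required number of extended insns */
          err = sk_convert_filter(old, len, NULL, &new_len);
          if (err)
                  return NULL;

          new_prog = kcalloc(new_len, sizeof(*new_prog), GFP_KERNEL);
          if (!new_prog)
                  return NULL;

          /* pass 2: actually emit the remapped program */
          err = sk_convert_filter(old, len, new_prog, &new_len);
          if (err) {
                  kfree(new_prog);
                  return NULL;
          }
          return new_prog;
  }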
[PATCH v4 net-next 1/3] Extended BPF interpreter and converter
Extended BPF extends old BPF in the following ways:
- from 2 to 10 registers
  Original BPF has two registers (A and X) and a hidden frame pointer.
  Extended BPF has ten registers and a read-only frame pointer.
- from 32-bit registers to 64-bit registers
  semantics of old 32-bit ALU operations are preserved via 32-bit
  subregisters
- if (cond) jump_true; else jump_false;
  old BPF insns are replaced with:
  if (cond) jump_true; /* else fallthrough */
- adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced with
  up to 512 bytes of multi-use stack space
- introduces bpf_call insn and register passing convention for zero
  overhead calls from/to other kernel functions (not part of this patch)
- adds arithmetic right shift insn
- adds swab32/swab64 insns
- adds atomic_add insn
- old tax/txa insns are replaced with 'mov dst,src' insn

Extended BPF is designed to be JITed with one-to-one mapping, which
allows GCC/LLVM backends to generate optimized BPF code that performs
almost as fast as natively compiled code.

sk_convert_filter() remaps old style insns into extended:
'sock_filter' instructions are remapped on the fly to
'sock_filter_ext' extended instructions when
sysctl net.core.bpf_ext_enable=1

An old filter comes through sk_attach_filter() or
sk_unattached_filter_create():
  if (bpf_ext_enable) {
      convert to new
      sk_chk_filter()         - check old bpf
      use sk_run_filter_ext() - new interpreter
  } else {
      sk_chk_filter()         - check old bpf
      if (bpf_jit_enable)
          use old jit
      else
          use sk_run_filter() - old interpreter
  }

The sk_run_filter_ext() interpreter is noticeably faster than
sk_run_filter() for two reasons:

1. fall-through jumps
   Old BPF jump instructions are forced to go either to the 'true' or
   the 'false' branch, which causes a branch-miss penalty. Extended BPF
   jump instructions have one branch and fall-through, which fits CPU
   branch predictor logic better. 'perf stat' shows a drastic
   difference in branch-misses.

2. jump-threaded implementation of the interpreter vs switch statement
   Instead of a single tablejump at the top of the 'switch' statement,
   GCC will generate multiple tablejump instructions, which helps the
   CPU branch predictor.

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
  tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
  tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
    ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from BPF micro-benchmark:
SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss);
time in nsec per call, smaller is better

--x86_64--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF           90        101         192        202
ext BPF           31         71          47         97
old BPF jit       12         34          17         44
ext BPF jit            TBD

--i386--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF          107        136         227        252
ext BPF           40        119          69        172

--arm32--
              fprog #1   fprog #1    fprog #2   fprog #2
              cache-hit  cache-miss  cache-hit  cache-miss
old BPF          202        300         475        540
ext BPF          139        270         296        470
old BPF jit       26        182          37        202
new BPF jit            TBD

Tested with trinify BPF fuzzer

Future work:

0. seccomp

1. add extended BPF JIT for x86_64

2. add inband old/new demux and extended BPF verifier, so that new
   programs can be loaded through the old sk_attach_filter() and
   sk_unattached_filter_create() interfaces

3. tracing filters systemtap-like with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF

Signed-off-by: Alexei Starovoitov
---
 include/linux/filter.h      |    8 +-
 include/linux/netdevice.h   |    1 +
 include/uapi/linux/filter.h |   34 +-
 net/core/filter.c           |  802 ++-
 net/core/sysctl_net_core.c  |    7 +
 5 files changed, 830 insertions(+), 22 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..0e84ff6e991b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -52,7 +52,13 @@ extern int sk_detach_filter(struct sock *sk);
 extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk,
 			 struct sock_filter __user *filter, unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+/* function remaps 'sock_filter' insns to 'sock_filter_ext' insns */
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+		      struct sock_filter_ext *new_prog, int *p_new_len);
+/* execute extended bpf program */
+unsigned int sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn);
+#define
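A minimal sketch of the jump-threaded dispatch the changelog contrasts
with a switch statement. The opcode values, the handler set and the
16-slot register file are illustrative, not the patch's actual
interpreter; the computed-goto table is a GCC extension:

  unsigned int run_ext(void *ctx, const struct sock_filter_ext *insn)
  {
          /* one table entry per opcode */
          static const void *jumptable[256] = {
                  [0 ... 255]                 = &&invalid,
                  [BPF_ALU | BPF_ADD | BPF_K] = &&alu_add_k,
                  [BPF_RET]                   = &&do_ret,
          };
          __u64 regs[16] = { 0 };

          goto *jumptable[insn->code];

  alu_add_k:
          regs[insn->a_reg] += insn->imm;
          insn++;
          /* each handler dispatches the next insn itself, so the CPU
           * gets a separate indirect-branch prediction site per opcode
           * instead of one shared tablejump at the top of a switch */
          goto *jumptable[insn->code];
  do_ret:
          return regs[0];         /* the single RET insn returns R0 */
  invalid:
          return 0;
  }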