On 03/11/2014 10:03 PM, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2014 at 10:40 AM, Pavel Emelyanov <xe...@parallels.com> wrote:
>> On 03/10/2014 02:00 AM, Daniel Borkmann wrote:
>>> On 03/09/2014 06:08 PM, Alexei Starovoitov wrote:
>>>> On Sun, Mar 9, 2014 at 5:29 AM, Daniel Borkmann <borkm...@iogearbox.net> wrote:
>>>>> On 03/09/2014 12:15 AM, Alexei Starovoitov wrote:
>>>>>>
>>>>>> Extended BPF extends old BPF in the following ways:
>>>>>> - from 2 to 10 registers
>>>>>>   Original BPF has two registers (A and X) and hidden frame pointer.
>>>>>>   Extended BPF has ten registers and read-only frame pointer.
>>>>>> - from 32-bit registers to 64-bit registers
>>>>>>   semantics of old 32-bit ALU operations are preserved via 32-bit
>>>>>>   subregisters
>>>>>> - if (cond) jump_true; else jump_false;
>>>>>>   old BPF insns are replaced with:
>>>>>>   if (cond) jump_true; /* else fallthrough */
>>>>>> - adds signed > and >= insns
>>>>>> - 16 4-byte stack slots for register spill-fill replaced with
>>>>>>   up to 512 bytes of multi-use stack space
>>>>>> - introduces bpf_call insn and register passing convention for zero
>>>>>>   overhead calls from/to other kernel functions (not part of this
>>>>>>   patch)
>>>>>> - adds arithmetic right shift insn
>>>>>> - adds swab32/swab64 insns
>>>>>> - adds atomic_add insn
>>>>>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>>>>>
>>>>>> Extended BPF is designed to be JITed with one to one mapping, which
>>>>>> allows GCC/LLVM backends to generate optimized BPF code that performs
>>>>>> almost as fast as natively compiled code
>>>>>>
>>>>>> sk_convert_filter() remaps old style insns into extended:
>>>>>> 'sock_filter' instructions are remapped on the fly to
>>>>>> 'sock_filter_ext' extended instructions when
>>>>>> sysctl net.core.bpf_ext_enable=1
>>>>>>
>>>>>> Old filter comes through sk_attach_filter() or
>>>>>> sk_unattached_filter_create()
>>>>>>     if (bpf_ext_enable) {
>>>>>>         convert to new
>>>>>>         sk_chk_filter() - check old bpf
>>>>>>         use sk_run_filter_ext() - new interpreter
>>>>>>     } else {
>>>>>>         sk_chk_filter() - check old bpf
>>>>>>         if (bpf_jit_enable)
>>>>>>             use old jit
>>>>>>         else
>>>>>>             use sk_run_filter() - old interpreter
>>>>>>     }
>>>>>>
>>>>>> sk_run_filter_ext() interpreter is noticeably faster
>>>>>> than sk_run_filter() for two reasons:
>>>>>>
>>>>>> 1.fall-through jumps
>>>>>>   Old BPF jump instructions are forced to go either 'true' or 'false'
>>>>>>   branch which causes branch-miss penalty.
>>>>>>   Extended BPF jump instructions have one branch and fall-through,
>>>>>>   which fit CPU branch predictor logic better.
>>>>>>   'perf stat' shows drastic difference for branch-misses.
>>>>>>
>>>>>> 2.jump-threaded implementation of interpreter vs switch statement
>>>>>>   Instead of single tablejump at the top of 'switch' statement, GCC will
>>>>>>   generate multiple tablejump instructions, which helps CPU branch
>>>>>>   predictor
>>>>>>
>>>>>> Performance of two BPF filters generated by libpcap was measured
>>>>>> on x86_64, i386 and arm32.
>>>>>>
>>>>>> fprog #1 is taken from Documentation/networking/filter.txt:
>>>>>> tcpdump -i eth0 port 22 -dd
>>>>>>
>>>>>> fprog #2 is taken from 'man tcpdump':
>>>>>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>>>>>>    ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>>>>>
>>>>>> Other libpcap programs have similar performance differences.
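
For readers following the thread, point 2 above (jump-threaded dispatch instead of a single switch) can be sketched in plain C roughly as follows. This is only a toy model under stated assumptions, not code from the patch: the opcodes, `struct toy_insn` and `toy_run()` are made-up names, and the sketch relies on the GCC/Clang "labels as values" extension. The `TOY_JEQ_K` handler also shows the "one branch plus fall-through" jump shape from point 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy opcodes, for illustration only; not the patch's encoding. */
enum toy_op { TOY_ADD_K, TOY_MOV_K, TOY_JEQ_K, TOY_RET };

struct toy_insn {
	uint8_t op;	/* one of enum toy_op */
	int16_t off;	/* relative jump offset, used by TOY_JEQ_K only */
	int32_t imm;	/* immediate operand */
};

/*
 * Jump-threaded dispatch: every handler ends with its own indirect jump
 * instead of branching back to one switch at the top of a loop.
 * The caller must pass a program that terminates with TOY_RET.
 */
int64_t toy_run(const struct toy_insn *insn)
{
	static const void *const jumptable[] = {
		[TOY_ADD_K] = &&do_add_k,
		[TOY_MOV_K] = &&do_mov_k,
		[TOY_JEQ_K] = &&do_jeq_k,
		[TOY_RET]   = &&do_ret,
	};
	int64_t acc = 0;

	assert(insn != NULL);

#define DISPATCH() goto *jumptable[insn->op]

	DISPATCH();

do_add_k:
	acc += insn->imm;
	insn++;
	DISPATCH();
do_mov_k:
	acc = insn->imm;
	insn++;
	DISPATCH();
do_jeq_k:
	/* One taken branch; otherwise fall through to the next insn. */
	if (acc == insn->imm)
		insn += insn->off;
	insn++;
	DISPATCH();
do_ret:
	return acc;
#undef DISPATCH
}
```

Each handler carries its own indirect jump, so the CPU can keep separate prediction history per opcode rather than sharing one jump site for the whole interpreter.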
>>>>>>
>>>>>> Raw performance data from BPF micro-benchmark:
>>>>>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>>>>>> time in nsec per call, smaller is better
>>>>>> --x86_64--
>>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF         90        101         192        202
>>>>>> ext BPF         31         71          47         97
>>>>>> old BPF jit     12         34          17         44
>>>>>> ext BPF jit    TBD
>>>>>>
>>>>>> --i386--
>>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF        107        136         227        252
>>>>>> ext BPF         40        119          69        172
>>>>>>
>>>>>> --arm32--
>>>>>>              fprog #1   fprog #1    fprog #2   fprog #2
>>>>>>              cache-hit  cache-miss  cache-hit  cache-miss
>>>>>> old BPF        202        300         475        540
>>>>>> ext BPF        180        270         330        470
>>>>>> old BPF jit     26        182          37        202
>>>>>> new BPF jit    TBD
>>>>>>
>>>>>> Tested with trinity BPF fuzzer
>>>>>>
>>>>>> Future work:
>>>>>>
>>>>>> 0. add bpf/ebpf testsuite to tools/testing/selftests/net/bpf
>>>>>>
>>>>>> 1. add extended BPF JIT for x86_64
>>>>>>
>>>>>> 2. add inband old/new demux and extended BPF verifier, so that new
>>>>>>    programs can be loaded through old sk_attach_filter() and
>>>>>>    sk_unattached_filter_create() interfaces
>>>>>>
>>>>>> 3. tracing filters systemtap-like with extended BPF
>>>>>>
>>>>>> 4. OVS with extended BPF
>>>>>>
>>>>>> 5. nftables with extended BPF
>>>>>>
>>>>>> Signed-off-by: Alexei Starovoitov <a...@plumgrid.com>
>>>>>> Acked-by: Hagen Paul Pfeifer <ha...@jauu.net>
>>>>>> Reviewed-by: Daniel Borkmann <dbork...@redhat.com>
>>>>>
>>>>>
>>>>> One more question or possible issue that came through my mind: When
>>>>> someone attaches a socket filter from user space, and bpf_ext_enable=1
>>>>> then the old filter will transparently be converted to the new
>>>>> representation. If then user space (e.g. through checkpoint restore)
>>>>> will issue a sk_get_filter() and thus we're calling sk_decode_filter()
>>>>> on sk->sk_filter and, therefore, try to decode what we stored in
>>>>> insns_ext[] with the assumption we still have the old code. Would that
>>>>> actually crash (or leak memory, or just return garbage), as we access
>>>>> decodes[] array with filt->code? Would be great if you could double-check.
>>>>
>>>> ohh. yes. missed that.
>>>> when bpf_ext_enable=1 I think it's cleaner to return ebpf filter.
>>>> This way the user space can see how old bpf filter was converted.
>>>>
>>>> Of course we can allocate extra memory and keep original bpf code there
>>>> just to return it via sk_get_filter(), but that seems overkill.
>>>
>>> Cc'ing Pavel for a8fc92778080 ("sk-filter: Add ability to get socket
>>> filter program (v2)").
>>>
>>> I think the issue can be that when applications could get migrated
>>> from one machine to another and their kernel won't support ebpf yet,
>>> then filter could not get loaded this way as it's expected to return
>>> what the user loaded. The trade-off, however, is that the original
>>> BPF code needs to be stored as well. :(
>>
>> Sorry if I miss the point, but isn't the original filter kept on socket?
>> The sk_attach_filter() does so, then calls __sk_prepare_filter, which
>> in turn calls bpf_jit_compile(), and the latter two keep the insns in place.
>
> Yes. in V8/V9 series original filter is kept on socket.

Ah, I see :)

> and your crtools/test/zdtm/live/static/socket_filter.c test passes.
> Let me know if there are any other tests I can try.

No, that's the only test we need wrt sk-filter.
Thanks for keeping an eye on it :)
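
To make the outcome of this exchange concrete, here is a minimal user-space sketch of the behaviour agreed on above: the program loaded by user space stays attached in its original form, so a later "get filter" returns exactly what was loaded regardless of which engine runs it. Everything prefixed `toy_` and the `enum engine` values are hypothetical names for illustration; only the field layout of `struct old_insn` mirrors the classic `struct sock_filter`. The conversion step itself is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same fields as the classic BPF struct sock_filter. */
struct old_insn {
	uint16_t code;
	uint8_t jt;
	uint8_t jf;
	uint32_t k;
};

/* Hypothetical labels for the three paths in the quoted pseudocode. */
enum engine { ENGINE_OLD_INTERP, ENGINE_OLD_JIT, ENGINE_EXT_INTERP };

struct toy_filter {
	enum engine engine;
	size_t len;
	struct old_insn *orig;	/* copy of what user space loaded */
};

struct toy_filter *toy_attach(const struct old_insn *prog, size_t len,
			      bool bpf_ext_enable, bool bpf_jit_enable)
{
	struct toy_filter *f;

	if (!prog || len == 0)
		return NULL;
	f = malloc(sizeof(*f));
	if (!f)
		return NULL;
	f->orig = malloc(len * sizeof(*prog));
	if (!f->orig) {
		free(f);
		return NULL;
	}
	memcpy(f->orig, prog, len * sizeof(*prog));
	f->len = len;

	if (bpf_ext_enable)
		f->engine = ENGINE_EXT_INTERP;	/* converted form would be built here */
	else
		f->engine = bpf_jit_enable ? ENGINE_OLD_JIT : ENGINE_OLD_INTERP;
	return f;
}

/* Copies out at most max insns, always from the original program. */
size_t toy_get_filter(const struct toy_filter *f, struct old_insn *out,
		      size_t max)
{
	size_t n;

	assert(f != NULL && out != NULL);
	n = f->len < max ? f->len : max;
	memcpy(out, f->orig, n * sizeof(*out));
	return n;
}

void toy_detach(struct toy_filter *f)
{
	if (f) {
		free(f->orig);
		free(f);
	}
}
```

The cost is the trade-off Daniel mentions: the original instructions are stored in addition to whatever representation the selected engine uses.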