Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Daniel Borkmann

On 03/04/2014 06:53 PM, Alexei Starovoitov wrote:

On Tue, Mar 4, 2014 at 6:28 AM, Hagen Paul Pfeifer  wrote:

If all issues raised by Daniel are addressed:

Acked-by: Hagen Paul Pfeifer 


Thanks!


But ...


Future work:

0. seccomp

1. add extended BPF JIT for x86_64

2. add inband old/new demux and extended BPF verifier, so that new programs
   can be loaded through old sk_attach_filter() and 
sk_unattached_filter_create()
   interfaces

3. tracing filters systemtap-like with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF


... this is shit (not your fault). (Jitted) BPF evolved in a direction
which is just not the right way to do it. You try to fix things, work around
architectural shortcomings of BPF, performance issues, and so on.

The right direction is to write a new general-purpose in-kernel interpreter
from scratch. Compatibility layers should provide a compatible API for BPF and


I think ebpf would actually have the potential to be *the* general-purpose
in-kernel interpreter (if we undertake all this effort of migration), as it's
already designed for a more generic context than the traditional interpreter,
which is restricted to an skb (or NULL) context.


seccomp. You have the knowledge to do exactly this, you nearly already did
this - you should start this undertaking!


This insn set evolved over a few years.
Initially we had an nft-like high-level state machine, but it wasn't fast;
then kprobe-like pure x86_64, which was fast but very hard to analyze
from a safety point of view; then a reduced x86-64 insn set; and finally ebpf.
I think any brand new instruction set will have a steep learning curve,
just because it's all new. ebpf tries to reuse as much as possible: the
opcode encoding is the same, the instruction size is fixed at 8 bytes, and
so on. Yeah, these restrictions make a few things not 100% optimal, but imo
a common look and feel is more important.
What ebpf already has should be enough to do all of the above 'future work'.
Built-in JIT-ability of ebpf is the key to performance.
The ability to call some kernel functions from ebpf makes it ultimately
extensible. Socket filters and seccomp don't use this feature yet, but
tracing filters will.

Regards,
Alexei



Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Daniel Borkmann

On 03/04/2014 06:09 PM, Alexei Starovoitov wrote:

On Tue, Mar 4, 2014 at 1:59 AM, Daniel Borkmann  wrote:

...

Hmm, so the case statement is about BPF_RET | BPF_A and BPF_RET | BPF_K
but BPF_RET | BPF_X is not mentioned. However, in BPF_SRC(fp->code)
selection you fall back to BPF_X if it doesn't equal BPF_K? Is that
correct? And, you probably also need to handle BPF_RET | BPF_X ?


:) that design choice of original BPF always puzzled me.
The BPF_A macro is only used in one insn, BPF_RET + BPF_A, while all other
insns use BPF_K and BPF_X. And though the comment in uapi/filter.h says
"ret - BPF_K and BPF_X also apply", this is not true, since sk_chk_filter()
only allows ret+a and ret+k.
libpcap is equally confused: it never generates ret+x, but has a few places
in the code that can recognize it. I guess that's an artifact of the distant
past.


Good point, ret+a and ret+k are the main users anyway, though we could fix
that limitation actually. ;)
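
For illustration, here is a minimal sketch of the two classic return forms
that sk_chk_filter() actually accepts, spelled out with the uapi macros
(the 0x40000 "accept" length is just an arbitrary example value):

  #include <linux/filter.h>

  /* The two return forms sk_chk_filter() accepts; BPF_RET | BPF_X exists
   * in the opcode space but is rejected by the checker, as noted above.
   */
  static const struct sock_filter ret_forms[] = {
          BPF_STMT(BPF_RET | BPF_K, 0x40000),     /* return constant K    */
          BPF_STMT(BPF_RET | BPF_A, 0),           /* return accumulator A */
  };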


ebpf has only one RET insn, which takes register R0 and returns it.
That is similar to a real CPU 'ret' insn and is done to make ebpf easier
to generate from a gcc/llvm point of view.
The ebpf JIT converts 'ret' into 'leave; ret' on x86_64.

So the original ret+k and ret+a are converted into 'mov r0, [a or k]; ret r0'.
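
For illustration, a rough sketch of the kind of remapping sk_convert_filter()
has to do for returns. The struct ext_insn layout, the EXT_* opcodes and the
REG_* numbers below are made up for the example - they are not the patch's
actual definitions:

  #include <linux/types.h>
  #include <linux/filter.h>

  struct ext_insn { __u8 code; __u8 dst:4, src:4; __s16 off; __s32 imm; };
  enum ext_op  { EXT_MOV_K, EXT_MOV_REG, EXT_RET };
  enum ext_reg { REG_R0 = 0, REG_A = 6 /* wherever classic A is kept */ };

  /* classic BPF_RET|BPF_K / BPF_RET|BPF_A -> 'mov r0, <a or k>; ret r0' */
  static int emit_ret(const struct sock_filter *fp, struct ext_insn *out)
  {
          if (BPF_RVAL(fp->code) == BPF_K)
                  out[0] = (struct ext_insn){ .code = EXT_MOV_K,
                                              .dst = REG_R0, .imm = fp->k };
          else    /* BPF_A */
                  out[0] = (struct ext_insn){ .code = EXT_MOV_REG,
                                              .dst = REG_R0, .src = REG_A };
          out[1] = (struct ext_insn){ .code = EXT_RET }; /* RET returns R0 */
          return 2;      /* one classic return becomes two extended insns */
  }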

btw, if there is interest I can put an ebpf test suite into tools/net/


Yes, please. Would be great if you can place the test suite under:

  tools/testing/selftests/net/bpf/

I believe some stuff there could get its own folder e.g. "packet"
for PF_PACKET test cases etc, so that we can easily arrange them.


Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Alexei Starovoitov
On Tue, Mar 4, 2014 at 6:28 AM, Hagen Paul Pfeifer  wrote:
> If all issues raised by Daniel are addressed:
>
> Acked-by: Hagen Paul Pfeifer 

Thanks!

> But ...
>
>>Future work:
>>
>>0. seccomp
>>
>>1. add extended BPF JIT for x86_64
>>
>>2. add inband old/new demux and extended BPF verifier, so that new programs
>>   can be loaded through old sk_attach_filter() and 
>> sk_unattached_filter_create()
>>   interfaces
>>
>>3. tracing filters systemtap-like with extended BPF
>>
>>4. OVS with extended BPF
>>
>>5. nftables with extended BPF
>
> ... this is shit (not your fault). (Jitted) BPF evolved in a direction
> which is just not the right way to do it. You try to fix things, work around
> architectural shortcomings of BPF, performance issues, and so on.
>
> The right direction is to write a new general-purpose in-kernel interpreter
> from scratch. Compatibility layers should provide a compatible API for BPF and
> seccomp. You have the knowledge to do exactly this, you nearly already did
> this - you should start this undertaking!

This insn set evolved over a few years.
Initially we had an nft-like high-level state machine, but it wasn't fast;
then kprobe-like pure x86_64, which was fast but very hard to analyze
from a safety point of view; then a reduced x86-64 insn set; and finally ebpf.
I think any brand new instruction set will have a steep learning curve,
just because it's all new. ebpf tries to reuse as much as possible: the
opcode encoding is the same, the instruction size is fixed at 8 bytes, and
so on. Yeah, these restrictions make a few things not 100% optimal, but imo
a common look and feel is more important.
What ebpf already has should be enough to do all of the above 'future work'.
Built-in JIT-ability of ebpf is the key to performance.
The ability to call some kernel functions from ebpf makes it ultimately
extensible. Socket filters and seccomp don't use this feature yet, but
tracing filters will.
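
To make the "fixed at 8 bytes" point concrete, here is one plausible layout
for such an insn (purely an illustration - the field names are assumptions,
not necessarily what the patch defines):

  #include <linux/types.h>

  struct ebpf_insn_sketch {
          __u8    code;           /* opcode, classic-BPF-style encoding   */
          __u8    dst_reg:4;      /* 4 bits are enough for 10 registers   */
          __u8    src_reg:4;
          __s16   off;            /* signed offset (jumps, memory access) */
          __s32   imm;            /* signed 32-bit immediate              */
  };                              /* sizeof() == 8 */

Presumably the zero-overhead calls mentioned in the changelog follow from the
same idea: with arguments already sitting in fixed registers, a bpf_call insn
only has to map those registers onto the native calling convention in the JIT.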

Regards,
Alexei

>
> --
> Hagen Paul Pfeifer


Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Alexei Starovoitov
On Tue, Mar 4, 2014 at 1:59 AM, Daniel Borkmann  wrote:
> On 03/04/2014 06:18 AM, Alexei Starovoitov wrote:
>>
>> Extended BPF extends old BPF in the following ways:
>> - from 2 to 10 registers
>>Original BPF has two registers (A and X) and hidden frame pointer.
>>Extended BPF has ten registers and read-only frame pointer.
>> - from 32-bit registers to 64-bit registers
>>semantics of old 32-bit ALU operations are preserved via 32-bit
>>subregisters
>> - if (cond) jump_true; else jump_false;
>>old BPF insns are replaced with:
>>if (cond) jump_true; /* else fallthrough */
>> - adds signed > and >= insns
>> - 16 4-byte stack slots for register spill-fill replaced with
>>up to 512 bytes of multi-use stack space
>> - introduces bpf_call insn and register passing convention for zero
>>overhead calls from/to other kernel functions (not part of this patch)
>> - adds arithmetic right shift insn
>> - adds swab32/swab64 insns
>> - adds atomic_add insn
>> - old tax/txa insns are replaced with 'mov dst,src' insn
>>
>> Extended BPF is designed to be JITed with one to one mapping, which
>> allows GCC/LLVM backends to generate optimized BPF code that performs
>> almost as fast as natively compiled code
>>
>> sk_convert_filter() remaps old style insns into extended:
>> 'sock_filter' instructions are remapped on the fly to
>> 'sock_filter_ext' extended instructions when
>> sysctl net.core.bpf_ext_enable=1
>>
>> Old filter comes through sk_attach_filter() or
>> sk_unattached_filter_create()
>>   if (bpf_ext_enable) {
>>  convert to new
>>  sk_chk_filter() - check old bpf
>>  use sk_run_filter_ext() - new interpreter
>>   } else {
>>  sk_chk_filter() - check old bpf
>>  if (bpf_jit_enable)
>>  use old jit
>>  else
>>  use sk_run_filter() - old interpreter
>>   }
>>
>> sk_run_filter_ext() interpreter is noticeably faster
>> than sk_run_filter() for two reasons:
>>
>> 1.fall-through jumps
>>Old BPF jump instructions are forced to go either 'true' or 'false'
>>branch which causes branch-miss penalty.
>>Extended BPF jump instructions have one branch and fall-through,
>>which fit CPU branch predictor logic better.
>>'perf stat' shows drastic difference for branch-misses.
>>
>> 2.jump-threaded implementation of interpreter vs switch statement
>>Instead of single tablejump at the top of 'switch' statement, GCC will
>>generate multiple tablejump instructions, which helps CPU branch
>> predictor
>>
>> Performance of two BPF filters generated by libpcap was measured
>> on x86_64, i386 and arm32.
>>
>> fprog #1 is taken from Documentation/networking/filter.txt:
>> tcpdump -i eth0 port 22 -dd
>>
>> fprog #2 is taken from 'man tcpdump':
>> tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
>> ((tcp[12]&0xf0)>>2)) != 0)' -dd
>>
>> Other libpcap programs have similar performance differences.
>>
>> Raw performance data from BPF micro-benchmark:
>> SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
>> time in nsec per call, smaller is better
>> --x86_64--
>>              fprog #1  fprog #1   fprog #2  fprog #2
>>              cache-hit cache-miss cache-hit cache-miss
>> old BPF         90       101        192       202
>> ext BPF         31        71         47        97
>> old BPF jit     12        34         17        44
>> ext BPF jit    TBD
>>
>> --i386--
>>              fprog #1  fprog #1   fprog #2  fprog #2
>>              cache-hit cache-miss cache-hit cache-miss
>> old BPF        107       136        227       252
>> ext BPF         40       119         69       172
>>
>> --arm32--
>>              fprog #1  fprog #1   fprog #2  fprog #2
>>              cache-hit cache-miss cache-hit cache-miss
>> old BPF        202       300        475       540
>> ext BPF        139       270        296       470
>> old BPF jit     26       182         37       202
>> new BPF jit    TBD
>>
>> Tested with trinify BPF fuzzer
>>
>> Future work:
>>
>> 0. seccomp
>>
>> 1. add extended BPF JIT for x86_64
>>
>> 2. add inband old/new demux and extended BPF verifier, so that new
>> programs
>> can be loaded through old sk_attach_filter() and
>> sk_unattached_filter_create()
>> interfaces
>>
>> 3. tracing filters systemtap-like with extended BPF
>>
>> 4. OVS with extended BPF
>>
>> 5. nftables with extended BPF
>>
>> Signed-off-by: Alexei Starovoitov 
>
>
> Looks great, imho, some comments/questions inline:
>
> Nit: subject line of your patches should be, e.g.
>
>  "filter: add Extended BPF interpreter and converter"
>  "doc: filter: add Extended BPF documentation"
>  ...
>
> so first "<subsystem>: <summary phrase>".

sure. Will send v5 :)

>> ---
>>   include/linux/filter.h  |8 +-
>>   include/linux/netdevice.h   |1 +
>>   include/uapi/linux/filter.h |   34 +-
>>   net/core/filter.c   |  802
>> ++-
>>   net/core/sysctl_net_core.c  |7 +
>>   5 files changed, 830 insertions(+), 22 deletions(-)
>>
>> diff --git a/include/linux/filter.h 

Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Hagen Paul Pfeifer
If all issues raised by Daniel are addressed:

Acked-by: Hagen Paul Pfeifer <ha...@jauu.net>

But ...

>Future work:
>
>0. seccomp
>
>1. add extended BPF JIT for x86_64
>
>2. add inband old/new demux and extended BPF verifier, so that new programs
>   can be loaded through old sk_attach_filter() and 
> sk_unattached_filter_create()
>   interfaces
>
>3. tracing filters systemtap-like with extended BPF
>
>4. OVS with extended BPF
>
>5. nftables with extended BPF

... this is shit (not your fault). (Jitted) BPF evolved in a direction
which is just not the right way to do it. You try to fix things, work around
architectural shortcomings of BPF, performance issues, and so on.

The right direction is to write a new general-purpose in-kernel interpreter
from scratch. Compatibility layers should provide a compatible API for BPF and
seccomp. You have the knowledge to do exactly this, you nearly already did
this - you should start this undertaking!

-- 
Hagen Paul Pfeifer


Re: [PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-04 Thread Daniel Borkmann

On 03/04/2014 06:18 AM, Alexei Starovoitov wrote:

Extended BPF extends old BPF in the following ways:
- from 2 to 10 registers
   Original BPF has two registers (A and X) and hidden frame pointer.
   Extended BPF has ten registers and read-only frame pointer.
- from 32-bit registers to 64-bit registers
   semantics of old 32-bit ALU operations are preserved via 32-bit
   subregisters
- if (cond) jump_true; else jump_false;
   old BPF insns are replaced with:
   if (cond) jump_true; /* else fallthrough */
- adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced with
   up to 512 bytes of multi-use stack space
- introduces bpf_call insn and register passing convention for zero
   overhead calls from/to other kernel functions (not part of this patch)
- adds arithmetic right shift insn
- adds swab32/swab64 insns
- adds atomic_add insn
- old tax/txa insns are replaced with 'mov dst,src' insn

Extended BPF is designed to be JITed with one to one mapping, which
allows GCC/LLVM backends to generate optimized BPF code that performs
almost as fast as natively compiled code

sk_convert_filter() remaps old style insns into extended:
'sock_filter' instructions are remapped on the fly to
'sock_filter_ext' extended instructions when
sysctl net.core.bpf_ext_enable=1

Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
  if (bpf_ext_enable) {
 convert to new
 sk_chk_filter() - check old bpf
 use sk_run_filter_ext() - new interpreter
  } else {
 sk_chk_filter() - check old bpf
 if (bpf_jit_enable)
 use old jit
 else
 use sk_run_filter() - old interpreter
  }

sk_run_filter_ext() interpreter is noticeably faster
than sk_run_filter() for two reasons:

1.fall-through jumps
   Old BPF jump instructions are forced to go either 'true' or 'false'
   branch which causes branch-miss penalty.
   Extended BPF jump instructions have one branch and fall-through,
   which fit CPU branch predictor logic better.
   'perf stat' shows drastic difference for branch-misses.

2.jump-threaded implementation of interpreter vs switch statement
   Instead of single tablejump at the top of 'switch' statement, GCC will
   generate multiple tablejump instructions, which helps CPU branch predictor

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from BPF micro-benchmark:
SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
time in nsec per call, smaller is better
--x86_64--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF         90       101        192       202
ext BPF         31        71         47        97
old BPF jit     12        34         17        44
ext BPF jit    TBD

--i386--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF        107       136        227       252
ext BPF         40       119         69       172

--arm32--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF        202       300        475       540
ext BPF        139       270        296       470
old BPF jit     26       182         37       202
new BPF jit    TBD

Tested with trinify BPF fuzzer

Future work:

0. seccomp

1. add extended BPF JIT for x86_64

2. add inband old/new demux and extended BPF verifier, so that new programs
can be loaded through old sk_attach_filter() and 
sk_unattached_filter_create()
interfaces

3. tracing filters systemtap-like with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF

Signed-off-by: Alexei Starovoitov 


Looks great, imho, some comments/questions inline:

Nit: subject line of your patches should be, e.g.

 "filter: add Extended BPF interpreter and converter"
 "doc: filter: add Extended BPF documentation"
 ...

so first "<subsystem>: <summary phrase>".


---
  include/linux/filter.h  |8 +-
  include/linux/netdevice.h   |1 +
  include/uapi/linux/filter.h |   34 +-
  net/core/filter.c   |  802 ++-
  net/core/sysctl_net_core.c  |7 +
  5 files changed, 830 insertions(+), 22 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..0e84ff6e991b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -52,7 +52,13 @@ extern int sk_detach_filter(struct sock *sk);
  extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
  extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, 
unsigned len);
  extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter 
*to);
+/* function remaps 

[PATCH v4 net-next 1/3] Extended BPF interpreter and converter

2014-03-03 Thread Alexei Starovoitov
Extended BPF extends old BPF in the following ways:
- from 2 to 10 registers
  Original BPF has two registers (A and X) and hidden frame pointer.
  Extended BPF has ten registers and read-only frame pointer.
- from 32-bit registers to 64-bit registers
  semantics of old 32-bit ALU operations are preserved via 32-bit
  subregisters
- if (cond) jump_true; else jump_false;
  old BPF insns are replaced with:
  if (cond) jump_true; /* else fallthrough */
- adds signed > and >= insns
- 16 4-byte stack slots for register spill-fill replaced with
  up to 512 bytes of multi-use stack space
- introduces bpf_call insn and register passing convention for zero
  overhead calls from/to other kernel functions (not part of this patch)
- adds arithmetic right shift insn
- adds swab32/swab64 insns
- adds atomic_add insn
- old tax/txa insns are replaced with 'mov dst,src' insn

Extended BPF is designed to be JITed with one to one mapping, which
allows GCC/LLVM backends to generate optimized BPF code that performs
almost as fast as natively compiled code

sk_convert_filter() remaps old style insns into extended:
'sock_filter' instructions are remapped on the fly to
'sock_filter_ext' extended instructions when
sysctl net.core.bpf_ext_enable=1

Old filter comes through sk_attach_filter() or sk_unattached_filter_create()
 if (bpf_ext_enable) {
convert to new
sk_chk_filter() - check old bpf
use sk_run_filter_ext() - new interpreter
 } else {
sk_chk_filter() - check old bpf
if (bpf_jit_enable)
use old jit
else
use sk_run_filter() - old interpreter
 }

sk_run_filter_ext() interpreter is noticeably faster
than sk_run_filter() for two reasons:

1.fall-through jumps
  Old BPF jump instructions are forced to go either 'true' or 'false'
  branch which causes branch-miss penalty.
  Extended BPF jump instructions have one branch and fall-through,
  which fit CPU branch predictor logic better.
  'perf stat' shows drastic difference for branch-misses.

2.jump-threaded implementation of interpreter vs switch statement
  Instead of single tablejump at the top of 'switch' statement, GCC will
  generate multiple tablejump instructions, which helps CPU branch predictor
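
To make points 1 and 2 concrete, here is a toy sketch of a jump-threaded
interpreter core using GCC's computed goto. This is illustration only: struct
insn, the EXT_* opcodes and the register file are placeholders, not this
patch's definitions, and it assumes a pre-verified program that only uses
these opcodes and registers 0..10.

  #include <stdint.h>

  struct insn { uint8_t code; uint8_t dst, src; int16_t off; int32_t imm; };
  enum { EXT_MOV_K, EXT_ALU_ADD, EXT_JGT, EXT_RET };

  static uint64_t run(const struct insn *insn)
  {
          /* one indirect jump per handler instead of one central switch */
          static const void *jt[] = {
                  [EXT_MOV_K]   = &&do_mov_k,
                  [EXT_ALU_ADD] = &&do_add,
                  [EXT_JGT]     = &&do_jgt,
                  [EXT_RET]     = &&do_ret,
          };
          uint64_t regs[11] = { 0 };

  #define NEXT() goto *jt[(++insn)->code]
          goto *jt[insn->code];           /* initial dispatch */

  do_mov_k:
          regs[insn->dst] = insn->imm;
          NEXT();
  do_add:
          regs[insn->dst] += regs[insn->src];
          NEXT();
  do_jgt:
          if (regs[insn->dst] > regs[insn->src])
                  insn += insn->off;      /* taken: explicit branch target */
          NEXT();                         /* not taken: plain fall-through */
  do_ret:
          return regs[0];
  #undef NEXT
  }

With one indirect jump per handler, the CPU's indirect-branch predictor gets
a separate history per opcode site, which is what the "multiple tablejump
instructions" remark above refers to; the do_jgt handler also shows the
single-branch-plus-fall-through shape of the new conditional jumps.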

Performance of two BPF filters generated by libpcap was measured
on x86_64, i386 and arm32.

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Other libpcap programs have similar performance differences.

Raw performance data from BPF micro-benchmark:
SK_RUN_FILTER on same SKB (cache-hit) or 10k SKBs (cache-miss)
time in nsec per call, smaller is better
--x86_64--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF         90       101        192       202
ext BPF         31        71         47        97
old BPF jit     12        34         17        44
ext BPF jit    TBD

--i386--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF        107       136        227       252
ext BPF         40       119         69       172

--arm32--
             fprog #1  fprog #1   fprog #2  fprog #2
             cache-hit cache-miss cache-hit cache-miss
old BPF        202       300        475       540
ext BPF        139       270        296       470
old BPF jit     26       182         37       202
new BPF jit    TBD
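
For context, the measurement described above boils down to something like the
following userspace skeleton. This is a sketch only: run_filter() is a stub
standing in for SK_RUN_FILTER(), and the numbers in the tables were of course
taken in-kernel, not with this program.

  #include <stdio.h>
  #include <time.h>

  #define LOOPS   10000000ULL
  #define NSKB    10000           /* 1 = cache-hit case, 10000 = cache-miss case */

  static char pkts[NSKB][256];    /* stand-ins for pre-built SKBs */

  static int run_filter(const char *pkt)  /* stub for the filter under test */
  {
          return pkt[0] != 0;
  }

  static unsigned long long now_ns(void)
  {
          struct timespec ts;

          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
  }

  int main(void)
  {
          unsigned long long i, sink = 0, t0 = now_ns();

          for (i = 0; i < LOOPS; i++)
                  sink += run_filter(pkts[i % NSKB]); /* rotate buffers to defeat the cache */

          printf("%llu nsec/call (sink=%llu)\n", (now_ns() - t0) / LOOPS, sink);
          return 0;
  }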

Tested with trinify BPF fuzzer

Future work:

0. seccomp

1. add extended BPF JIT for x86_64

2. add inband old/new demux and extended BPF verifier, so that new programs
   can be loaded through old sk_attach_filter() and 
sk_unattached_filter_create()
   interfaces

3. tracing filters systemtap-like with extended BPF

4. OVS with extended BPF

5. nftables with extended BPF

Signed-off-by: Alexei Starovoitov <a...@plumgrid.com>
---
 include/linux/filter.h  |8 +-
 include/linux/netdevice.h   |1 +
 include/uapi/linux/filter.h |   34 +-
 net/core/filter.c   |  802 ++-
 net/core/sysctl_net_core.c  |7 +
 5 files changed, 830 insertions(+), 22 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index e568c8ef896b..0e84ff6e991b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -52,7 +52,13 @@ extern int sk_detach_filter(struct sock *sk);
 extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
 extern int sk_get_filter(struct sock *sk, struct sock_filter __user *filter, 
unsigned len);
 extern void sk_decode_filter(struct sock_filter *filt, struct sock_filter *to);
+/* function remaps 'sock_filter' insns to 'sock_filter_ext' insns */
+int sk_convert_filter(struct sock_filter *old_prog, int len,
+ struct sock_filter_ext *new_prog, int *p_new_len);
+/* execute extended bpf program */
+unsigned int sk_run_filter_ext(void *ctx, const struct sock_filter_ext *insn);
 
+#define 
