Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-18 Thread Alexei Starovoitov
On Tue, Apr 18, 2017 at 11:47:59AM +0200, Johannes Berg wrote:
> > I guess any driver that would care about XDP performance would either
> > implement in-driver XDP or make sure that skb_linearize() doesn't
> > happen in generic XDP by doing build_skb() with the whole packet.
> > The driver can be smart and avoid doing copy-break if generic XDP is
> > on.
> 
> Indeed. I think this is less about performance than ease of
> experimentation.

yes. I'm bringing it up only to hopefully find cases where this
generic XDP can be used for the real data path too (and not only for experimentation).
Like if it can do 6Mpps whereas in-driver XDP will do 20Mpps (when implemented),
for some folks it may be good enough already to start using it.
Which will put more pressure on NIC vendors to implement it.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-18 Thread Alexei Starovoitov
On Tue, Apr 18, 2017 at 02:46:25PM -0400, David Miller wrote:
> From: Alexei Starovoitov 
> Date: Mon, 17 Apr 2017 16:04:38 -0700
> 
> > On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> >> From: Jesper Dangaard Brouer 
> >> Date: Sun, 16 Apr 2017 22:26:01 +0200
> >> 
> >> > The bpf tail-call use-case is a very good example of why the
> >> > verifier cannot deduce the needed HEADROOM upfront.
> >> 
> >> This brings up a very interesting question for me.
> >> 
> >> I notice that tail calls are implemented by JITs largely by skipping
> >> over the prologue of that destination program.
> >> 
> >> However, many JITs preload cached SKB values into fixed registers in
> >> the prologue.  But they only do this if the program being JITed needs
> >> those values.
> >> 
> >> So how can it work properly if a program that does not need the SKB
> >> values tail calls into one that does?
> > 
> > For x86 JIT it's fine, since caching of skb values is not part of the 
> > prologue:
> >   emit_prologue();
> >   if (seen_ld_abs)
> >   emit_load_skb_data_hlen();
> > and tail_call jumps into the next program as:
> >   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
> >   EMIT2(0xFF, 0xE0);/* jmp rax */
> > whereas inside emit_prologue() we have:
> > BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> > 
> > arm64 has similar prologue-skipping code and it's even
> > simpler than x86, since it doesn't try to optimize LD_ABS/IND in assembler
> > and instead calls into bpf_load_pointer() from generated code,
> > so no caching of skb values at all.
> > 
> > s390 jit has partial skipping of prologue, since a bunch
> > of registers are saved/restored during tail_call, and it looks fine
> > to me as well.
> 
> Ok, what about stack usage?
> 
> Currently if I don't see a reference to FP then I elide allocating
> MAX_BPF_STACK stack space.
> 
> What if, with tail calls, some programs need that stack space whilst
> others don't?
> 
> It looks like, for example, JITs like powerpc avoid this issue
> because they allocate the full MAX_BPF_STACK all the time.  That seems
> like overkill to me and bad for cache locality.

For x86 we also always emit a proper stack frame, since optimizing
for leaf functions is very rare. Most eBPF progs have at least
one helper call. Even cBPF often uses SKF_AD_CPU or SKF_AD_RANDOM,
which translate to function calls as well and need a stack frame.
I think stack frames on sparc are much more expensive than on x86
due to the register-window architecture, so there it may make sense
to squeeze out these extra cycles, but it will be rarely exercised
in practice.

I was thinking of teaching the verifier to recognize the required stack
size, so we can JIT with that size instead of 512, but that's mainly
to reduce kernel stack usage. I doubt it will make any performance
difference.

As for sparc and other such archs, it would be great for their JITs to
somehow take advantage of the extra registers these archs have. I think
it will only be possible once we have verifier 2.0 with proper register
liveness and the like, so we can convert spill/fills into register
copies and maybe even run a simple regalloc pass after the verifier.
Crazy talk ;)



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-18 Thread David Miller
From: Alexei Starovoitov 
Date: Mon, 17 Apr 2017 16:04:38 -0700

> On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
>> From: Jesper Dangaard Brouer 
>> Date: Sun, 16 Apr 2017 22:26:01 +0200
>> 
>> > The bpf tail-call use-case is a very good example of why the
>> > verifier cannot deduce the needed HEADROOM upfront.
>> 
>> This brings up a very interesting question for me.
>> 
>> I notice that tail calls are implemented by JITs largely by skipping
>> over the prologue of that destination program.
>> 
>> However, many JITs preload cached SKB values into fixed registers in
>> the prologue.  But they only do this if the program being JITed needs
>> those values.
>> 
>> So how can it work properly if a program that does not need the SKB
>> values tail calls into one that does?
> 
> For x86 JIT it's fine, since caching of skb values is not part of the 
> prologue:
>   emit_prologue();
>   if (seen_ld_abs)
>   emit_load_skb_data_hlen();
> and tail_call jumps into the next program as:
>   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
>   EMIT2(0xFF, 0xE0);/* jmp rax */
> whereas inside emit_prologue() we have:
> BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> 
> arm64 has similar prologue-skipping code and it's even
> simpler than x86, since it doesn't try to optimize LD_ABS/IND in assembler
> and instead calls into bpf_load_pointer() from generated code,
> so no caching of skb values at all.
> 
> s390 jit has partial skipping of prologue, since a bunch
> of registers are saved/restored during tail_call, and it looks fine
> to me as well.

Ok, what about stack usage?

Currently if I don't see a reference to FP then I elide allocating
MAX_BPF_STACK stack space.

What if, with tail calls, some programs need that stack space whilst
others don't?

It looks like, for example, JITs like powerpc avoid this issue
because they allocate the full MAX_BPF_STACK all the time.  That seems
like overkill to me and bad for cache locality.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-18 Thread Johannes Berg
On Fri, 2017-04-14 at 12:41 -0700, Alexei Starovoitov wrote:
> 
> ahh. i thought all drivers do at least copy-break (256) bytes 

We do copy-break, but that's only applicable when there's no non-linear 
data left afterwards :)

> or copy
> get_headlen or build_skb the whole thing.
> Since wireless does eth_hlen, then yeah, skb_linearize() is the only
> way.

It's rather difficult to do this in wifi because we'd need to parse
deep into the packet - at a point where we haven't even parsed the
802.11 header (which has variable length) fully.

So that leaves us to do it at the very end, after conversion to
ethernet format, but then it feels pointless since we might as well let
the rest of the stack do it.

> I guess any driver that would care about XDP performance would either
> implement in-driver XDP or make sure that skb_linearize() doesn't
> happen in generic XDP by doing build_skb() with the whole packet.
> The driver can be smart and avoid doing copy-break if generic XDP is
> on.

Indeed. I think this is less about performance than ease of
experimentation.

johannes


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-17 Thread Daniel Borkmann

On 04/18/2017 01:04 AM, Alexei Starovoitov wrote:

On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:

From: Jesper Dangaard Brouer 
Date: Sun, 16 Apr 2017 22:26:01 +0200


The bpf tail-call use-case is a very good example of why the
verifier cannot deduce the needed HEADROOM upfront.


This brings up a very interesting question for me.

I notice that tail calls are implemented by JITs largely by skipping
over the prologue of that destination program.

However, many JITs preload cached SKB values into fixed registers in
the prologue.  But they only do this if the program being JITed needs
those values.

So how can it work properly if a program that does not need the SKB
values tail calls into one that does?


For x86 JIT it's fine, since caching of skb values is not part of the prologue:
   emit_prologue();
   if (seen_ld_abs)
   emit_load_skb_data_hlen();
and tail_call jumps into the next program as:
   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
   EMIT2(0xFF, 0xE0);/* jmp rax */
whereas inside emit_prologue() we have:
BUILD_BUG_ON(cnt != PROLOGUE_SIZE);

arm64 has similar prologue-skipping code and it's even
simpler than x86, since it doesn't try to optimize LD_ABS/IND in assembler
and instead calls into bpf_load_pointer() from generated code,
so no caching of skb values at all.

s390 jit has partial skipping of prologue, since a bunch
of registers are saved/restored during tail_call, and it looks fine
to me as well.


And ppc64 unwinds/tears down the stack of the prog before
jumping into the other program. Thus, no skipping of the other's
prologue; looks fine, too.


It's very hard to extend test_bpf.ko with tail_calls, since maps need
to be allocated and populated with file descriptors, which is not
feasible to do from a .ko. Instead we need a user-space-based test.
We've started building one in tools/testing/selftests/bpf/test_progs.c;
many more tests need to be added. Thorough testing of tail_calls
is on the todo list.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-17 Thread Alexei Starovoitov
On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> From: Jesper Dangaard Brouer 
> Date: Sun, 16 Apr 2017 22:26:01 +0200
> 
> > The bpf tail-call use-case is a very good example of why the
> > verifier cannot deduce the needed HEADROOM upfront.
> 
> This brings up a very interesting question for me.
> 
> I notice that tail calls are implemented by JITs largely by skipping
> over the prologue of that destination program.
> 
> However, many JITs preload cached SKB values into fixed registers in
> the prologue.  But they only do this if the program being JITed needs
> those values.
> 
> So how can it work properly if a program that does not need the SKB
> values tail calls into one that does?

For x86 JIT it's fine, since caching of skb values is not part of the prologue:
  emit_prologue();
  if (seen_ld_abs)
  emit_load_skb_data_hlen();
and tail_call jumps into the next program as:
  EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE);   /* add rax, prologue_size */
  EMIT2(0xFF, 0xE0);/* jmp rax */
whereas inside emit_prologue() we have:
BUILD_BUG_ON(cnt != PROLOGUE_SIZE);

arm64 has similar prologue-skipping code and it's even
simpler than x86, since it doesn't try to optimize LD_ABS/IND in assembler
and instead calls into bpf_load_pointer() from generated code,
so no caching of skb values at all.

s390 jit has partial skipping of prologue, since a bunch
of registers are saved/restored during tail_call, and it looks fine
to me as well.

It's very hard to extend test_bpf.ko with tail_calls, since maps need
to be allocated and populated with file descriptors, which is not
feasible to do from a .ko. Instead we need a user-space-based test.
We've started building one in tools/testing/selftests/bpf/test_progs.c;
many more tests need to be added. Thorough testing of tail_calls
is on the todo list.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-17 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Sun, 16 Apr 2017 22:26:01 +0200

> The bpf tail-call use-case is a very good example of why the
> verifier cannot deduce the needed HEADROOM upfront.

This brings up a very interesting question for me.

I notice that tail calls are implemented by JITs largely by skipping
over the prologue of that destination program.

However, many JITs preload cached SKB values into fixed registers in
the prologue.  But they only do this if the program being JITed needs
those values.

So how can it work properly if a program that does not need the SKB
values tail calls into one that does?

Daniel, Alexei?




Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-16 Thread Jesper Dangaard Brouer
On Fri, 14 Apr 2017 17:46:44 -0700
Alexei Starovoitov  wrote:

> But that's probably a bad idea, since I was assuming that such ring
> reconfiguration can be made fast, which is unlikely.
> If it takes seconds to setup a ring, then drivers should just
> assume XDP_PACKET_HEADROOM, since at that time the program
> properties are unknown and in the future other programs will be loaded.
>
> Take a look at our current setup with slots for xdp_dump, ddos, lb
> programs. Only the root program is attached at the beginning.
> If the driver configures the ring for such an empty program, that would break
> dynamic addition of lb prog.
> The driver must not interrupt the traffic when user adds another
> prog to prog array. In such case XDP side doesn't even know that
> prog array is used. It's all happening purely on bpf side.

The bpf tail-call use-case is a very good example of why the verifier
cannot deduce the needed HEADROOM upfront.

Could we still make the verifier reject a program getting attached as a
tail-call when too small a HEADROOM has been set up? (to satisfy the
program's needs)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Jakub Kicinski
On Fri, 14 Apr 2017 17:46:44 -0700, Alexei Starovoitov wrote:
> On Fri, Apr 14, 2017 at 03:30:32PM -0700, Jakub Kicinski wrote:
> > On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:  
> > > On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:  
> > > > > 
> > > > > We are consistently finding that there is this real need to
> > > > > communicate XDP capabilities, or somehow verify that the needs
> > > > > of an XDP program can be satisfied by a given implementation.
> > > > 
> > > > I fully agree that we need some way to express capabilities[1]
> > > > 
> > > > > Maximum headroom is just one.
> > > 
> > > I don't like the idea of asking program author to explain capabilities
> > > to the kernel. Right now the verifier already understands more about
> > > the program than a human does. If the verifier cannot deduce from the
> > > insns what the program will be doing, it's pretty much guaranteed
> > > that the program author has no idea either.
> > > If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> > > the users will just pass something like 64 or 128 whereas the program
> > > might only be doing IPIP encap and that will cause kernel to
> > > provide extra headroom for no good reason or reject the program
> > > whereas it could have run just fine.
> > > 
> > > So I very much agree with this part:  
> > > > ... or somehow verify that the needs
> > > > of an XDP program can be satisfied by a given implementation.
> > > 
> > > we already have cb_access, dst_needed, xdp_adjust_head flags
> > > that verifier discovers in the program.
> > > For headroom we need one more. The verifier can see
> > > the constant passed into bpf_xdp_adjust_head().
> > > It's trickier if it's a variable, but the verifier can estimate min/max
> > > of the variable already and worst case it can say that it
> > > will be XDP_PACKET_HEADROOM.
> > > If the program is doing variable length bpf_xdp_adjust_head(),
> > > the human has no idea how much they need and cannot tell kernel anyway.  
> > 
> > If I understand correctly, the current proposal is to:
> > 
> > 1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
> > 2. Make verifier try to lower that through byte code analysis.
> > 3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).  
> 
> no. the hard fail will be wrong, since the verifier is conservative.
> As I showed in the example with variable IP header length.
> 
> > We may expect most users won't care and we may want to still depend on
> > the verifier to analyze the program and enable optimizations, but
> > depending on the verifier for proving prepend lengths is scary.  Or are
> we just discussing the optimizations here and not the guarantees about
> > headroom availability?  
> 
> yes. I was thinking as an optimization. If verifier can see that
> prog needs only 20 bytes of headroom or no headroom (like netronome
> already does) then the ring can be configured differently for hopefully
> better performance.
> But that's probably a bad idea, since I was assuming that such ring
> reconfiguration can be made fast, which is unlikely.
> If it takes seconds to setup a ring, then drivers should just
> assume XDP_PACKET_HEADROOM, since at that time the program
> properties are unknown and in the future other programs will be loaded.
> Take a look at our current setup with slots for xdp_dump, ddos, lb
> programs. Only the root program is attached at the beginning.
> If the driver configures the ring for such an empty program, that would break
> dynamic addition of lb prog.
> The driver must not interrupt the traffic when user adds another
> prog to prog array. In such case XDP side doesn't even know that
> prog array is used. It's all happening purely on bpf side.
>
> > Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome cards
> > and then be in for a nasty surprise when they try to load the same
> > program for Intel parts :(  
> 
> are you talking hw offloaded prog or normal in-driver xdp?

In-driver, mostly.  As you said earlier in the thread we expect
Alex's/Intel's patches to provide "only" 192B of headroom.  For the NFP
offload there are trade-offs to a full 256B offset.  I think we could just
fall back to the host, but if the user hints that the headroom is big,
perhaps we would be better off not even trying to offload.

> > I agree that trying to express capabilities to user space is quickly
> > going to become more confusing than helpful.  However, I think here we
> > need the opposite - we need an opt-in way of program requesting
> > guarantees about the datapath.  
> 
> what is an example of what you proposing? How will it look from user pov?

This is roughly what I had in mind.  From user's perspective it would
probably make sense to tie it to the program itself, so:

bpf_helpers.h:

#define __XDP_ATTR  \
__attribute__((section(ELF_SECTION_XDP_ATTRS), used))

#define XDP_GUARANTEE_HEADROOM(val) \
__u32 

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Alexei Starovoitov
On Fri, Apr 14, 2017 at 03:30:32PM -0700, Jakub Kicinski wrote:
> On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > > > 
> > > > We are consistently finding that there is this real need to
> > > > communicate XDP capabilities, or somehow verify that the needs
> > > > of an XDP program can be satisfied by a given implementation.  
> > > 
> > > I fully agree that we need some way to express capabilities[1]
> > >   
> > > > Maximum headroom is just one.  
> > 
> > I don't like the idea of asking program author to explain capabilities
> > to the kernel. Right now the verifier already understands more about
> > the program than a human does. If the verifier cannot deduce from the
> > insns what the program will be doing, it's pretty much guaranteed
> > that the program author has no idea either.
> > If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> > the users will just pass something like 64 or 128 whereas the program
> > might only be doing IPIP encap and that will cause kernel to
> > provide extra headroom for no good reason or reject the program
> > whereas it could have run just fine.
> > 
> > So I very much agree with this part:
> > > ... or somehow verify that the needs
> > > of an XDP program can be satisfied by a given implementation.  
> > 
> > we already have cb_access, dst_needed, xdp_adjust_head flags
> > that verifier discovers in the program.
> > For headroom we need one more. The verifier can see
> > the constant passed into bpf_xdp_adjust_head().
> > It's trickier if it's a variable, but the verifier can estimate min/max
> > of the variable already and worst case it can say that it
> > will be XDP_PACKET_HEADROOM.
> > If the program is doing variable length bpf_xdp_adjust_head(),
> > the human has no idea how much they need and cannot tell kernel anyway.
> 
> If I understand correctly, the current proposal is to:
> 
> 1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
> 2. Make verifier try to lower that through byte code analysis.
> 3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).

no. the hard fail will be wrong, since the verifier is conservative.
As I showed in the example with variable IP header length.

> We may expect most users won't care and we may want to still depend on
> the verifier to analyze the program and enable optimizations, but
> depending on the verifier for proving prepend lengths is scary.  Or are
> we just discussing the optimizations here and not the guarantees about
> headroom availability?

yes. I was thinking as an optimization. If verifier can see that
prog needs only 20 bytes of headroom or no headroom (like netronome
already does) then the ring can be configured differently for hopefully
better performance.
But that's probably a bad idea, since I was assuming that such ring
reconfiguration can be made fast, which is unlikely.
If it takes seconds to setup a ring, then drivers should just
assume XDP_PACKET_HEADROOM, since at that time the program
properties are unknown and in the future other programs will be loaded.
Take a look at our current setup with slots for xdp_dump, ddos, lb
programs. Only the root program is attached at the beginning.
If the driver configures the ring for such an empty program, that would break
dynamic addition of lb prog.
The driver must not interrupt the traffic when user adds another
prog to prog array. In such case XDP side doesn't even know that
prog array is used. It's all happening purely on bpf side.

> Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome cards
> and then be in for a nasty surprise when they try to load the same
> program for Intel parts :(

are you talking hw offloaded prog or normal in-driver xdp?

> I agree that trying to express capabilities to user space is quickly
> going to become more confusing than helpful.  However, I think here we
> need the opposite - we need an opt-in way of program requesting
> guarantees about the datapath.

what is an example of what you proposing? How will it look from user pov?



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Jakub Kicinski
On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:
> On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > > 
> > > We are consistently finding that there is this real need to
> > > communicate XDP capabilities, or somehow verify that the needs
> > > of an XDP program can be satisfied by a given implementation.  
> > 
> > I fully agree that we need some way to express capabilities[1]
> >   
> > > Maximum headroom is just one.  
> 
> I don't like the idea of asking program author to explain capabilities
> to the kernel. Right now the verifier already understands more about
> the program than a human does. If the verifier cannot deduce from the
> insns what the program will be doing, it's pretty much guaranteed
> that the program author has no idea either.
> If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> the users will just pass something like 64 or 128 whereas the program
> might only be doing IPIP encap and that will cause kernel to
> provide extra headroom for no good reason or reject the program
> whereas it could have run just fine.
> 
> So I very much agree with this part:
> > ... or somehow verify that the needs
> > of an XDP program can be satisfied by a given implementation.  
> 
> we already have cb_access, dst_needed, xdp_adjust_head flags
> that verifier discovers in the program.
> For headroom we need one more. The verifier can see
> the constant passed into bpf_xdp_adjust_head().
> It's trickier if it's a variable, but the verifier can estimate min/max
> of the variable already and worst case it can say that it
> will be XDP_PACKET_HEADROOM.
> If the program is doing variable length bpf_xdp_adjust_head(),
> the human has no idea how much they need and cannot tell kernel anyway.

If I understand correctly, the current proposal is to:

1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
2. Make verifier try to lower that through byte code analysis.
3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).

Would that not cause unnecessary load failures?  I'm worried it would
force users to do things like this:

struct map_value *val;

if (val->prep_len > 64) /* Prove to verifier this is short */
return XDP_DROP;
if (xdp_adjust_head(xdp, -val->prep_len))
...

Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome cards
and then be in for a nasty surprise when they try to load the same
program for Intel parts :(

I agree that trying to express capabilities to user space is quickly
going to become more confusing than helpful.  However, I think here we
need the opposite - we need an opt-in way of program requesting
guarantees about the datapath.

We may expect most users won't care and we may want to still depend on
the verifier to analyze the program and enable optimizations, but
depending on the verifier for proving prepend lengths is scary.  Or are
we just discussing the optimizations here and not the guarantees about
headroom availability?


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Daniel Borkmann

On 04/14/2017 09:28 PM, Alexei Starovoitov wrote:

On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:


We are consistently finding that there is this real need to
communicate XDP capabilities, or somehow verify that the needs
of an XDP program can be satisfied by a given implementation.


I fully agree that we need some way to express capabilities[1]


Maximum headroom is just one.


I don't like the idea of asking program author to explain capabilities
to the kernel. Right now the verifier already understands more about
the program than a human does. If the verifier cannot deduce from the
insns what the program will be doing, it's pretty much guaranteed
that the program author has no idea either.
If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
the users will just pass something like 64 or 128 whereas the program
might only be doing IPIP encap and that will cause kernel to
provide extra headroom for no good reason or reject the program
whereas it could have run just fine.


Fully agree; such an extension is likely to be used wrongly, or with
some default size, as we have right now with XDP_PACKET_HEADROOM
covering most use cases, so as not to bother the user with this and
not to complicate things further.


So I very much agree with this part:

... or somehow verify that the needs
of an XDP program can be satisfied by a given implementation.


we already have cb_access, dst_needed, xdp_adjust_head flags
that verifier discovers in the program.
For headroom we need one more. The verifier can see
the constant passed into bpf_xdp_adjust_head().
It's trickier if it's a variable, but the verifier can estimate min/max
of the variable already and worst case it can say that it
will be XDP_PACKET_HEADROOM.


+1, we should try hard to reuse such information from the verifier
to determine specific requirements the program has, and check against
them at prog attach time. This works okay so far for the already
mentioned bits in struct bpf_prog, and could be further extended.


If the program is doing variable length bpf_xdp_adjust_head(),
the human has no idea how much they need and cannot tell kernel anyway.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Alexei Starovoitov
On Thu, Apr 13, 2017 at 11:38:10AM -0400, David Miller wrote:
> From: Johannes Berg 
> Date: Thu, 13 Apr 2017 08:10:56 +0200
> 
> > On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote:
> > 
> >> > +if (skb_linearize(skb))
> >> > +goto do_drop;
> >> 
> >> when we discussed supporting jumbo frames in XDP, the idea
> >> was that the program would need to look at first 3+k bytes only
> >> and the rest of the packet will be in non-contiguous pages.
> >> If we do that, it means that XDP program would have to assume
> >> that the packet is more than [data, data_end] and this range
> >> only covers linear part.
> >> If that's the future, we don't need to linearize the skb here
> >> and can let the program access headlen only.
> > 
> > I'm not sure how you think that would work - at least with our (wifi)
> > driver, the headlen should be maybe ETH_HLEN or so at this point. We'd
> > let the program know that it can only look at so much, but then the
> > program can't do anything at all with those frames. At some point then
> > we go back to bpf_skb_load_bytes() being necessary in one form or
> > another, no?
> 
> Agreed, this is completely unusable.  Some wired ethernet drivers do the
> same exact thing.

ahh. i thought all drivers do at least copy-break (256) bytes or copy
get_headlen or build_skb the whole thing.
Since wireless does eth_hlen, then yeah, skb_linearize() is the only way.
I guess any driver that would care about XDP performance would either
implement in-driver XDP or make sure that skb_linearize() doesn't
happen in generic XDP by doing build_skb() with the whole packet.
The driver can be smart and avoid doing copy-break if generic XDP is on.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Alexei Starovoitov
On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > We are consistently finding that there is this real need to
> > communicate XDP capabilities, or somehow verify that the needs
> > of an XDP program can be satisfied by a given implementation.
> 
> I fully agree that we need some way to express capabilities[1]
> 
> > Maximum headroom is just one.

I don't like the idea of asking program author to explain capabilities
to the kernel. Right now the verifier already understands more about
the program than a human does. If the verifier cannot deduce from the
insns what the program will be doing, it's pretty much guaranteed
that the program author has no idea either.
If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
the users will just pass something like 64 or 128 whereas the program
might only be doing IPIP encap and that will cause kernel to
provide extra headroom for no good reason or reject the program
whereas it could have run just fine.

So I very much agree with this part:
> ... or somehow verify that the needs
> of an XDP program can be satisfied by a given implementation.

we already have cb_access, dst_needed, xdp_adjust_head flags
that verifier discovers in the program.
For headroom we need one more. The verifier can see
the constant passed into bpf_xdp_adjust_head().
It's trickier if it's a variable, but the verifier can estimate min/max
of the variable already and worst case it can say that it
will be XDP_PACKET_HEADROOM.
If the program is doing variable length bpf_xdp_adjust_head(),
the human has no idea how much they need and cannot tell kernel anyway.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Alexei Starovoitov
On Fri, Apr 14, 2017 at 10:07:40AM +0200, Johannes Berg wrote:
> On Thu, 2017-04-13 at 16:01 -0400, David Miller wrote:
> > From: Johannes Berg 
> > Date: Thu, 13 Apr 2017 21:22:21 +0200
> > 
> > > OTOH, it might depend on the frame data itself, if the program does
> > > something like
> > > 
> > > xdp->data[xdp->data[0] & 0xf]
> > > 
> > > (read or write, doesn't really matter) so then the verifier would
> > have
> > > to take the maximum possible value there into account.
> > 
> > I am not well versed enough with the verifier to understand exactly
> > how and to what extent SKB accesses are validated by the verifier.
> > 
> > My, perhaps mistaken, impression is that access range validation is
> > still at least partially done at run time.
> 
> I think you're right for SKB accesses, but I'm pretty sure that for XDP
> the verifier checks that the program can't possibly access outside of
> [xdp->data, xdp->data_end], see compare_ptrs_to_packet().
> 
> This checks that comparisons to data_end are all there, i.e. that the
> program verifies it may access some bytes before actually doing so.
> However, the program could start with
> 
> if (xdp->data_end < xdp->data + 1024)
>   return DROP;
> [...]
> 
> and then the verifier would consider it safe.
> 
> Still, it has to track down into the [...] code to actually understand
> that it now didn't try to access xdp->data+1025, and as such it should
> be able to determine the maximum desired offset.
> 
> However, I'm coming to realize that may not necessarily mean that the
> program really *needs* to access that data.
> 
> For example, a hypothetical wifi program might want to recalculate and
> compare the CRC checksum (for whatever reason, say a driver/hardware
> bug). This would require accessing the last 4 bytes of the packet,
> which may not be present. The program could, however, accept that
> sometimes this isn't possible, and simply accept frames when it can't
> see the last 4 bytes (or if the last 4 bytes aren't the CRC because
> that isn't reported now, but whatever, I'm handwaving anyway.)
> 
> So perhaps this isn't really a good idea. The program should probably
> separately say how much data it really *needs* there, and then perhaps
> a warning could be emitted if it never accesses the data that it
> advertises as needing (i.e. if it says "I want 1024 bytes" but then
> can't possibly read more than 14)

The above description is correct. Also note that the offset within the packet
can be a variable as well, since that's the only way to do variable-length
parsing (like ip options), so the verifier cannot tell how far
into the packet the program is going to parse. The program author cannot
tell that either!
Consider samples/bpf/parse_varlen.c:
ihl_len = iph->ihl * 4;

if (iph->protocol == IPPROTO_IPIP) {
        iph = data + nh_off + ihl_len;
        if (iph + 1 > data_end)
                return 0;

'ihl' is a 4-bit field, so the 2nd ip hdr may start at sizeof(eth) + 64 bytes,
but we cannot ask the driver to always provide that many bytes.
Technically we can add a feature to the verifier to calculate the max range
into the packet that the program may look at, but it will be large.
Like if the program parses IPIP, the range will be 64 (first ip) + 64 (second ip)
+ 64 (if it looks at tcp options).
If the program parses TLVs then the range will be 64k, so I'm not sure what
the value of such a verifier feature would be.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Jesper Dangaard Brouer
On Thu, 13 Apr 2017 11:37:22 -0400 (EDT)
David Miller  wrote:

> From: Alexei Starovoitov 
> Date: Wed, 12 Apr 2017 21:20:38 -0700
> 
> > On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote:  
[...]
> 
> If the capability is variable, it must be communicated to the user
> somehow at program load time.
> 
> We are consistently finding that there is this real need to
> communicate XDP capabilities, or somehow verify that the needs
> of an XDP program can be satisfied by a given implementation.

I fully agree that we need some way to express capabilities[1]

[1] 
http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/design.html#capabilities-negotiation

> Maximum headroom is just one.

[...]
> 
> We can only optimize this and elide things when we have a facility in
> the future for the program to express its needs precisely.  I think
> we will have to add some control structure to XDP programs that can
> be filled in for this purpose.

I fully agree that we need some control structure for XDP programs.  My
previous attempt was shot down due to performance concerns over an extra
pointer dereference. As I explained before, this is not a concern, as
the dereference will happen once per N packets in the NAPI loop.

Plus now we see a need to elide things based on the facilities the XDP
program chooses to use/enable, for performance reasons.  I would prefer
keeping these facility settings in the control structure of XDP programs,
instead of pulling in derived bits at runtime.  Again remember, adding
if/branch statements checking for facilities should have little
performance impact, as the branch predictor should guess correctly given
that we process N packets in the NAPI loop with the same facilities.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-14 Thread Johannes Berg
On Thu, 2017-04-13 at 16:01 -0400, David Miller wrote:
> From: Johannes Berg 
> Date: Thu, 13 Apr 2017 21:22:21 +0200
> 
> > OTOH, it might depend on the frame data itself, if the program does
> > something like
> > 
> > xdp->data[xdp->data[0] & 0xf]
> > 
> > (read or write, doesn't really matter) so then the verifier would
> have
> > to take the maximum possible value there into account.
> 
> I am not well versed enough with the verifier to understand exactly
> how and to what extent SKB accesses are validated by the verifier.
> 
> My, perhaps mistaken, impression is that access range validation is
> still at least partially done at run time.

I think you're right for SKB accesses, but I'm pretty sure that for XDP
the verifier checks that the program can't possibly access outside of
[xdp->data, xdp->data_end], see compare_ptrs_to_packet().

This checks that comparisons to data_end are all there, i.e. that the
program verifies it may access some bytes before actually doing so.
However, the program could start with

if (xdp->data_end < xdp->data + 1024)
return DROP;
[...]

and then the verifier would consider it safe.

Still, it has to track down into the [...] code to actually understand
that it now didn't try to access xdp->data+1025, and as such it should
be able to determine the maximum desired offset.

However, I'm coming to realize that may not necessarily mean that the
program really *needs* to access that data.

For example, a hypothetical wifi program might want to recalculate and
compare the CRC checksum (for whatever reason, say a driver/hardware
bug). This would require accessing the last 4 bytes of the packet,
which may not be present. The program could, however, accept that
sometimes this isn't possible, and simply accept frames when it can't
see the last 4 bytes (or if the last 4 bytes aren't the CRC because
that isn't reported now, but whatever, I'm handwaving anyway.)

So perhaps this isn't really a good idea. The program should probably
separately say how much data it really *needs* there, and then perhaps
a warning could be emitted if it never accesses the data that it
advertises as needing (i.e. if it says "I want 1024 bytes" but then
can't possibly read more than 14)

johannes


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread David Miller
From: Johannes Berg 
Date: Thu, 13 Apr 2017 21:22:21 +0200

> OTOH, it might depend on the frame data itself, if the program does
> something like
> 
> xdp->data[xdp->data[0] & 0xf]
> 
> (read or write, doesn't really matter) so then the verifier would have
> to take the maximum possible value there into account.

I am not well versed enough with the verifier to understand exactly
how and to what extent SKB accesses are validated by the verifier.

My, perhaps mistaken, impression is that access range validation is
still at least partially done at run time.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread Johannes Berg
On Thu, 2017-04-13 at 11:37 -0400, David Miller wrote:

> If the capability is variable, it must be communicated to the user
> somehow at program load time.
> 
> We are consistently finding that there is this real need to
> communicate XDP capabilities, or somehow verify that the needs
> of an XDP program can be satisfied by a given implementation.

Technically, once you know the capability of the *driver*, the verifier
should be able to check if the *program* is compatible. So if the
driver can guarantee "you always get 2k accessible", the verifier can
check that you don't access more than xdp->data + 2047, similar to how
it verifies that you don't access beyond xdp->data_end.

> And eth_get_headlen() only pulls protocol headers, which precludes
> XDP inspecting anything below TCP/UDP/etc.  This is also not
> reasonable.
> 
> Right now, as it stands, we have to assume the program can
> potentially be interested in the entire packet.

I agree with this though, it's not reasonable to have wildly varying
implementations here that may or may not be able to access almost
anything. The totally degenerate case would be having no skb header at
all, which is also still entirely valid from the network stack's POV.

> We can only optimize this and elide things when we have a facility in
> the future for the program to express its needs precisely.  I think
> we will have to add some control structure to XDP programs that can
> be filled in for this purpose.

Like I said above, I think this is something that you can possibly
determine in the verifier.

So if, for example, the verifier notices that the program never
accesses anything but the first few bytes, then it would seem valid to
run with only that much pulled into the skb header.

OTOH, it might depend on the frame data itself, if the program does
something like

xdp->data[xdp->data[0] & 0xf]

(read or write, doesn't really matter) so then the verifier would have
to take the maximum possible value there into account.

johannes


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread David Miller
From: Daniel Borkmann 
Date: Thu, 13 Apr 2017 17:57:06 +0200

> On 04/12/2017 08:54 PM, David Miller wrote:
> [...]
>> +static u32 netif_receive_generic_xdp(struct sk_buff *skb,
>> + struct bpf_prog *xdp_prog)
>> +{
>> +struct xdp_buff xdp;
>> +u32 act = XDP_DROP;
>> +void *orig_data;
>> +int hlen, off;
>> +
>> +if (skb_linearize(skb))
> 
> Btw, given the skb can come from all kind of points in the stack,
> it could also be a clone at this point. One example is act_mirred
> which in fact does skb_clone() and can push the skb back to
> ingress path through netif_receive_skb() and thus could then go
> into generic xdp processing, where skb can be mangled.
> 
> Instead of skb_linearize() we would therefore need to use something
> like skb_ensure_writable(skb, skb->len) as equivalent, which also
> makes sure that we unclone whenever needed.

We could use skb_cow() for this purpose, which deals with cloning as
well as enforcing headroom.

However, thinking further about this, the goal is to make generic XDP
match precisely how in-driver-XDP behaves.  Therefore, such redirects
from act_mirred would never flow through the XDP path.

No other possibility can cause us to see a cloned packet here, we
are before network taps are processed, etc.  So in my opinion the
thing to do is to elide generic XDP if the SKB is cloned.
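That elision could be a one-line check at the top of netif_receive_generic_xdp() (a sketch, with the placement assumed; it is not part of the posted patch):

```c
/* Sketch: skip generic XDP entirely for cloned skbs, since the XDP
 * program may write packet data that a clone (e.g. from act_mirred)
 * still shares.  The alternative is to unclone via
 * skb_ensure_writable(skb, skb->len), at the cost of a copy. */
if (skb_cloned(skb))
	return XDP_PASS;
```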


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread Daniel Borkmann

On 04/12/2017 08:54 PM, David Miller wrote:
[...]

+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+struct bpf_prog *xdp_prog)
+{
+   struct xdp_buff xdp;
+   u32 act = XDP_DROP;
+   void *orig_data;
+   int hlen, off;
+
+   if (skb_linearize(skb))


Btw, given the skb can come from all kind of points in the stack,
it could also be a clone at this point. One example is act_mirred
which in fact does skb_clone() and can push the skb back to
ingress path through netif_receive_skb() and thus could then go
into generic xdp processing, where skb can be mangled.

Instead of skb_linearize() we would therefore need to use something
like skb_ensure_writable(skb, skb->len) as equivalent, which also
makes sure that we unclone whenever needed.


+   goto do_drop;
+
+   /* The XDP program wants to see the packet starting at the MAC
+* header.
+*/
+   hlen = skb_headlen(skb) + skb->mac_len;
+   xdp.data = skb->data - skb->mac_len;
+   xdp.data_end = xdp.data + hlen;
+   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   orig_data = xdp.data;
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   off = xdp.data - orig_data;
+   if (off)
+   __skb_push(skb, off);
+
+   switch (act) {
+   case XDP_TX:
+   __skb_push(skb, skb->mac_len);
+   /* fall through */
+   case XDP_PASS:
+   break;
+
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   /* fall through */
+   case XDP_ABORTED:
+   trace_xdp_exception(skb->dev, xdp_prog, act);
+   /* fall through */
+   case XDP_DROP:
+   do_drop:
+   kfree_skb(skb);
+   break;
+   }
+
+   return act;
+}
+
  static int netif_receive_skb_internal(struct sk_buff *skb)
  {
int ret;
@@ -4258,6 +4341,21 @@ static int netif_receive_skb_internal(struct sk_buff *skb)

rcu_read_lock();

+   if (static_key_false(&generic_xdp_needed)) {
+   struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+
+   if (xdp_prog) {
+   u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+
+   if (act != XDP_PASS) {
+   rcu_read_unlock();
+   if (act == XDP_TX)
+   dev_queue_xmit(skb);
+   return NET_RX_DROP;
+   }
+   }
+   }
+

[...]


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread David Miller
From: Michael Chan 
Date: Wed, 12 Apr 2017 23:48:22 -0700

> On Wed, Apr 12, 2017 at 11:54 AM, David Miller  wrote:
>>
>> +static struct static_key generic_xdp_needed __read_mostly;
>> +
>> +static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
>> +{
>> +   struct bpf_prog *new = xdp->prog;
>> +   int ret = 0;
>> +
>> +   switch (xdp->command) {
>> +   case XDP_SETUP_PROG: {
>> +   struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
>> +
>> +   rcu_assign_pointer(dev->xdp_prog, new);
>> +   if (old)
>> +   bpf_prog_put(old);
>> +
>> +   if (old && !new)
>> +   static_key_slow_dec(&generic_xdp_needed);
>> +   else if (new && !old)
>> +   static_key_slow_inc(&generic_xdp_needed);
> 
> I think we need to call dev_disable_lro(dev) to disable LRO also when
> a BPF program is installed.

Agreed, I'll add this.

Thanks Michael.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread David Miller
From: Johannes Berg 
Date: Thu, 13 Apr 2017 08:10:56 +0200

> On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote:
> 
>> > +  if (skb_linearize(skb))
>> > +  goto do_drop;
>> 
>> when we discussed supporting jumbo frames in XDP, the idea
>> was that the program would need to look at first 3+k bytes only
>> and the rest of the packet will be in non-contiguous pages.
>> If we do that, it means that XDP program would have to assume
>> that the packet is more than [data, data_end] and this range
>> only covers linear part.
>> If that's the future, we don't need to linearize the skb here
>> and can let the program access headlen only.
> 
> I'm not sure how you think that would work - at least with our (wifi)
> driver, the headlen should be maybe ETH_HLEN or so at this point. We'd
> let the program know that it can only look at so much, but then the
> program can't do anything at all with those frames. At some point then
> we go back to bpf_skb_load_bytes() being necessary in one form or
> another, no?

Agreed, this is completely unusable.  Some wired ethernet drivers do the
same exact thing.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread David Miller
From: Alexei Starovoitov 
Date: Wed, 12 Apr 2017 21:20:38 -0700

> On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote:
>> One thing I have not tried to address here is the issue of
>> XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
>> incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
>> whatever even if the XDP program doesn't try to push headers at all.
>> I think we really need the verifier to somehow propagate whether
>> certain XDP helpers are used or not.
> 
> Looks like we need to relax the headroom requirement.
> I really wanted to simplify the life of program writers,
> but intel drivers insist on 192 headroom already, then for skb
> every driver does 64 and netronome doesn't have the room by default at all
> even for XDP and relies on expensive copy when xdp_adjust_head is used
> so that dream isn't going to come true.
> I guess for now every driver should _try_ to give XDP_PACKET_HEADROOM
> bytes, but the driver can omit it completely if xdp_adjust_head()
> is not used for performance reasons.

If the capability is variable, it must be communicated to the user
somehow at program load time.

We are consistently finding that there is this real need to
communicate XDP capabilities, or somehow verify that the needs
of an XDP program can be satisfied by a given implementation.

Maximum headroom is just one.

Simply having bpf_xdp_adjust_head() fail and therefore drop packets
isn't good behavior at all.  But that is the situation we have
right now.

>> +if (skb_linearize(skb))
>> +goto do_drop;
> 
> when we discussed supporting jumbo frames in XDP, the idea
> was that the program would need to look at first 3+k bytes only
> and the rest of the packet will be in non-contiguous pages.
> If we do that, it means that XDP program would have to assume
> that the packet is more than [data, data_end] and this range
> only covers linear part.
> If that's the future, we don't need to linearize the skb here
> and can let the program access headlen only.
> It means that we'd need to add 'len' field to 'struct xdp_md' uapi.
> For our existing programs (ddos and l4lb) it's not a problem,
> since for MTU < 3.xK and physical XDP the driver guarantees that
> data_end - data == len, so nothing will break.
> So I'm proposing to do two things:
> - drop skb_linearize here
> - introduce 'len' to 'struct xdp_md' and update it here and
> in the existing drivers that support xdp.

I don't think this is reasonable.  The amount of inconsistency will be
quite huge.

First of all, every driver pulls a different amount of bytes into the
linear area when building frag based SKBs on receive.  There are
drivers that only put the MAC header in the linear area, and depend
upon the logic in the stack which does pskb_may_pull() before touching
deeper headers.

Of course, such drivers should be using eth_get_headlen().  But they
don't, and what they are doing is not a bug.

And eth_get_headlen() only pulls protocol headers, which precludes XDP
inspecting anything below TCP/UDP/etc.  This is also not reasonable.

Right now, as it stands, we have to assume the program can potentially
be interested in the entire packet.

We can only optimize this and elide things when we have a facility in
the future for the program to express its needs precisely.  I think
we will have to add some control structure to XDP programs that can
be filled in for this purpose.

Right now the two things we know would need to be expressed are 1)
maximum headroom needed and 2) parsing depth.  I'm not 100% clear on
how #2 should be specified, but in its simplest sense it could take
on two values, either "what eth_get_headlen() returns" or "entire
packet" to start with.

> 
>> +if (act == XDP_TX)
>> +dev_queue_xmit(skb);
> 
> this will go through qdisc which is not a problem per-se,
> but may mislead poor users that XDP_TX goes through qdisc
> even for in-driver XDP which is not the case.
> So I think we need to bypass qdisc somehow. Like
> txq = netdev_pick_tx(skb, ...);
> HARD_TX_LOCK(dev, txq, ...);
> if (!netif_xmit_stopped(txq)) {
>         dev_hard_start_xmit(skb, dev, txq, ...);
> } else {
>         kfree_skb(skb);
>         txq->xdp_tx_full++;
> }
> HARD_TX_UNLOCK(dev, txq);
> 
> this way it will be similar to in-driver XDP which
> also has xdp_tx_full counter when HW TX queue is full.

Ok, this makes sense.  I'll work on that.
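Fleshed out, such a qdisc-bypassing XDP_TX helper might look like the sketch below (hypothetical and untested; the function name, the use of trace_xdp_exception() for the queue-full case, and the exact error handling are assumptions, while netdev_pick_tx(), HARD_TX_LOCK() and netdev_start_xmit() are the existing primitives Alexei refers to):

```c
/* Hypothetical sketch of a qdisc-bypassing TX path for generic XDP,
 * following the outline quoted above: pick a TX queue, take the HW
 * queue lock directly, and drop the packet if the queue is stopped. */
static void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	bool free_skb = true;
	int cpu, rc;

	txq = netdev_pick_tx(dev, skb, NULL);
	cpu = smp_processor_id();
	HARD_TX_LOCK(dev, txq, cpu);
	if (!netif_xmit_stopped(txq)) {
		rc = netdev_start_xmit(skb, dev, txq, 0);
		if (dev_xmit_complete(rc))
			free_skb = false;
	}
	HARD_TX_UNLOCK(dev, txq);
	if (free_skb) {
		/* plays the role of the in-driver xdp_tx_full accounting */
		trace_xdp_exception(dev, xdp_prog, XDP_TX);
		kfree_skb(skb);
	}
}
```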

> Re: csum and skb->encapsulate issues that were raised in the previous thread
> 
> Today all cls_bpf csum helpers are currently disabled for XDP
> and if the program mangles the packet and then does XDP_PASS,
> the packet will be dropped by the stack due to incorrect csum.
> So we're no better here and need a solution for both in-driver XDP and 
> generic XDP.

I'm not so sure how I feel about checksums, I've been busy following
the thread started by Tom.  I want to strive for something simple.

> I think the green 

Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread Michael Chan
On Wed, Apr 12, 2017 at 11:54 AM, David Miller  wrote:
>
> +static struct static_key generic_xdp_needed __read_mostly;
> +
> +static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> +   struct bpf_prog *new = xdp->prog;
> +   int ret = 0;
> +
> +   switch (xdp->command) {
> +   case XDP_SETUP_PROG: {
> +   struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
> +
> +   rcu_assign_pointer(dev->xdp_prog, new);
> +   if (old)
> +   bpf_prog_put(old);
> +
> +   if (old && !new)
> +   static_key_slow_dec(&generic_xdp_needed);
> +   else if (new && !old)
> +   static_key_slow_inc(&generic_xdp_needed);

I think we need to call dev_disable_lro(dev) to disable LRO also when
a BPF program is installed.

> +   break;
> +   }
> +
> +   case XDP_QUERY_PROG:
> +   xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
> +   break;
> +
> +   default:
> +   ret = -EINVAL;
> +   break;
> +   }
> +
> +   return ret;
> +}
> +


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-13 Thread Johannes Berg
On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote:

> > +   if (skb_linearize(skb))
> > +   goto do_drop;
> 
> when we discussed supporting jumbo frames in XDP, the idea
> was that the program would need to look at first 3+k bytes only
> and the rest of the packet will be in non-contiguous pages.
> If we do that, it means that XDP program would have to assume
> that the packet is more than [data, data_end] and this range
> only covers linear part.
> If that's the future, we don't need to linearize the skb here
> and can let the program access headlen only.

I'm not sure how you think that would work - at least with our (wifi)
driver, the headlen should be maybe ETH_HLEN or so at this point. We'd
let the program know that it can only look at so much, but then the
program can't do anything at all with those frames. At some point then
we go back to bpf_skb_load_bytes() being necessary in one form or
another, no?

johannes


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Alexei Starovoitov
On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote:
> 
> This provides a generic SKB based non-optimized XDP path which is used
> if either the driver lacks a specific XDP implementation, or the user
> requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with less
>dependencies.  Yes I know we have XDP support in virtio_net, but
>that just creates another depedency for learning how to use this
>facility.
> 
>I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>generic core implementation, it serves as a semantic example for
>driver folks adding XDP support.
> 
> This is just a rough draft and is untested.
> 
> One thing I have not tried to address here is the issue of
> XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
> incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
> whatever even if the XDP program doesn't try to push headers at all.
> I think we really need the verifier to somehow propagate whether
> certain XDP helpers are used or not.

Looks like we need to relax the headroom requirement.
I really wanted to simplify the life of program writers,
but intel drivers insist on 192 bytes of headroom already, for skbs
every driver does 64, and netronome doesn't have the room by default at all
even for XDP and relies on an expensive copy when xdp_adjust_head is used,
so that dream isn't going to come true.
I guess for now every driver should _try_ to give XDP_PACKET_HEADROOM
bytes, but the driver can omit it completely for performance reasons
if xdp_adjust_head() is not used.

> +static inline bool netif_elide_gro(const struct net_device *dev)
> +{
> + if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
> + return true;
> + return false;
> +}

I think that's the right call.
I've been thinking back and forth about it.
On one hand it's not cool to disable it without informing the
user via ethtool, so it might feel that doing
ethtool_set_one_feature(~GRO) would be cleaner, but then
we'd need to remember the current gro on/off status and
restore it after the prog is detached.
And while the prog is attached the user shouldn't be able
to change gro via ethtool -K, so we'd need an extra boolean
flag anyway. If we go with this netif_elide_gro() approach,
we don't need to mess with ndo_fix_features() and only
need to hack ethtool_get_features() to report GRO disabled
if (dev->xdp_prog) and mark it as [fixed].
So overall imo it's cleaner than messing with dev->features
directly while attaching/detaching the prog.

> + if (skb_linearize(skb))
> + goto do_drop;

when we discussed supporting jumbo frames in XDP, the idea
was that the program would need to look at first 3+k bytes only
and the rest of the packet will be in non-contiguous pages.
If we do that, it means that XDP program would have to assume
that the packet is more than [data, data_end] and this range
only covers linear part.
If that's the future, we don't need to linearize the skb here
and can let the program access headlen only.
It means that we'd need to add 'len' field to 'struct xdp_md' uapi.
For our existing programs (ddos and l4lb) it's not a problem,
since for MTU < 3.xK and physical XDP the driver guarantees that
data_end - data == len, so nothing will break.
So I'm proposing to do two things:
- drop skb_linearize here
- introduce 'len' to 'struct xdp_md' and update it here and
in the existing drivers that support xdp.

> + if (act == XDP_TX)
> + dev_queue_xmit(skb);

this will go through qdisc which is not a problem per-se,
but may mislead poor users that XDP_TX goes through qdisc
even for in-driver XDP which is not the case.
So I think we need to bypass qdisc somehow. Like
txq = netdev_pick_tx(skb, ...);
HARD_TX_LOCK(dev, txq, ...);
if (!netif_xmit_stopped(txq)) {
        dev_hard_start_xmit(skb, dev, txq, ...);
} else {
        kfree_skb(skb);
        txq->xdp_tx_full++;
}
HARD_TX_UNLOCK(dev, txq);

this way it will be similar to in-driver XDP which
also has xdp_tx_full counter when HW TX queue is full.

Re: csum and skb->encapsulate issues that were raised in the previous thread

Today all cls_bpf csum helpers are currently disabled for XDP
and if the program mangles the packet and then does XDP_PASS,
the packet will be dropped by the stack due to incorrect csum.
So we're no better here and need a solution for both in-driver XDP and generic 
XDP.

I think the green light to apply this patch will be when
samples/bpf/xdp1, xdp2 and xdp_tx_iptunnel
work just like they do in in-driver XDP and I think we're pretty close.

If desired I can start hacking on this patch and testing mid next week.



Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Ahern
On 4/12/17 8:08 PM, David Miller wrote:
> From: David Ahern 
> Date: Wed, 12 Apr 2017 13:54:20 -0600
> 
>> packet passed to xdp seems to be messed up
> Ok, the problem is that skb->mac_len isn't set properly at this point.
> That doesn't happen until __netif_receive_skb_core().
> 
> I'm also not setting xdp.data_hard_start properly.
> 
> It should work better with this small diff:

it does.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Miller
From: David Ahern 
Date: Wed, 12 Apr 2017 13:54:20 -0600

> packet passed to xdp seems to be messed up

Ok, the problem is that skb->mac_len isn't set properly at this point.
That doesn't happen until __netif_receive_skb_core().

I'm also not setting xdp.data_hard_start properly.

It should work better with this small diff:

diff --git a/net/core/dev.c b/net/core/dev.c
index 9ed4569..d36ae8f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4289,6 +4289,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
u32 act = XDP_DROP;
void *orig_data;
int hlen, off;
+   u32 mac_len;
 
if (skb_linearize(skb))
goto do_drop;
@@ -4296,10 +4297,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
/* The XDP program wants to see the packet starting at the MAC
 * header.
 */
-   hlen = skb_headlen(skb) + skb->mac_len;
-   xdp.data = skb->data - skb->mac_len;
+   mac_len = skb->data - skb_mac_header(skb);
+   hlen = skb_headlen(skb) + mac_len;
+   xdp.data = skb->data - mac_len;
xdp.data_end = xdp.data + hlen;
-   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   xdp.data_hard_start = skb->data - skb_headroom(skb);
orig_data = xdp.data;
 
act = bpf_prog_run_xdp(xdp_prog, &xdp);


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Miller
From: Stephen Hemminger 
Date: Wed, 12 Apr 2017 14:30:37 -0700

> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
> David Miller  wrote:
> 
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index b0aa089..071a58b 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1891,9 +1891,17 @@ struct net_device {
>>  struct lock_class_key   *qdisc_tx_busylock;
>>  struct lock_class_key   *qdisc_running_key;
>>  boolproto_down;
>> +struct bpf_prog __rcu   *xdp_prog;
> 
> It would be good if all devices could reuse this for the xdp_prog pointer.
> It would allow it to be used for introspection utility functions in the
> future.

We plan to do so.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Miller
From: Eric Dumazet 
Date: Wed, 12 Apr 2017 14:49:53 -0700

> On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote:
>> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
>> David Miller  wrote:
>> 
>> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> > index b0aa089..071a58b 100644
>> > --- a/include/linux/netdevice.h
>> > +++ b/include/linux/netdevice.h
>> > @@ -1891,9 +1891,17 @@ struct net_device {
>> >struct lock_class_key   *qdisc_tx_busylock;
>> >struct lock_class_key   *qdisc_running_key;
>> >boolproto_down;
>> > +  struct bpf_prog __rcu   *xdp_prog;
>> 
>> It would be good if all devices could reuse this for the xdp_prog pointer.
>> It would allow it to be used for introspection utility functions in the
>> future.
> 
> Problem is that some xdp usages were envisioning a per RX queue xdp
> program.

True, but that hasn't materialized yet so designing for it so soon
doesn't make a lot of sense.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Eric Dumazet
On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote:
> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
> David Miller  wrote:
> 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index b0aa089..071a58b 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1891,9 +1891,17 @@ struct net_device {
> > struct lock_class_key   *qdisc_tx_busylock;
> > struct lock_class_key   *qdisc_running_key;
> > boolproto_down;
> > +   struct bpf_prog __rcu   *xdp_prog;
> 
> It would be good if all devices could reuse this for the xdp_prog pointer.
> It would allow it to be used for introspection utility functions in the
> future.

Problem is that some xdp usages were envisioning a per RX queue xdp
program.





Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread Stephen Hemminger
On Wed, 12 Apr 2017 14:54:15 -0400 (EDT)
David Miller  wrote:

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index b0aa089..071a58b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1891,9 +1891,17 @@ struct net_device {
>   struct lock_class_key   *qdisc_tx_busylock;
>   struct lock_class_key   *qdisc_running_key;
>   boolproto_down;
> + struct bpf_prog __rcu   *xdp_prog;

It would be good if all devices could reuse this field for the xdp_prog pointer.
It could then be used for introspection utility functions in the future.


Re: [PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Ahern
On 4/12/17 12:54 PM, David Miller wrote:
> 
> This provides a generic SKB based non-optimized XDP path which is used
> if either the driver lacks a specific XDP implementation, or the user
> requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
> 
> It is arguable that perhaps I should have required something like
> this as part of the initial XDP feature merge.
> 
> I believe this is critical for two reasons:
> 
> 1) Accessibility.  More people can play with XDP with fewer
>dependencies.  Yes I know we have XDP support in virtio_net, but
>that just creates another dependency for learning how to use this
>facility.
> 
>I wrote this to make life easier for the XDP newbies.
> 
> 2) As a model for what the expected semantics are.  If there is a pure
>generic core implementation, it serves as a semantic example for
>driver folks adding XDP support.
> 
> This is just a rough draft and is untested.


The packet passed to XDP seems to be messed up. Using samples/bpf/xdp1*.c as
an example:

Current with virtio-net:
# xdp1 3
proto 0:  0 pkt/s
proto 6:  2 pkt/s
proto 17:   1610 pkt/s
proto 17:  22785 pkt/s

Using the skb-mode:

# xdp1 3 skb-mode
proto 0:  2 pkt/s
proto 0:  10880 pkt/s
proto 0:  29990 pkt/s
proto 0:  16972 pkt/s


If I change ipproto in samples/bpf/xdp1_kern.c:

if (h_proto == htons(ETH_P_IP))
ipproto = parse_ipv4(data, nh_off, data_end);
else if (h_proto == htons(ETH_P_IPV6))
ipproto = parse_ipv6(data, nh_off, data_end);
else
ipproto = 63;

I get:

# xdp1 3 skb-mode
proto 63:   6538 pkt/s
proto 63:  23679 pkt/s
proto 63:  29974 pkt/s
proto 63:  29599 pkt/s

which suggests h_proto is goofed up with skb-mode.


[PATCH v3 net-next RFC] Generic XDP

2017-04-12 Thread David Miller

This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with fewer
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another dependency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

This is just a rough draft and is untested.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

Signed-off-by: David S. Miller 
---

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAG_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0aa089..071a58b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1891,9 +1891,17 @@ struct net_device {
struct lock_class_key   *qdisc_tx_busylock;
struct lock_class_key   *qdisc_running_key;
boolproto_down;
+   struct bpf_prog __rcu   *xdp_prog;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
+static inline bool netif_elide_gro(const struct net_device *dev)
+{
+   if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
+   return true;
+   return false;
+}
+
 #define NETDEV_ALIGN	32
 
 static inline
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8b405af..633aa02 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -887,7 +887,9 @@ enum {
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
-#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST)
+#define XDP_FLAGS_SKB_MODE		(2U << 0)
+#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
+					 XDP_FLAGS_SKB_MODE)
 
 enum {
IFLA_XDP_UNSPEC,
diff --git a/net/core/dev.c b/net/core/dev.c
index ef9fe60e..9ed4569 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -95,6 +95,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -4247,6 +4248,88 @@ static int __netif_receive_skb(struct sk_buff *skb)
return ret;
 }
 
+static struct static_key generic_xdp_needed __read_mostly;
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
+{
+   struct bpf_prog *new = xdp->prog;
+   int ret = 0;
+
+   switch (xdp->command) {
+   case XDP_SETUP_PROG: {
+   struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+
+   rcu_assign_pointer(dev->xdp_prog, new);
+   if (old)
+   bpf_prog_put(old);
+
+   if (old && !new)
+   static_key_slow_dec(&generic_xdp_needed);
+   else if (new && !old)
+   static_key_slow_inc(&generic_xdp_needed);
+   break;
+   }
+
+   case XDP_QUERY_PROG:
+   xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
+   break;
+
+   default:
+   ret = -EINVAL;
+   break;
+   }
+
+   return ret;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+struct bpf_prog *xdp_prog)
+{
+   struct xdp_buff xdp;
+   u32 act = XDP_DROP;
+   void *orig_data;
+   int hlen, off;
+
+   if (skb_linearize(skb))
+   goto do_drop;
+
+   /* The XDP program wants to see the packet starting at the MAC
+* header.
+*/
+   hlen = skb_headlen(skb) + skb->mac_len;
+   xdp.data = skb->data - skb->mac_len;
+   xdp.data_end = xdp.data + hlen;
+   xdp.data_hard_start = xdp.data - skb_headroom(skb);
+   orig_data = xdp.data;
+
+   act =