Re: [PATCH v3 net-next RFC] Generic XDP
On Tue, Apr 18, 2017 at 11:47:59AM +0200, Johannes Berg wrote:
> > I guess any driver that would care about XDP performance would either
> > implement in-driver XDP or make sure that skb_linearize() doesn't
> > happen in generic XDP by doing build_skb() with the whole packet.
> > The driver can be smart and avoid doing copy-break if generic XDP is
> > on.
>
> Indeed. I think this is less about performance than ease of
> experimentation.

yes. I'm bringing it up only to hopefully find cases where this generic
XDP can be used for a real data path too (and not only experimentation).
Like if it can do 6Mpps whereas in-driver XDP will do 20Mpps (when
implemented), for some folks it may be good enough already to start
using it. Which will put more pressure on NIC vendors to implement it.
Re: [PATCH v3 net-next RFC] Generic XDP
On Tue, Apr 18, 2017 at 02:46:25PM -0400, David Miller wrote:
> From: Alexei Starovoitov
> Date: Mon, 17 Apr 2017 16:04:38 -0700
>
> > On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> > > From: Jesper Dangaard Brouer
> > > Date: Sun, 16 Apr 2017 22:26:01 +0200
> > >
> > > > The bpf tail-call use-case is a very good example of why the
> > > > verifier cannot deduce the needed HEADROOM upfront.
> > >
> > > This brings up a very interesting question for me.
> > >
> > > I notice that tail calls are implemented by JITs largely by
> > > skipping over the prologue of that destination program.
> > >
> > > However, many JITs preload cached SKB values into fixed registers
> > > in the prologue. But they only do this if the program being JITed
> > > needs those values.
> > >
> > > So how can it work properly if a program that does not need the
> > > SKB values tail calls into one that does?
> >
> > For x86 JIT it's fine, since caching of skb values is not part of
> > the prologue:
> >   emit_prologue();
> >   if (seen_ld_abs)
> >           emit_load_skb_data_hlen();
> > and tail_call jumps into the next program as:
> >   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE); /* add rax, prologue_size */
> >   EMIT2(0xFF, 0xE0);                      /* jmp rax */
> > whereas inside emit_prologue() we have:
> >   BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
> >
> > arm64 has similar prologue skipping code and it's even simpler than
> > x86, since it doesn't try to optimize LD_ABS/IND in assembler and
> > instead calls into bpf_load_pointer() from generated code, so no
> > caching of skb values at all.
> >
> > s390 jit has partial skipping of prologue, since a bunch of
> > registers are saved/restored during tail_call and it looks fine to
> > me as well.
>
> Ok, what about stack usage?
>
> Currently if I don't see a reference to FP then I elide allocating
> MAX_BPF_STACK stack space.
>
> What if, with tail calls, some programs need that stack space whilst
> others don't?
>
> It looks like, for example, JITs like powerpc avoid this issue
> because they allocate the full MAX_BPF_STACK all the time. That seems
> like overkill to me and bad for cache locality.

For x86 we also always emit a proper stack frame, since optimizing for
leaf functions is very rare. Most eBPF progs have at least one helper
call. Even cBPF often uses SKF_AD_CPU or SKF_AD_RANDOM, which translate
to a function call as well and need a stack frame.
I think stack frames on sparc are much more expensive than on x86 due
to the register window architecture, so there it may make sense to
squeeze out these extra cycles, but it will be rarely exercised in
practice.
I was thinking to teach the verifier to recognize the required stack
size, so we can JIT with that size instead of 512, but that's mainly to
reduce kernel stack usage. I doubt it will make any performance
difference.
As far as sparc and other big archs go, it would be great for the JITs
to somehow take advantage of the extra registers these archs have. I
think that will only be possible once we have verifier 2.0 with proper
register liveness and so on, so we can convert spill/fills into
register copies and maybe even run a simple regalloc pass after the
verifier. Crazy talk ;)
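A rough userspace sketch of the "teach the verifier to recognize the required stack size" idea mentioned above: walk the program, find the deepest frame-pointer-relative access, and round up, so a JIT could allocate less than MAX_BPF_STACK. The instruction model here is made up for illustration; it is not the kernel's struct bpf_insn or the real verifier API.

```c
#include <assert.h>

/* Hypothetical, simplified instruction model: just enough to show
 * the analysis.  Real BPF stack slots are accessed at negative
 * offsets from the frame pointer (FP/R10). */
#define MAX_BPF_STACK 512

struct insn {
	int uses_fp;	/* does this insn access the frame pointer? */
	int off;	/* FP-relative offset, negative for stack slots */
};

static int required_stack(const struct insn *prog, int len)
{
	int deepest = 0;

	for (int i = 0; i < len; i++) {
		if (prog[i].uses_fp && -prog[i].off > deepest)
			deepest = -prog[i].off;
	}
	/* keep the frame 8-byte aligned, never exceed the global cap */
	deepest = (deepest + 7) & ~7;
	return deepest > MAX_BPF_STACK ? MAX_BPF_STACK : deepest;
}
```

A program that only spills to FP-20 would then get a 24-byte frame instead of 512, and a leaf program touching no FP at all gets zero, which is exactly the case David's JIT elides today.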
Re: [PATCH v3 net-next RFC] Generic XDP
From: Alexei Starovoitov
Date: Mon, 17 Apr 2017 16:04:38 -0700

> On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> > From: Jesper Dangaard Brouer
> > Date: Sun, 16 Apr 2017 22:26:01 +0200
> >
> > > The bpf tail-call use-case is a very good example of why the
> > > verifier cannot deduce the needed HEADROOM upfront.
> >
> > This brings up a very interesting question for me.
> >
> > I notice that tail calls are implemented by JITs largely by skipping
> > over the prologue of that destination program.
> >
> > However, many JITs preload cached SKB values into fixed registers in
> > the prologue. But they only do this if the program being JITed needs
> > those values.
> >
> > So how can it work properly if a program that does not need the SKB
> > values tail calls into one that does?
>
> For x86 JIT it's fine, since caching of skb values is not part of the
> prologue:
>   emit_prologue();
>   if (seen_ld_abs)
>           emit_load_skb_data_hlen();
> and tail_call jumps into the next program as:
>   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE); /* add rax, prologue_size */
>   EMIT2(0xFF, 0xE0);                      /* jmp rax */
> whereas inside emit_prologue() we have:
>   BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
>
> arm64 has similar prologue skipping code and it's even simpler than
> x86, since it doesn't try to optimize LD_ABS/IND in assembler and
> instead calls into bpf_load_pointer() from generated code, so no
> caching of skb values at all.
>
> s390 jit has partial skipping of prologue, since a bunch of registers
> are saved/restored during tail_call and it looks fine to me as well.

Ok, what about stack usage?

Currently if I don't see a reference to FP then I elide allocating
MAX_BPF_STACK stack space.

What if, with tail calls, some programs need that stack space whilst
others don't?

It looks like, for example, JITs like powerpc avoid this issue because
they allocate the full MAX_BPF_STACK all the time. That seems like
overkill to me and bad for cache locality.
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, 2017-04-14 at 12:41 -0700, Alexei Starovoitov wrote:
>
> ahh. i thought all drivers do at least copy-break (256) bytes

We do copy-break, but that's only applicable when there's no non-linear
data left afterwards :)

> or copy get_headlen or build_skb the whole thing.
> Since wireless does eth_hlen, then yeah, skb_linearize() is the only
> way.

It's rather difficult to do this in wifi because we'd need to parse
deep into the packet - at a point where we haven't even parsed the
802.11 header (which has variable length) fully. So that leaves us to
do it at the very end, after conversion to ethernet format, but then it
feels pointless since we might as well let the rest of the stack do it.

> I guess any driver that would care about XDP performance would either
> implement in-driver XDP or make sure that skb_linearize() doesn't
> happen in generic XDP by doing build_skb() with the whole packet.
> The driver can be smart and avoid doing copy-break if generic XDP is
> on.

Indeed. I think this is less about performance than ease of
experimentation.

johannes
Re: [PATCH v3 net-next RFC] Generic XDP
On 04/18/2017 01:04 AM, Alexei Starovoitov wrote:
> On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> > From: Jesper Dangaard Brouer
> > Date: Sun, 16 Apr 2017 22:26:01 +0200
> >
> > > The bpf tail-call use-case is a very good example of why the
> > > verifier cannot deduce the needed HEADROOM upfront.
> >
> > This brings up a very interesting question for me.
> >
> > I notice that tail calls are implemented by JITs largely by skipping
> > over the prologue of that destination program.
> >
> > However, many JITs preload cached SKB values into fixed registers in
> > the prologue. But they only do this if the program being JITed needs
> > those values.
> >
> > So how can it work properly if a program that does not need the SKB
> > values tail calls into one that does?
>
> For x86 JIT it's fine, since caching of skb values is not part of the
> prologue:
>   emit_prologue();
>   if (seen_ld_abs)
>           emit_load_skb_data_hlen();
> and tail_call jumps into the next program as:
>   EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE); /* add rax, prologue_size */
>   EMIT2(0xFF, 0xE0);                      /* jmp rax */
> whereas inside emit_prologue() we have:
>   BUILD_BUG_ON(cnt != PROLOGUE_SIZE);
>
> arm64 has similar prologue skipping code and it's even simpler than
> x86, since it doesn't try to optimize LD_ABS/IND in assembler and
> instead calls into bpf_load_pointer() from generated code, so no
> caching of skb values at all.
>
> s390 jit has partial skipping of prologue, since a bunch of registers
> are saved/restored during tail_call and it looks fine to me as well.

And ppc64 does unwinding/tearing down the stack of the prog before
jumping into the other program. Thus, no skipping of the other
program's prologue; looks fine, too.

> It's very hard to extend test_bpf.ko with tail_calls, since maps need
> to be allocated and populated with file descriptors, which is not
> feasible to do from a .ko. Instead we need a user space based test for
> it. We've started building one in
> tools/testing/selftests/bpf/test_progs.c
> Many more tests need to be added. Thorough testing of tail_calls is on
> the todo list.
Re: [PATCH v3 net-next RFC] Generic XDP
On Mon, Apr 17, 2017 at 03:49:55PM -0400, David Miller wrote:
> From: Jesper Dangaard Brouer
> Date: Sun, 16 Apr 2017 22:26:01 +0200
>
> > The bpf tail-call use-case is a very good example of why the
> > verifier cannot deduce the needed HEADROOM upfront.
>
> This brings up a very interesting question for me.
>
> I notice that tail calls are implemented by JITs largely by skipping
> over the prologue of that destination program.
>
> However, many JITs preload cached SKB values into fixed registers in
> the prologue. But they only do this if the program being JITed needs
> those values.
>
> So how can it work properly if a program that does not need the SKB
> values tail calls into one that does?

For x86 JIT it's fine, since caching of skb values is not part of the
prologue:

  emit_prologue();
  if (seen_ld_abs)
          emit_load_skb_data_hlen();

and tail_call jumps into the next program as:

  EMIT4(0x48, 0x83, 0xC0, PROLOGUE_SIZE); /* add rax, prologue_size */
  EMIT2(0xFF, 0xE0);                      /* jmp rax */

whereas inside emit_prologue() we have:

  BUILD_BUG_ON(cnt != PROLOGUE_SIZE);

arm64 has similar prologue skipping code and it's even simpler than
x86, since it doesn't try to optimize LD_ABS/IND in assembler and
instead calls into bpf_load_pointer() from generated code, so no
caching of skb values at all.

s390 jit has partial skipping of prologue, since a bunch of registers
are saved/restored during tail_call and it looks fine to me as well.

It's very hard to extend test_bpf.ko with tail_calls, since maps need
to be allocated and populated with file descriptors, which is not
feasible to do from a .ko. Instead we need a user space based test for
it. We've started building one in
tools/testing/selftests/bpf/test_progs.c
Many more tests need to be added. Thorough testing of tail_calls is on
the todo list.
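The prologue-skipping scheme described above can be modeled in plain C as a toy (this is only an illustration of the control flow, not the real JIT): each "program" has a fixed-size prologue and a body, a tail call enters the callee's body directly and reuses the caller's frame, and anything the body depends on - like the skb data/hlen caching - must be emitted outside the skipped prologue.

```c
#include <assert.h>

/* Toy model of tail-call prologue skipping.  All names are
 * illustrative; only the structure mirrors the JIT behavior. */
struct ctx { int frame_ready; int skb_cached; };

static int body_b(struct ctx *c)
{
	/* body_b needs cached skb values; in the x86 JIT the caching
	 * (emit_load_skb_data_hlen) is emitted after the prologue,
	 * so it still runs when the prologue is skipped */
	if (!c->skb_cached)
		c->skb_cached = 1;
	return c->frame_ready && c->skb_cached;
}

static int tail_call(struct ctx *c, int (*callee_body)(struct ctx *))
{
	/* analogue of "add rax, PROLOGUE_SIZE; jmp rax":
	 * jump straight past the callee's prologue */
	return callee_body(c);
}

static int entry_a(struct ctx *c)
{
	c->frame_ready = 1;	/* prologue of program A sets up the frame */
	/* program A never touches the skb, caches nothing, and
	 * tail-calls into program B, which does need the cache */
	return tail_call(c, body_b);
}
```

Because the caching step lives in the callee's body rather than its skipped prologue, the tail call from a non-skb program into an skb-using one still ends up with valid cached values - which is the point of the x86 layout quoted above.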
Re: [PATCH v3 net-next RFC] Generic XDP
From: Jesper Dangaard Brouer
Date: Sun, 16 Apr 2017 22:26:01 +0200

> The bpf tail-call use-case is a very good example of why the
> verifier cannot deduce the needed HEADROOM upfront.

This brings up a very interesting question for me.

I notice that tail calls are implemented by JITs largely by skipping
over the prologue of that destination program.

However, many JITs preload cached SKB values into fixed registers in
the prologue. But they only do this if the program being JITed needs
those values.

So how can it work properly if a program that does not need the SKB
values tail calls into one that does?

Daniel, Alexei?
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, 14 Apr 2017 17:46:44 -0700 Alexei Starovoitov wrote:
> But that's probably a bad idea, since I was assuming that such ring
> reconfiguration can be made fast, which is unlikely.
> If it takes seconds to set up a ring, then drivers should just
> assume XDP_PACKET_HEADROOM, since at that time the program
> properties are unknown and in the future other programs will be
> loaded.
>
> Take a look at our current setup with slots for xdp_dump, ddos, lb
> programs. Only the root program is attached at the beginning.
> If the driver configures the ring for such an empty program, that
> would break dynamic addition of the lb prog.
> The driver must not interrupt the traffic when the user adds another
> prog to the prog array. In such a case the XDP side doesn't even know
> that a prog array is used. It's all happening purely on the bpf side.

The bpf tail-call use-case is a very good example of why the verifier
cannot deduce the needed HEADROOM upfront.

Could we still make the verifier reject a program getting attached as a
tail-call when a too low/small HEADROOM has been set up? (to satisfy
the program's needs)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
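Jesper's suggested check could conceptually live at prog-array update time rather than in the verifier itself. A minimal sketch, with made-up struct and function names (the real kernel structures differ): compare the program's verifier-derived worst-case headroom against what the ring was configured with, and refuse the insertion on mismatch.

```c
#include <assert.h>

#define XDP_PACKET_HEADROOM 256

struct bpf_prog { int max_headroom; };	/* verifier's estimate */
struct xdp_ring { int headroom; };	/* what the driver set up */

/* Hypothetical attach-time gate: reject adding a program to a
 * tail-call array when the ring cannot satisfy its headroom need. */
static int prog_array_update(const struct xdp_ring *ring,
			     const struct bpf_prog *prog)
{
	if (prog->max_headroom > ring->headroom)
		return -1;	/* would be -ENOSPC or similar in-kernel */
	return 0;
}
```

This keeps the "driver must not interrupt traffic" property from the quoted text: a program that fits is added without reconfiguration, and one that doesn't is rejected before it can run.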
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, 14 Apr 2017 17:46:44 -0700, Alexei Starovoitov wrote:
> On Fri, Apr 14, 2017 at 03:30:32PM -0700, Jakub Kicinski wrote:
> > On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:
> > > On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > > > >
> > > > > We are consistently finding that there is this real need to
> > > > > communicate XDP capabilities, or somehow verify that the needs
> > > > > of an XDP program can be satisfied by a given implementation.
> > > >
> > > > I fully agree that we need some way to express capabilities[1]
> > > >
> > > > > Maximum headroom is just one.
> > >
> > > I don't like the idea of asking the program author to explain
> > > capabilities to the kernel. Right now the verifier already
> > > understands more about the program than the human does. If the
> > > verifier cannot deduce from the insns what the program will be
> > > doing, it's pretty much a guarantee that the program author has
> > > no idea either.
> > > If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> > > the users will just pass something like 64 or 128 whereas the
> > > program might only be doing IPIP encap, and that will cause the
> > > kernel to provide extra headroom for no good reason or reject the
> > > program whereas it could have run just fine.
> > >
> > > So I very much agree with this part:
> > > > ... or somehow verify that the needs
> > > > of an XDP program can be satisfied by a given implementation.
> > >
> > > we already have cb_access, dst_needed, xdp_adjust_head flags
> > > that the verifier discovers in the program.
> > > For headroom we need one more. The verifier can see
> > > the constant passed into bpf_xdp_adjust_head().
> > > It's trickier if it's a variable, but the verifier can estimate
> > > min/max of the variable already, and worst case it can say that
> > > it will be XDP_PACKET_HEADROOM.
> > > If the program is doing variable length bpf_xdp_adjust_head(),
> > > the human has no idea how much they need and cannot tell the
> > > kernel anyway.
> >
> > If I understand correctly the current proposal is to:
> >
> > 1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
> > 2. Make verifier try to lower that through byte code analysis.
> > 3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).
>
> no. the hard fail will be wrong, since the verifier is conservative.
> As i showed in the example with variable ip hdr length.
>
> > We may expect most users won't care and we may want to still depend
> > on the verifier to analyze the program and enable optimizations, but
> > depending on the verifier for proving prepend lengths is scary. Or
> > are we just discussing the optimizations here and not the guarantees
> > about headroom availability?
>
> yes. I was thinking of it as an optimization. If the verifier can see
> that a prog needs only 20 bytes of headroom or no headroom (like
> netronome already does) then the ring can be configured differently
> for hopefully better performance.
> But that's probably a bad idea, since I was assuming that such ring
> reconfiguration can be made fast, which is unlikely.
> If it takes seconds to set up a ring, then drivers should just
> assume XDP_PACKET_HEADROOM, since at that time the program
> properties are unknown and in the future other programs will be
> loaded.
> Take a look at our current setup with slots for xdp_dump, ddos, lb
> programs. Only the root program is attached at the beginning.
> If the driver configures the ring for such an empty program, that
> would break dynamic addition of the lb prog.
> The driver must not interrupt the traffic when the user adds another
> prog to the prog array. In such a case the XDP side doesn't even know
> that a prog array is used. It's all happening purely on the bpf side.
>
> > Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome
> > cards and then be in for a nasty surprise when they try to load the
> > same program for Intel parts :(
>
> are you talking hw offloaded prog or normal in-driver xdp?

In-driver, mostly. As you said earlier in the thread we expect
Alex's/Intel's patches to provide "only" 192B of headroom. For the NFP
offload there are trade-offs to a full 256B offset. I think we could
just fall back to the host, but if the user hints that the headroom is
big perhaps we would be better off not even trying to offload.

> > I agree that trying to express capabilities to user space is quickly
> > going to become more confusing than helpful. However, I think here
> > we need the opposite - we need an opt-in way of the program
> > requesting guarantees about the datapath.
>
> what is an example of what you're proposing? How will it look from the
> user's pov?

This is roughly what I had in mind. From the user's perspective it
would probably make sense to tie it to the program itself, so:

bpf_helpers.h:

  #define __XDP_ATTR \
          __attribute__((section(ELF_SECTION_XDP_ATTRS), used))
  #define XDP_GUARANTEE_HEADROOM(val) \
          __u32
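Jakub's sketch breaks off mid-macro in the archive. Purely as a hypothetical completion of the idea (not a quote of his actual code, and all names beyond what he wrote are invented): the program would declare the headroom it wants as a datum in a dedicated ELF section, which the loader can read out before attach and turn into a datapath guarantee request.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical continuation of the truncated bpf_helpers.h sketch:
 * the attribute places a named value into a section the loader
 * knows to scan.  The section name and variable name are made up. */
#define ELF_SECTION_XDP_ATTRS "xdp_attrs"
#define __XDP_ATTR \
	__attribute__((section(ELF_SECTION_XDP_ATTRS), used))
#define XDP_GUARANTEE_HEADROOM(val) \
	__XDP_ATTR uint32_t xdp_attr_headroom = (val)

/* the program author opts in to a 192-byte headroom guarantee */
XDP_GUARANTEE_HEADROOM(192);
```

An ELF-aware loader would then find the `xdp_attrs` section in the object file and pass the value along at load/attach time; the in-memory variable here just demonstrates that the macro expands to a well-formed definition.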
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, Apr 14, 2017 at 03:30:32PM -0700, Jakub Kicinski wrote:
> On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:
> > On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > > >
> > > > We are consistently finding that there is this real need to
> > > > communicate XDP capabilities, or somehow verify that the needs
> > > > of an XDP program can be satisfied by a given implementation.
> > >
> > > I fully agree that we need some way to express capabilities[1]
> > >
> > > > Maximum headroom is just one.
> >
> > I don't like the idea of asking the program author to explain
> > capabilities to the kernel. Right now the verifier already
> > understands more about the program than the human does. If the
> > verifier cannot deduce from the insns what the program will be
> > doing, it's pretty much a guarantee that the program author has no
> > idea either.
> > If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> > the users will just pass something like 64 or 128 whereas the
> > program might only be doing IPIP encap, and that will cause the
> > kernel to provide extra headroom for no good reason or reject the
> > program whereas it could have run just fine.
> >
> > So I very much agree with this part:
> > > ... or somehow verify that the needs
> > > of an XDP program can be satisfied by a given implementation.
> >
> > we already have cb_access, dst_needed, xdp_adjust_head flags
> > that the verifier discovers in the program.
> > For headroom we need one more. The verifier can see
> > the constant passed into bpf_xdp_adjust_head().
> > It's trickier if it's a variable, but the verifier can estimate
> > min/max of the variable already, and worst case it can say that it
> > will be XDP_PACKET_HEADROOM.
> > If the program is doing variable length bpf_xdp_adjust_head(),
> > the human has no idea how much they need and cannot tell the kernel
> > anyway.
>
> If I understand correctly the current proposal is to:
>
> 1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
> 2. Make verifier try to lower that through byte code analysis.
> 3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).

no. the hard fail will be wrong, since the verifier is conservative.
As i showed in the example with variable ip hdr length.

> We may expect most users won't care and we may want to still depend on
> the verifier to analyze the program and enable optimizations, but
> depending on the verifier for proving prepend lengths is scary. Or are
> we just discussing the optimizations here and not the guarantees about
> headroom availability?

yes. I was thinking of it as an optimization. If the verifier can see
that a prog needs only 20 bytes of headroom or no headroom (like
netronome already does) then the ring can be configured differently for
hopefully better performance.
But that's probably a bad idea, since I was assuming that such ring
reconfiguration can be made fast, which is unlikely.
If it takes seconds to set up a ring, then drivers should just assume
XDP_PACKET_HEADROOM, since at that time the program properties are
unknown and in the future other programs will be loaded.
Take a look at our current setup with slots for xdp_dump, ddos, lb
programs. Only the root program is attached at the beginning.
If the driver configures the ring for such an empty program, that would
break dynamic addition of the lb prog.
The driver must not interrupt the traffic when the user adds another
prog to the prog array. In such a case the XDP side doesn't even know
that a prog array is used. It's all happening purely on the bpf side.

> Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome
> cards and then be in for a nasty surprise when they try to load the
> same program for Intel parts :(

are you talking hw offloaded prog or normal in-driver xdp?

> I agree that trying to express capabilities to user space is quickly
> going to become more confusing than helpful. However, I think here we
> need the opposite - we need an opt-in way of the program requesting
> guarantees about the datapath.

what is an example of what you're proposing? How will it look from the
user's pov?
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, 14 Apr 2017 12:28:15 -0700, Alexei Starovoitov wrote:
> On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > >
> > > We are consistently finding that there is this real need to
> > > communicate XDP capabilities, or somehow verify that the needs
> > > of an XDP program can be satisfied by a given implementation.
> >
> > I fully agree that we need some way to express capabilities[1]
> >
> > > Maximum headroom is just one.
>
> I don't like the idea of asking the program author to explain
> capabilities to the kernel. Right now the verifier already understands
> more about the program than the human does. If the verifier cannot
> deduce from the insns what the program will be doing, it's pretty much
> a guarantee that the program author has no idea either.
> If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> the users will just pass something like 64 or 128 whereas the program
> might only be doing IPIP encap, and that will cause the kernel to
> provide extra headroom for no good reason or reject the program
> whereas it could have run just fine.
>
> So I very much agree with this part:
> > ... or somehow verify that the needs
> > of an XDP program can be satisfied by a given implementation.
>
> we already have cb_access, dst_needed, xdp_adjust_head flags
> that the verifier discovers in the program.
> For headroom we need one more. The verifier can see
> the constant passed into bpf_xdp_adjust_head().
> It's trickier if it's a variable, but the verifier can estimate
> min/max of the variable already, and worst case it can say that it
> will be XDP_PACKET_HEADROOM.
> If the program is doing variable length bpf_xdp_adjust_head(),
> the human has no idea how much they need and cannot tell the kernel
> anyway.

If I understand correctly the current proposal is to:

1. Assume program needs 256 (XDP_PACKET_HEADROOM) of prepend.
2. Make verifier try to lower that through byte code analysis.
3. Make the load fail if (prog->max_prepend > driver->xdp_max_prepend).

Would that not cause unnecessary load failures? I'm worried it would
force users to do things like this:

  struct map_value *val;

  if (val->prep_len > 64) /* Prove to verifier this is short */
          return XDP_DROP;
  if (xdp_adjust_head(xdp, -val->prep_len))
          ...

Or worse not do that, use Qlogic, Broadcom, Mellanox or Netronome cards
and then be in for a nasty surprise when they try to load the same
program for Intel parts :(

I agree that trying to express capabilities to user space is quickly
going to become more confusing than helpful. However, I think here we
need the opposite - we need an opt-in way of the program requesting
guarantees about the datapath.

We may expect most users won't care and we may want to still depend on
the verifier to analyze the program and enable optimizations, but
depending on the verifier for proving prepend lengths is scary. Or are
we just discussing the optimizations here and not the guarantees about
headroom availability?
Re: [PATCH v3 net-next RFC] Generic XDP
On 04/14/2017 09:28 PM, Alexei Starovoitov wrote:
> On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> > >
> > > We are consistently finding that there is this real need to
> > > communicate XDP capabilities, or somehow verify that the needs
> > > of an XDP program can be satisfied by a given implementation.
> >
> > I fully agree that we need some way to express capabilities[1]
> >
> > > Maximum headroom is just one.
>
> I don't like the idea of asking the program author to explain
> capabilities to the kernel. Right now the verifier already understands
> more about the program than the human does. If the verifier cannot
> deduce from the insns what the program will be doing, it's pretty much
> a guarantee that the program author has no idea either.
> If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
> the users will just pass something like 64 or 128 whereas the program
> might only be doing IPIP encap, and that will cause the kernel to
> provide extra headroom for no good reason or reject the program
> whereas it could have run just fine.

Fully agree, such an extension is likely to be used wrongly or with
some default size, as we have right now with XDP_PACKET_HEADROOM, to
cover most use cases in order not to bother the user with dealing with
this, respectively not to complicate things more.

> So I very much agree with this part:
> > ... or somehow verify that the needs
> > of an XDP program can be satisfied by a given implementation.
>
> we already have cb_access, dst_needed, xdp_adjust_head flags
> that the verifier discovers in the program.
> For headroom we need one more. The verifier can see
> the constant passed into bpf_xdp_adjust_head().
> It's trickier if it's a variable, but the verifier can estimate
> min/max of the variable already, and worst case it can say that it
> will be XDP_PACKET_HEADROOM.

+1, we should try hard to reuse such information from the verifier to
determine specific requirements the program has, and check against them
at prog attach time. This works okay so far for the already mentioned
bits in struct bpf_prog, and could be further extended.

> If the program is doing variable length bpf_xdp_adjust_head(),
> the human has no idea how much they need and cannot tell the kernel
> anyway.
Re: [PATCH v3 net-next RFC] Generic XDP
On Thu, Apr 13, 2017 at 11:38:10AM -0400, David Miller wrote:
> From: Johannes Berg
> Date: Thu, 13 Apr 2017 08:10:56 +0200
>
> > On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote:
> >
> > > > +	if (skb_linearize(skb))
> > > > +		goto do_drop;
> > >
> > > when we discussed supporting jumbo frames in XDP, the idea
> > > was that the program would need to look at first 3+k bytes only
> > > and the rest of the packet will be in non-contiguous pages.
> > > If we do that, it means that XDP program would have to assume
> > > that the packet is more than [data, data_end] and this range
> > > only covers the linear part.
> > > If that's the future, we don't need to linearize the skb here
> > > and can let the program access headlen only.
> >
> > I'm not sure how you think that would work - at least with our
> > (wifi) driver, the headlen should be maybe ETH_HLEN or so at this
> > point. We'd let the program know that it can only look at so much,
> > but then the program can't do anything at all with those frames. At
> > some point then we go back to bpf_skb_load_bytes() being necessary
> > in one form or another, no?
>
> Agreed, this is completely unusable. Some wired ethernet drivers do
> the same exact thing.

ahh. i thought all drivers do at least copy-break (256) bytes or copy
get_headlen or build_skb the whole thing.
Since wireless does eth_hlen, then yeah, skb_linearize() is the only
way.
I guess any driver that would care about XDP performance would either
implement in-driver XDP or make sure that skb_linearize() doesn't
happen in generic XDP by doing build_skb() with the whole packet.
The driver can be smart and avoid doing copy-break if generic XDP is
on.
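The driver-side choice sketched at the end of that mail - skip copy-break when generic XDP is enabled, so the skb keeps the whole packet linear and generic XDP never has to skb_linearize() - boils down to one extra condition in the rx path. A minimal sketch with an illustrative device struct and threshold (the 256-byte value matches the copy-break size mentioned in the thread, but real drivers use their own constants):

```c
#include <assert.h>

#define COPYBREAK 256	/* typical copy-break threshold, per the thread */

struct rx_dev { int generic_xdp_on; };	/* illustrative, not a real struct */

/* Decide whether to copy a small frame into a fresh skb (copy-break)
 * or hand the full buffer up via build_skb().  With generic XDP on,
 * always keep the frame whole so the XDP path sees all the data. */
static int should_copybreak(const struct rx_dev *dev, int pkt_len)
{
	if (dev->generic_xdp_on)
		return 0;
	return pkt_len <= COPYBREAK;
}
```

The cost is slightly more memory pressure for small frames while generic XDP is attached, in exchange for never hitting the linearize-and-drop path.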
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, Apr 14, 2017 at 11:05:25AM +0200, Jesper Dangaard Brouer wrote:
> >
> > We are consistently finding that there is this real need to
> > communicate XDP capabilities, or somehow verify that the needs
> > of an XDP program can be satisfied by a given implementation.
>
> I fully agree that we need some way to express capabilities[1]
>
> > Maximum headroom is just one.

I don't like the idea of asking the program author to explain
capabilities to the kernel. Right now the verifier already understands
more about the program than the human does. If the verifier cannot
deduce from the insns what the program will be doing, it's pretty much
a guarantee that the program author has no idea either.
If we add 'required_headroom' as an extra flag to BPF_PROG_LOAD,
the users will just pass something like 64 or 128 whereas the program
might only be doing IPIP encap, and that will cause the kernel to
provide extra headroom for no good reason or reject the program whereas
it could have run just fine.

So I very much agree with this part:
> ... or somehow verify that the needs
> of an XDP program can be satisfied by a given implementation.

we already have cb_access, dst_needed, xdp_adjust_head flags
that the verifier discovers in the program.
For headroom we need one more. The verifier can see
the constant passed into bpf_xdp_adjust_head().
It's trickier if it's a variable, but the verifier can estimate min/max
of the variable already, and worst case it can say that it will be
XDP_PACKET_HEADROOM.
If the program is doing variable length bpf_xdp_adjust_head(),
the human has no idea how much they need and cannot tell the kernel
anyway.
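The min/max estimation described above can be sketched in a few lines (the real verifier's range tracking is far more involved; struct and function names here are illustrative only). Given the verifier's known [min, max] range for the delta passed to bpf_xdp_adjust_head(), the program's worst-case headroom consumption is bounded by -min, capped at the global limit when the range is unknown or huge:

```c
#include <assert.h>

#define XDP_PACKET_HEADROOM 256

/* simplified stand-in for the verifier's per-register range state */
struct reg_state { long min_val, max_val; };

/* Worst-case headroom a bpf_xdp_adjust_head(xdp, delta) call can
 * consume: a negative delta grows the packet into the headroom. */
static int max_headroom_use(const struct reg_state *delta)
{
	if (delta->min_val >= 0)
		return 0;	/* only ever shrinks the packet */
	if (-delta->min_val > XDP_PACKET_HEADROOM)
		return XDP_PACKET_HEADROOM;	/* unknown/huge: worst case */
	return (int)-delta->min_val;
}
```

A constant delta of -20 yields a tight bound of 20 bytes; a delta whose lower bound the verifier cannot pin down falls back to XDP_PACKET_HEADROOM, which is exactly the conservative answer the mail describes.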
Re: [PATCH v3 net-next RFC] Generic XDP
On Fri, Apr 14, 2017 at 10:07:40AM +0200, Johannes Berg wrote: > On Thu, 2017-04-13 at 16:01 -0400, David Miller wrote: > > From: Johannes Berg> > Date: Thu, 13 Apr 2017 21:22:21 +0200 > > > > > OTOH, it might depend on the frame data itself, if the program does > > > something like > > > > > > xdp->data[xdp->data[0] & 0xf] > > > > > > (read or write, doesn't really matter) so then the verifier would > > have > > > to take the maximum possible value there into account. > > > > I am not well versed enough with the verifier to understand exactly > > how and to what extent SKB accesses are validated by the verifier. > > > > My, perhaps mistaken, impression is that access range validation is > > still at least partially done at run time. > > I think you're right for SKB accesses, but I'm pretty sure that for XDP > the verifier checks that the program can't possibly access outside of > [xdp->data, xdp->data_end], see compare_ptrs_to_packet(). > > This checks that comparisons to data_end are all there, i.e. that the > program verifies it may access some bytes before actually doing so. > However, the program could start with > > if (xdp->data_end < xdp->data + 1024) > return DROP; > [...] > > and then the verifier would consider it safe. > > Still, it has to track down into the [...] code to actually understand > that it now didn't try to access xdp->data+1025, and as such it should > be able to determine the maximum desired offset. > > However, I'm coming to realize that may not necessarily mean that the > program really *needs* to access that data. > > For example, a hypothetical wifi program might want to recalculate and > compare the CRC checksum (for whatever reason, say a driver/hardware > bug). This would require accessing the last 4 bytes of the packet, > which may not be present. 
The program could, however, accept that > sometimes this isn't possible, and simply accept frames when it can't > see the last 4 bytes (or if the last 4 bytes aren't the CRC because > that isn't reported now, but whatever, I'm handwaving anyway.) > > So perhaps this isn't really a good idea. The program should probably > separately say how much data it really *needs* there, and then perhaps > a warning could be emitted if it never accesses the data that it > advertises as needing (i.e. if it says "I want 1024 bytes" but then > can't possibly read more than 14) The above description is correct. Also note that the offset within the packet can be a variable as well, since that's the only way to do variable-length parsing (like ip options), so the verifier cannot tell how far into the packet the program is going to parse. The program author cannot tell that either! Consider samples/bpf/parse_varlen.c: ihl_len = iph->ihl * 4; if (iph->protocol == IPPROTO_IPIP) { iph = data + nh_off + ihl_len; if (iph + 1 > data_end) return 0; 'ihl' is 4 bits, so the 2nd ip hdr may start as far as sizeof(eth) + 60 bytes in, but we cannot ask the driver to always provide that many bytes. Technically we can add a feature to the verifier to calculate the max range into the packet that the program may look at, but it will be large. Like if the program parses IPIP the range will be 64 (first ip) + 64 (second ip) + 64 (if it looks at tcp options). If the program parses TLVs then the range will be 64k, so I'm not sure what the value of such a verifier feature would be.
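The ihl arithmetic above can be checked in plain C. This is a userspace sketch (not the BPF sample itself; helper names are made up) of the worst-case offset of the inner IP header in an IPIP packet: ihl is a 4-bit field of 32-bit words, so the outer header can be up to 15 * 4 = 60 bytes.

```c
#include <stddef.h>

#define ETH_HLEN 14	/* standard Ethernet header length */

/* Offset of the inner IP header in an IPIP packet, given the outer
 * header's ihl field (in 32-bit words). */
static size_t inner_iph_offset(unsigned int outer_ihl)
{
	return ETH_HLEN + outer_ihl * 4;
}

/* The bound the verifier would have to assume: ihl is 4 bits, max 15. */
static size_t worst_case_inner_offset(void)
{
	return inner_iph_offset(15);
}
```

For a plain 20-byte outer header the inner header starts at offset 34; the provable worst case is 74, which is why Alexei says the verifier-computed range would be large.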
Re: [PATCH v3 net-next RFC] Generic XDP
On Thu, 13 Apr 2017 11:37:22 -0400 (EDT) David Miller wrote: > From: Alexei Starovoitov > Date: Wed, 12 Apr 2017 21:20:38 -0700 > > > On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote: [...] > > If the capability is variable, it must be communicated to the user > somehow at program load time. > > We are consistently finding that there is this real need to > communicate XDP capabilities, or somehow verify that the needs > of an XDP program can be satisfied by a given implementation. I fully agree that we need some way to express capabilities[1] [1] http://prototype-kernel.readthedocs.io/en/latest/networking/XDP/design/design.html#capabilities-negotiation > Maximum headroom is just one. [...] > > We can only optimize this and elide things when we have a facility in > the future for the program to express its needs precisely. I think > we will have to add some control structure to XDP programs that can > be filled in for this purpose. I fully agree that we need some control structure for XDP programs. My previous attempt was shot down due to performance concerns about an extra pointer dereference. As I explained before, this is not a concern, as the dereference will happen once per N packets in the NAPI loop. Plus now we see a need to elide things based on the facilities the XDP program chooses to use/enable, for performance reasons. I would prefer keeping these facility settings in the control structure of XDP programs, instead of pulling in derived bits at runtime. Again remember, adding if/branch statements checking for facilities should have little performance impact, as the branch predictor should guess correctly given we process N packets in the NAPI loop with the same facilities. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v3 net-next RFC] Generic XDP
On Thu, 2017-04-13 at 16:01 -0400, David Miller wrote: > From: Johannes Berg > Date: Thu, 13 Apr 2017 21:22:21 +0200 > > > OTOH, it might depend on the frame data itself, if the program does > > something like > > > > xdp->data[xdp->data[0] & 0xf] > > > > (read or write, doesn't really matter) so then the verifier would > have > > to take the maximum possible value there into account. > > I am not well versed enough with the verifier to understand exactly > how and to what extent SKB accesses are validated by the verifier. > > My, perhaps mistaken, impression is that access range validation is > still at least partially done at run time. I think you're right for SKB accesses, but I'm pretty sure that for XDP the verifier checks that the program can't possibly access outside of [xdp->data, xdp->data_end], see compare_ptrs_to_packet(). This checks that comparisons to data_end are all there, i.e. that the program verifies it may access some bytes before actually doing so. However, the program could start with if (xdp->data_end < xdp->data + 1024) return DROP; [...] and then the verifier would consider it safe. Still, it has to track down into the [...] code to actually understand that it now didn't try to access xdp->data+1025, and as such it should be able to determine the maximum desired offset. However, I'm coming to realize that may not necessarily mean that the program really *needs* to access that data. For example, a hypothetical wifi program might want to recalculate and compare the CRC checksum (for whatever reason, say a driver/hardware bug). This would require accessing the last 4 bytes of the packet, which may not be present. The program could, however, accept that sometimes this isn't possible, and simply accept frames when it can't see the last 4 bytes (or if the last 4 bytes aren't the CRC because that isn't reported now, but whatever, I'm handwaving anyway.) So perhaps this isn't really a good idea. 
The program should probably separately say how much data it really *needs* there, and then perhaps a warning could be emitted if it never accesses the data that it advertises as needing (i.e. if it says "I want 1024 bytes" but then can't possibly read more than 14) johannes
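The bounds-check idiom Johannes describes can be modeled in userspace C. Names mirror struct xdp_md, but this is an illustrative sketch, not a loadable BPF program: every read must be guarded by a comparison against data_end, which is exactly what the verifier insists on.

```c
/* Userspace model of the XDP bounds-check pattern: an access is only
 * legal after proving it stays below data_end.  The verifier rejects
 * programs whose reads are not dominated by such a check. */
struct ctx {
	unsigned char *data;
	unsigned char *data_end;
};

/* Return the byte at 'off', or -1 if it would read past data_end
 * (the path the verifier forces the program to take). */
static int safe_read(const struct ctx *c, unsigned long off)
{
	if (c->data + off + 1 > c->data_end)
		return -1;
	return c->data[off];
}
```

From the checks alone a static analysis can bound the maximum offset ever dereferenced, which is the "maximum desired offset" the mail talks about.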
Re: [PATCH v3 net-next RFC] Generic XDP
From: Johannes Berg Date: Thu, 13 Apr 2017 21:22:21 +0200 > OTOH, it might depend on the frame data itself, if the program does > something like > > xdp->data[xdp->data[0] & 0xf] > > (read or write, doesn't really matter) so then the verifier would have > to take the maximum possible value there into account. I am not well versed enough with the verifier to understand exactly how and to what extent SKB accesses are validated by the verifier. My, perhaps mistaken, impression is that access range validation is still at least partially done at run time.
Re: [PATCH v3 net-next RFC] Generic XDP
On Thu, 2017-04-13 at 11:37 -0400, David Miller wrote: > If the capability is variable, it must be communicated to the user > somehow at program load time. > > We are consistently finding that there is this real need to > communicate XDP capabilities, or somehow verify that the needs > of an XDP program can be satisfied by a given implementation. Technically, once you know the capability of the *driver*, the verifier should be able to check if the *program* is compatible. So if the driver can guarantee "you always get 2k accessible", the verifier can check that you don't access more than xdp->data + 2047, similar to how it verifies that you don't access beyond xdp->data_end. > And eth_get_headlen() only pulls protocol headers, which precludes > XDP inspecting anything below TCP/UDP/etc. This is also not > reasonable. > > Right now, as it stands, we have to assume the program can > potentially be interested in the entire packet. I agree with this though, it's not reasonable to have wildly varying implementations here that may or may not be able to access almost anything. The totally degenerate case would be having no skb header at all, which is also still entirely valid from the network stack's POV. > We can only optimize this and elide things when we have a facility in > the future for the program to express its needs precisely. I think > we will have to add some control structure to XDP programs that can > be filled in for this purpose. Like I said above, I think this is something that you can possibly determine in the verifier. So if, for example, the verifier notices that the program never accesses anything but the first few bytes, then it would seem valid to run with only that much pulled into the skb header. OTOH, it might depend on the frame data itself, if the program does something like xdp->data[xdp->data[0] & 0xf] (read or write, doesn't really matter) so then the verifier would have to take the maximum possible value there into account. johannes
Re: [PATCH v3 net-next RFC] Generic XDP
From: Daniel Borkmann Date: Thu, 13 Apr 2017 17:57:06 +0200 > On 04/12/2017 08:54 PM, David Miller wrote: > [...] >> +static u32 netif_receive_generic_xdp(struct sk_buff *skb, >> + struct bpf_prog *xdp_prog) >> +{ >> +struct xdp_buff xdp; >> +u32 act = XDP_DROP; >> +void *orig_data; >> +int hlen, off; >> + >> +if (skb_linearize(skb)) > > Btw, given the skb can come from all kinds of points in the stack, > it could also be a clone at this point. One example is act_mirred > which in fact does skb_clone() and can push the skb back to > ingress path through netif_receive_skb() and thus could then go > into generic xdp processing, where skb can be mangled. > > Instead of skb_linearize() we would therefore need to use something > like skb_ensure_writable(skb, skb->len) as equivalent, which also > makes sure that we unclone whenever needed. We could use skb_cow() for this purpose, which deals with cloning as well as enforcing headroom. However, thinking further about this, the goal is to make generic XDP match precisely how in-driver XDP behaves. Therefore, such redirects from act_mirred would never flow through the XDP path. No other possibility can cause us to see a cloned packet here, we are before network taps are processed, etc. So in my opinion the thing to do is to elide generic XDP if the SKB is cloned.
Re: [PATCH v3 net-next RFC] Generic XDP
On 04/12/2017 08:54 PM, David Miller wrote: [...] +static u32 netif_receive_generic_xdp(struct sk_buff *skb, +struct bpf_prog *xdp_prog) +{ + struct xdp_buff xdp; + u32 act = XDP_DROP; + void *orig_data; + int hlen, off; + + if (skb_linearize(skb)) Btw, given the skb can come from all kinds of points in the stack, it could also be a clone at this point. One example is act_mirred which in fact does skb_clone() and can push the skb back to ingress path through netif_receive_skb() and thus could then go into generic xdp processing, where skb can be mangled. Instead of skb_linearize() we would therefore need to use something like skb_ensure_writable(skb, skb->len) as equivalent, which also makes sure that we unclone whenever needed. + goto do_drop; + + /* The XDP program wants to see the packet starting at the MAC +* header. +*/ + hlen = skb_headlen(skb) + skb->mac_len; + xdp.data = skb->data - skb->mac_len; + xdp.data_end = xdp.data + hlen; + xdp.data_hard_start = xdp.data - skb_headroom(skb); + orig_data = xdp.data; + + act = bpf_prog_run_xdp(xdp_prog, &xdp); + + off = xdp.data - orig_data; + if (off) + __skb_push(skb, off); + + switch (act) { + case XDP_TX: + __skb_push(skb, skb->mac_len); + /* fall through */ + case XDP_PASS: + break; + + default: + bpf_warn_invalid_xdp_action(act); + /* fall through */ + case XDP_ABORTED: + trace_xdp_exception(skb->dev, xdp_prog, act); + /* fall through */ + case XDP_DROP: + do_drop: + kfree_skb(skb); + break; + } + + return act; +} + static int netif_receive_skb_internal(struct sk_buff *skb) { int ret; @@ -4258,6 +4341,21 @@ static int netif_receive_skb_internal(struct sk_buff *skb) rcu_read_lock(); + if (static_key_false(&generic_xdp_needed)) { + struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog); + + if (xdp_prog) { + u32 act = netif_receive_generic_xdp(skb, xdp_prog); + + if (act != XDP_PASS) { + rcu_read_unlock(); + if (act == XDP_TX) + dev_queue_xmit(skb); + return NET_RX_DROP; + } + } + } + [...]
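The `off = xdp.data - orig_data` fixup in the hunk above can be modeled in userspace: whatever delta the XDP program applied to xdp.data (via bpf_xdp_adjust_head) has to be applied to the skb's data pointer too. A minimal sketch with a hypothetical struct, not the kernel sk_buff:

```c
/* Userspace model of the post-run bookkeeping in
 * netif_receive_generic_xdp(): the program may have moved xdp.data,
 * and the skb's notion of where the packet starts must move by the
 * same (possibly negative) delta. */
struct fake_skb {
	unsigned char *head;	/* start of the buffer */
	unsigned char *data;	/* current packet start */
};

static void fixup_after_xdp(struct fake_skb *skb,
			    const unsigned char *orig_data,
			    unsigned char *new_data)
{
	long off = new_data - orig_data;	/* negative: header was pushed */

	skb->data += off;	/* kernel uses __skb_push()/__skb_pull() */
}
```

If the program prepended 12 bytes of header, off is -12 and skb->data ends up 12 bytes earlier, which is the effect the __skb_push() in the patch is after.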
Re: [PATCH v3 net-next RFC] Generic XDP
From: Michael Chan Date: Wed, 12 Apr 2017 23:48:22 -0700 > On Wed, Apr 12, 2017 at 11:54 AM, David Miller wrote: >> >> +static struct static_key generic_xdp_needed __read_mostly; >> + >> +static int generic_xdp_install(struct net_device *dev, struct netdev_xdp >> *xdp) >> +{ >> + struct bpf_prog *new = xdp->prog; >> + int ret = 0; >> + >> + switch (xdp->command) { >> + case XDP_SETUP_PROG: { >> + struct bpf_prog *old = rtnl_dereference(dev->xdp_prog); >> + >> + rcu_assign_pointer(dev->xdp_prog, new); >> + if (old) >> + bpf_prog_put(old); >> + >> + if (old && !new) >> + static_key_slow_dec(&generic_xdp_needed); >> + else if (new && !old) >> + static_key_slow_inc(&generic_xdp_needed); > > I think we need to call dev_disable_lro(dev) to disable LRO also when > a BPF program is installed. Agreed, I'll add this. Thanks Michael.
Re: [PATCH v3 net-next RFC] Generic XDP
From: Johannes Berg Date: Thu, 13 Apr 2017 08:10:56 +0200 > On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote: > >> > + if (skb_linearize(skb)) >> > + goto do_drop; >> >> when we discussed supporting jumbo frames in XDP, the idea >> was that the program would need to look at first 3+k bytes only >> and the rest of the packet will be in non-contiguous pages. >> If we do that, it means that XDP program would have to assume >> that the packet is more than [data, data_end] and this range >> only covers linear part. >> If that's the future, we don't need to linearize the skb here >> and can let the program access headlen only. > > I'm not sure how you think that would work - at least with our (wifi) > driver, the headlen should be maybe ETH_HLEN or so at this point. We'd > let the program know that it can only look at so much, but then the > program can't do anything at all with those frames. At some point then > we go back to bpf_skb_load_bytes() being necessary in one form or > another, no? Agreed, this is completely unusable. Some wired ethernet drivers do the same exact thing.
Re: [PATCH v3 net-next RFC] Generic XDP
From: Alexei Starovoitov Date: Wed, 12 Apr 2017 21:20:38 -0700 > On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote: >> One thing I have not tried to address here is the issue of >> XDP_PACKET_HEADROOM, thanks to Daniel for spotting that. It seems >> incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or >> whatever even if the XDP program doesn't try to push headers at all. >> I think we really need the verifier to somehow propagate whether >> certain XDP helpers are used or not. > > Looks like we need to relax the headroom requirement. > I really wanted to simplify the life of program writers, > but intel drivers insist on 192 headroom already, then for skb > every driver does 64 and netronome doesn't have the room by default at all > even for XDP and relies on expensive copy when xdp_adjust_head is used > so that dream isn't going to come true. > I guess for now every driver should _try_ to give XDP_PACKET_HEADROOM > bytes, but the driver can omit it completely if xdp_adjust_head() > is not used for performance reasons. If the capability is variable, it must be communicated to the user somehow at program load time. We are consistently finding that there is this real need to communicate XDP capabilities, or somehow verify that the needs of an XDP program can be satisfied by a given implementation. Maximum headroom is just one. Simply having bpf_xdp_adjust_head() fail and therefore drop packets isn't good behavior at all. But that is the situation we have right now. >> +if (skb_linearize(skb)) >> +goto do_drop; > > when we discussed supporting jumbo frames in XDP, the idea > was that the program would need to look at first 3+k bytes only > and the rest of the packet will be in non-contiguous pages. > If we do that, it means that XDP program would have to assume > that the packet is more than [data, data_end] and this range > only covers linear part. 
> If that's the future, we don't need to linearize the skb here > and can let the program access headlen only. > It means that we'd need to add 'len' field to 'struct xdp_md' uapi. > For our existing programs (ddos and l4lb) it's not a problem, > since for MTU < 3.xK and physical XDP the driver guarantees that > data_end - data == len, so nothing will break. > So I'm proposing to do two things: > - drop skb_linearize here > - introduce 'len' to 'struct xdp_md' and update it here and > in the existing drivers that support xdp. I don't think this is reasonable. The amount of inconsistency will be quite huge. First of all, every driver pulls a different number of bytes into the linear area when building frag based SKBs on receive. There are drivers that only put the MAC header in the linear area, and depend upon the logic in the stack which does pskb_may_pull() before touching deeper headers. Of course, such drivers should be using eth_get_headlen(). But they don't, and what they are doing is not a bug. And eth_get_headlen() only pulls protocol headers, which precludes XDP inspecting anything below TCP/UDP/etc. This is also not reasonable. Right now, as it stands, we have to assume the program can potentially be interested in the entire packet. We can only optimize this and elide things when we have a facility in the future for the program to express its needs precisely. I think we will have to add some control structure to XDP programs that can be filled in for this purpose. Right now the two things we know would need to be expressed are 1) maximum headroom needed and 2) parsing depth. I'm not 100% clear on how #2 should be specified, but in its simplest sense it could take on two values, either "what eth_get_headlen() returns" or "entire packet" to start with. 
> >> +if (act == XDP_TX) >> +dev_queue_xmit(skb); > > this will go through qdisc which is not a problem per-se, > but may mislead poor users that XDP_TX goes through qdisc > even for in-driver XDP which is not the case. > So I think we need to bypass qdisc somehow. Like > txq = netdev_pick_tx(skb,.. > HARD_TX_LOCK(dev, txq.. > if (!netif_xmit_stopped(txq)) { > dev_hard_start_xmit(skb, dev, txq, > } else { > kfree_skb(skb); > txq->xdp_tx_full++; > } > HARD_TX_UNLOCK(dev, txq); > > this way it will be similar to in-driver XDP which > also has xdp_tx_full counter when HW TX queue is full. Ok, this makes sense. I'll work on that. > Re: csum and skb->encapsulate issues that were raised in the previous thread > > Today all cls_bpf csum helpers are currently disabled for XDP > and if the program mangles the packet and then does XDP_PASS, > the packet will be dropped by the stack due to incorrect csum. > So we're no better here and need a solution for both in-driver XDP and > generic XDP. I'm not so sure how I feel about checksums, I've been busy following the thread started by Tom. I want to strive for something simple. > I think the green
Re: [PATCH v3 net-next RFC] Generic XDP
On Wed, Apr 12, 2017 at 11:54 AM, David Miller wrote: > > +static struct static_key generic_xdp_needed __read_mostly; > + > +static int generic_xdp_install(struct net_device *dev, struct netdev_xdp > *xdp) > +{ > + struct bpf_prog *new = xdp->prog; > + int ret = 0; > + > + switch (xdp->command) { > + case XDP_SETUP_PROG: { > + struct bpf_prog *old = rtnl_dereference(dev->xdp_prog); > + > + rcu_assign_pointer(dev->xdp_prog, new); > + if (old) > + bpf_prog_put(old); > + > + if (old && !new) > + static_key_slow_dec(&generic_xdp_needed); > + else if (new && !old) > + static_key_slow_inc(&generic_xdp_needed); I think we need to call dev_disable_lro(dev) to disable LRO also when a BPF program is installed. > + break; > + } > + > + case XDP_QUERY_PROG: > + xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog); > + break; > + > + default: > + ret = -EINVAL; > + break; > + } > + > + return ret; > +} > +
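The static-key accounting Michael quotes can be modeled with a plain counter. This userspace sketch (hypothetical names, an int standing in for the jump-label key) shows why the inc/dec conditions are written against the old/new pair: replacing one program with another must not change the count, which only tracks devices transitioning between "no program" and "some program".

```c
/* Userspace model of the generic_xdp_install() accounting: the key
 * counts attached devices, so it changes only on NULL -> prog and
 * prog -> NULL transitions, never on prog -> prog replacement. */
static int generic_xdp_needed;	/* stand-in for the static_key */

static void install_prog(const void **slot, const void *new)
{
	const void *old = *slot;

	*slot = new;			/* rcu_assign_pointer() in the kernel */
	if (old && !new)
		generic_xdp_needed--;	/* static_key_slow_dec() */
	else if (new && !old)
		generic_xdp_needed++;	/* static_key_slow_inc() */
}
```

With the key at zero, the `static_key_false()` test in the receive path compiles down to a no-op branch, which is the whole point of the construct.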
Re: [PATCH v3 net-next RFC] Generic XDP
On Wed, 2017-04-12 at 21:20 -0700, Alexei Starovoitov wrote: > > + if (skb_linearize(skb)) > > + goto do_drop; > > when we discussed supporting jumbo frames in XDP, the idea > was that the program would need to look at first 3+k bytes only > and the rest of the packet will be in non-contiguous pages. > If we do that, it means that XDP program would have to assume > that the packet is more than [data, data_end] and this range > only covers linear part. > If that's the future, we don't need to linearize the skb here > and can let the program access headlen only. I'm not sure how you think that would work - at least with our (wifi) driver, the headlen should be maybe ETH_HLEN or so at this point. We'd let the program know that it can only look at so much, but then the program can't do anything at all with those frames. At some point then we go back to bpf_skb_load_bytes() being necessary in one form or another, no? johannes
Re: [PATCH v3 net-next RFC] Generic XDP
On Wed, Apr 12, 2017 at 02:54:15PM -0400, David Miller wrote: > > This provides a generic SKB based non-optimized XDP path which is used > if either the driver lacks a specific XDP implementation, or the user > requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE. > > It is arguable that perhaps I should have required something like > this as part of the initial XDP feature merge. > > I believe this is critical for two reasons: > > 1) Accessibility. More people can play with XDP with less > dependencies. Yes I know we have XDP support in virtio_net, but > that just creates another dependency for learning how to use this > facility. > > I wrote this to make life easier for the XDP newbies. > > 2) As a model for what the expected semantics are. If there is a pure > generic core implementation, it serves as a semantic example for > driver folks adding XDP support. > > This is just a rough draft and is untested. > > One thing I have not tried to address here is the issue of > XDP_PACKET_HEADROOM, thanks to Daniel for spotting that. It seems > incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or > whatever even if the XDP program doesn't try to push headers at all. > I think we really need the verifier to somehow propagate whether > certain XDP helpers are used or not. Looks like we need to relax the headroom requirement. I really wanted to simplify the life of program writers, but intel drivers insist on 192 headroom already, then for skb every driver does 64 and netronome doesn't have the room by default at all even for XDP and relies on expensive copy when xdp_adjust_head is used so that dream isn't going to come true. I guess for now every driver should _try_ to give XDP_PACKET_HEADROOM bytes, but the driver can omit it completely if xdp_adjust_head() is not used for performance reasons. 
> +static inline bool netif_elide_gro(const struct net_device *dev) > +{ > + if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog) > + return true; > + return false; > +} I think that's the right call. I've been thinking back and forth about it. On one side it's not cool to disable it and not inform the user about it in ethtool, so it might feel that doing ethtool_set_one_feature(~GRO) may be cleaner, but then we'd need to remember the current gro on/off status and restore that after the prog is detached. And while the prog is attached the user shouldn't be able to change gro via ethtool -K, so we'd need an extra boolean flag anyway. If we go with this netif_elide_gro() approach, we don't need to mess with ndo_fix_features() and only need to hack ethtool_get_features() to report GRO disabled if (dev->xdp_prog) and mark it as [fixed]. So overall imo it's cleaner than messing with dev->features directly while attaching/detaching the prog. > + if (skb_linearize(skb)) > + goto do_drop; when we discussed supporting jumbo frames in XDP, the idea was that the program would need to look at first 3+k bytes only and the rest of the packet will be in non-contiguous pages. If we do that, it means that XDP program would have to assume that the packet is more than [data, data_end] and this range only covers linear part. If that's the future, we don't need to linearize the skb here and can let the program access headlen only. It means that we'd need to add 'len' field to 'struct xdp_md' uapi. For our existing programs (ddos and l4lb) it's not a problem, since for MTU < 3.xK and physical XDP the driver guarantees that data_end - data == len, so nothing will break. So I'm proposing to do two things: - drop skb_linearize here - introduce 'len' to 'struct xdp_md' and update it here and in the existing drivers that support xdp. 
> + if (act == XDP_TX) > + dev_queue_xmit(skb); this will go through qdisc which is not a problem per-se, but may mislead poor users that XDP_TX goes through qdisc even for in-driver XDP which is not the case. So I think we need to bypass qdisc somehow. Like txq = netdev_pick_tx(skb,.. HARD_TX_LOCK(dev, txq.. if (!netif_xmit_stopped(txq)) { dev_hard_start_xmit(skb, dev, txq, } else { kfree_skb(skb); txq->xdp_tx_full++; } HARD_TX_UNLOCK(dev, txq); this way it will be similar to in-driver XDP which also has xdp_tx_full counter when HW TX queue is full. Re: csum and skb->encapsulate issues that were raised in the previous thread Today all cls_bpf csum helpers are currently disabled for XDP and if the program mangles the packet and then does XDP_PASS, the packet will be dropped by the stack due to incorrect csum. So we're no better here and need a solution for both in-driver XDP and generic XDP. I think the green light to apply this patch will be when samples/bpf/xdp1, xdp2 and xdp_tx_iptunnel work just like they do in in-driver XDP and I think we're pretty close. If desired I can start hacking on this patch and testing mid next week.
Re: [PATCH v3 net-next RFC] Generic XDP
On 4/12/17 8:08 PM, David Miller wrote: > From: David Ahern > Date: Wed, 12 Apr 2017 13:54:20 -0600 > >> packet passed to xdp seems to be messed up > Ok, the problem is that skb->mac_len isn't set properly at this point. > That doesn't happen until __netif_receive_skb_core(). > > I'm also not setting xdp.data_hard_start properly. > > It should work better with this small diff: it does.
Re: [PATCH v3 net-next RFC] Generic XDP
From: David Ahern Date: Wed, 12 Apr 2017 13:54:20 -0600 > packet passed to xdp seems to be messed up Ok, the problem is that skb->mac_len isn't set properly at this point. That doesn't happen until __netif_receive_skb_core(). I'm also not setting xdp.data_hard_start properly. It should work better with this small diff: diff --git a/net/core/dev.c b/net/core/dev.c index 9ed4569..d36ae8f 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4289,6 +4289,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, u32 act = XDP_DROP; void *orig_data; int hlen, off; + u32 mac_len; if (skb_linearize(skb)) goto do_drop; @@ -4296,10 +4297,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb, /* The XDP program wants to see the packet starting at the MAC * header. */ - hlen = skb_headlen(skb) + skb->mac_len; - xdp.data = skb->data - skb->mac_len; + mac_len = skb->data - skb_mac_header(skb); + hlen = skb_headlen(skb) + mac_len; + xdp.data = skb->data - mac_len; xdp.data_end = xdp.data + hlen; - xdp.data_hard_start = xdp.data - skb_headroom(skb); + xdp.data_hard_start = skb->data - skb_headroom(skb); orig_data = xdp.data; act = bpf_prog_run_xdp(xdp_prog, &xdp);
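The pointer math in this diff can be sanity-checked in userspace. A sketch with a fake skb (hypothetical struct, not the kernel one): mac_len is derived from the actual MAC header position rather than the not-yet-set skb->mac_len, and data_hard_start lands at the start of the buffer.

```c
/* Userspace model of the fixed xdp_buff setup: skb->data sits past
 * the MAC header this early in the receive path, so back up to the
 * real MAC header and anchor data_hard_start at the buffer start. */
struct fake_skb {
	unsigned char *buf;		/* start of allocation (headroom base) */
	unsigned char *data;		/* points past the MAC header */
	unsigned char *mac_header;
	unsigned int headlen;		/* bytes in the linear area */
};

struct fake_xdp {
	unsigned char *data, *data_end, *data_hard_start;
};

static void setup_xdp(struct fake_xdp *xdp, const struct fake_skb *skb)
{
	unsigned long mac_len = skb->data - skb->mac_header;
	unsigned long headroom = skb->data - skb->buf;

	xdp->data = skb->data - mac_len;		/* MAC header */
	xdp->data_end = xdp->data + skb->headlen + mac_len;
	xdp->data_hard_start = skb->data - headroom;	/* == skb->buf */
}
```

With a 14-byte Ethernet header the program sees exactly headlen + 14 bytes starting at the MAC header, matching what in-driver XDP would present.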
Re: [PATCH v3 net-next RFC] Generic XDP
From: Stephen Hemminger Date: Wed, 12 Apr 2017 14:30:37 -0700 > On Wed, 12 Apr 2017 14:54:15 -0400 (EDT) > David Miller wrote: > >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h >> index b0aa089..071a58b 100644 >> --- a/include/linux/netdevice.h >> +++ b/include/linux/netdevice.h >> @@ -1891,9 +1891,17 @@ struct net_device { >> struct lock_class_key *qdisc_tx_busylock; >> struct lock_class_key *qdisc_running_key; >> bool proto_down; >> +struct bpf_prog __rcu *xdp_prog; > > It would be good if all devices could reuse this for the xdp_prog pointer. > It would allow for introspection utility functions in the future. We plan to do so.
Re: [PATCH v3 net-next RFC] Generic XDP
From: Eric Dumazet Date: Wed, 12 Apr 2017 14:49:53 -0700 > On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote: >> On Wed, 12 Apr 2017 14:54:15 -0400 (EDT) >> David Miller wrote: >> >> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h >> > index b0aa089..071a58b 100644 >> > --- a/include/linux/netdevice.h >> > +++ b/include/linux/netdevice.h >> > @@ -1891,9 +1891,17 @@ struct net_device { >> >struct lock_class_key *qdisc_tx_busylock; >> >struct lock_class_key *qdisc_running_key; >> >bool proto_down; >> > + struct bpf_prog __rcu *xdp_prog; >> >> It would be good if all devices could reuse this for the xdp_prog pointer. >> It would allow for introspection utility functions in the future. > > Problem is that some xdp usages were envisioning a per RX queue xdp > program. True, but that hasn't materialized yet so designing for it so soon doesn't make a lot of sense.
Re: [PATCH v3 net-next RFC] Generic XDP
On Wed, 2017-04-12 at 14:30 -0700, Stephen Hemminger wrote: > On Wed, 12 Apr 2017 14:54:15 -0400 (EDT) > David Miller wrote: > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index b0aa089..071a58b 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -1891,9 +1891,17 @@ struct net_device { > > struct lock_class_key *qdisc_tx_busylock; > > struct lock_class_key *qdisc_running_key; > > bool proto_down; > > + struct bpf_prog __rcu *xdp_prog; > > It would be good if all devices could reuse this for the xdp_prog pointer. > It would allow for introspection utility functions in the future. Problem is that some xdp usages were envisioning a per RX queue xdp program.
Re: [PATCH v3 net-next RFC] Generic XDP
On Wed, 12 Apr 2017 14:54:15 -0400 (EDT) David Miller wrote: > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index b0aa089..071a58b 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -1891,9 +1891,17 @@ struct net_device { > struct lock_class_key *qdisc_tx_busylock; > struct lock_class_key *qdisc_running_key; > bool proto_down; > + struct bpf_prog __rcu *xdp_prog; It would be good if all devices could reuse this for the xdp_prog pointer. It would allow for introspection utility functions in the future.
Re: [PATCH v3 net-next RFC] Generic XDP
On 4/12/17 12:54 PM, David Miller wrote: > > This provides a generic SKB based non-optimized XDP path which is used > if either the driver lacks a specific XDP implementation, or the user > requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE. > > It is arguable that perhaps I should have required something like > this as part of the initial XDP feature merge. > > I believe this is critical for two reasons: > > 1) Accessibility. More people can play with XDP with less > dependencies. Yes I know we have XDP support in virtio_net, but > that just creates another dependency for learning how to use this > facility. > > I wrote this to make life easier for the XDP newbies. > > 2) As a model for what the expected semantics are. If there is a pure > generic core implementation, it serves as a semantic example for > driver folks adding XDP support. > > This is just a rough draft and is untested. The packet passed to xdp seems to be messed up. Using samples/bpf/xdp1*.c as an example: Current with virtio-net: # xdp1 3 proto 0: 0 pkt/s proto 6: 2 pkt/s proto 17: 1610 pkt/s proto 17: 22785 pkt/s Using the skb-mode: # xdp1 3 skb-mode proto 0: 2 pkt/s proto 0: 10880 pkt/s proto 0: 29990 pkt/s proto 0: 16972 pkt/s If I change ipproto in samples/bpf/xdp1_kern.c: if (h_proto == htons(ETH_P_IP)) ipproto = parse_ipv4(data, nh_off, data_end); else if (h_proto == htons(ETH_P_IPV6)) ipproto = parse_ipv6(data, nh_off, data_end); else ipproto = 63; I get: # xdp1 3 skb-mode proto 63: 6538 pkt/s proto 63: 23679 pkt/s proto 63: 29974 pkt/s proto 63: 29599 pkt/s which suggests h_proto is goofed up with skb-mode.
[PATCH v3 net-next RFC] Generic XDP
This provides a generic SKB based non-optimized XDP path which is used if either the driver lacks a specific XDP implementation, or the user requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE. It is arguable that perhaps I should have required something like this as part of the initial XDP feature merge. I believe this is critical for two reasons: 1) Accessibility. More people can play with XDP with less dependencies. Yes I know we have XDP support in virtio_net, but that just creates another dependency for learning how to use this facility. I wrote this to make life easier for the XDP newbies. 2) As a model for what the expected semantics are. If there is a pure generic core implementation, it serves as a semantic example for driver folks adding XDP support. This is just a rough draft and is untested. One thing I have not tried to address here is the issue of XDP_PACKET_HEADROOM, thanks to Daniel for spotting that. It seems incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or whatever even if the XDP program doesn't try to push headers at all. I think we really need the verifier to somehow propagate whether certain XDP helpers are used or not. Signed-off-by: David S. Miller --- v3: - Make sure XDP program sees packet at MAC header, push back MAC header if we do XDP_TX. (Alexei) - Elide GRO when generic XDP is in use. (Alexei) - Add XDP_FLAGS_SKB_MODE flag which the user can use to request generic XDP even if the driver has an XDP implementation. (Alexei) - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS attribute. (Daniel) v2: - Add some "fall through" comments in switch statements based upon feedback from Andrew Lunn - Use RCU for generic xdp_prog, thanks to Johannes Berg. 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0aa089..071a58b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1891,9 +1891,17 @@ struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	struct lock_class_key	*qdisc_running_key;
 	bool			proto_down;
+	struct bpf_prog __rcu	*xdp_prog;
 };
 
 #define	to_net_dev(d) container_of(d, struct net_device, dev)
 
+static inline bool netif_elide_gro(const struct net_device *dev)
+{
+	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
+		return true;
+	return false;
+}
+
 #define	NETDEV_ALIGN		32
 
 static inline
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8b405af..633aa02 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -887,7 +887,9 @@ enum {
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
-#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST)
+#define XDP_FLAGS_SKB_MODE		(2U << 0)
+#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
+					 XDP_FLAGS_SKB_MODE)
 
 enum {
 	IFLA_XDP_UNSPEC,
diff --git a/net/core/dev.c b/net/core/dev.c
index ef9fe60e..9ed4569 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -95,6 +95,7 @@
 #include <...>
 #include <...>
 #include <...>
+#include <linux/bpf.h>
 #include <...>
 #include <...>
 #include <...>
@@ -4247,6 +4248,88 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static struct static_key generic_xdp_needed __read_mostly;
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	struct bpf_prog *new = xdp->prog;
+	int ret = 0;
+
+	switch (xdp->command) {
+	case XDP_SETUP_PROG: {
+		struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+
+		rcu_assign_pointer(dev->xdp_prog, new);
+		if (old)
+			bpf_prog_put(old);
+
+		if (old && !new)
+			static_key_slow_dec(&generic_xdp_needed);
+		else if (new && !old)
+			static_key_slow_inc(&generic_xdp_needed);
+		break;
+	}
+
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct bpf_prog *xdp_prog)
+{
+	struct xdp_buff xdp;
+	u32 act = XDP_DROP;
+	void *orig_data;
+	int hlen, off;
+
+	if (skb_linearize(skb))
+		goto do_drop;
+
+	/* The XDP program wants to see the packet starting at the MAC
+	 * header.
+	 */
+	hlen = skb_headlen(skb) + skb->mac_len;
+	xdp.data = skb->data - skb->mac_len;
+	xdp.data_end = xdp.data + hlen;
+	xdp.data_hard_start = xdp.data - skb_headroom(skb);
+	orig_data = xdp.data;
+
+	act =