Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Sun, 16 Jul 2017 10:23:08 +0200 Jesper Dangaard Brouerwrote: > On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabend > wrote: > > [...] > > > > hmm maybe Jesse or Alex have some clues. Adding them to the CC list. > > This seems related to Hyper-Threading. I tried to disable > hyperthreading in the BIOS, and the problem goes away. That is, the > benchmarks are no-longer affected by the CPU tuned-adm profile. Wow, nice job figuring that out. I went and took a look at tuned-adm documentation but didn't see anything that stood out that would cause the behavior you were seeing. I suspect your toplev results showing this is a frontend problem mesh nicely with the hyper-threading configuration inducing the bad behavior, since there is still only one execution unit, and the fetchers would have to stall. I think you've rediscovered why the forwarding / routing crowd generally turns off hyperthreading.
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Tue, 11 Jul 2017 12:37:10 -0700 John Fastabendwrote: > > I have a really strange observation... if I change the CPU powersave > > settings, then the xdp_redirect_map performance drops in half! Above > > was with "tuned-adm profile powersave" (because, this is a really noisy > > server, and I'm sitting next to it). I can see that the CPU under-load > > goes into "turbomode", rest going into low-power, including the > > Hyper-thread siblings. > > > > If I change the profile to: # tuned-adm profile network-latency > > > > ifindex 6: 12964879 pkt/s > > ifindex 6: 12964683 pkt/s > > ifindex 6: 12961497 pkt/s > > ifindex 6: 11779966 pkt/s <-- change to tuned-adm profile network-latency > > ifindex 6:6853959 pkt/s > > ifindex 6:6851120 pkt/s > > ifindex 6:6856934 pkt/s > > ifindex 6:6857344 pkt/s > > ifindex 6:6857161 pkt/s > > > > The CPU efficiency goes from 2.35 to 1.24 insn per cycle. > > > > John do you know some Intel people that could help me understand what > > is going on?!? This is very strange... > > > > I tried Andi's toplev tool, which AFAIK indicate that this is a > > Frontend problem, e.g. in decoding the instructions?!? > > > > hmm maybe Jesse or Alex have some clues. Adding them to the CC list. This seems related to Hyper-Threading. I tried to disable hyperthreading in the BIOS, and the problem goes away. That is, the benchmarks are no-longer affected by the CPU tuned-adm profile. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
From: John FastabendDate: Thu, 13 Jul 2017 10:00:15 -0700 > On 07/13/2017 09:16 AM, Jesper Dangaard Brouer wrote: >> On Thu, 13 Jul 2017 13:14:30 +0200 >> Jesper Dangaard Brouer wrote: >> >>> I'm still getting crashes (but much harder to provoke), but I figured >>> out why. We sort of missed one case, where map_to_flush gets set, when >>> the ndo_xdp_xmit() call starts to fail, and the ixgbe driver then >>> forgets to call xdp_do_flush_map, if all packets in that NAPI cycle >>> failed. We could blame the driver, but yhe clean solution is making >>> sure, that we don't set map_to_flush when the __bpf_tx_xdp() call >>> fails. It should also handle the other case I fixed I'll cleanup >>> my PoC-fix patch, test it and provide it here. >> >> I changed flow in the function to be: > > > Great, I'll merge this, the other couple fixes, and the bitops optimization > and > hopefully then we are set. I'll post a v2 and we can do some final checks. I am so looking forward to merging this, great work everyone.
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/13/2017 09:16 AM, Jesper Dangaard Brouer wrote: > On Thu, 13 Jul 2017 13:14:30 +0200 > Jesper Dangaard Brouerwrote: > >> I'm still getting crashes (but much harder to provoke), but I figured >> out why. We sort of missed one case, where map_to_flush gets set, when >> the ndo_xdp_xmit() call starts to fail, and the ixgbe driver then >> forgets to call xdp_do_flush_map, if all packets in that NAPI cycle >> failed. We could blame the driver, but yhe clean solution is making >> sure, that we don't set map_to_flush when the __bpf_tx_xdp() call >> fails. It should also handle the other case I fixed I'll cleanup >> my PoC-fix patch, test it and provide it here. > > I changed flow in the function to be: Great, I'll merge this, the other couple fixes, and the bitops optimization and hopefully then we are set. I'll post a v2 and we can do some final checks. Thanks! John > > int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, > struct bpf_prog *xdp_prog) > { > struct redirect_info *ri = this_cpu_ptr(_info); > struct bpf_map *map = ri->map; > u32 index = ri->ifindex; > struct net_device *fwd; > int err = -EINVAL; > > ri->ifindex = 0; > ri->map = NULL; > > fwd = __dev_map_lookup_elem(map, index); > if (!fwd) > goto out; > > if (ri->map_to_flush && (ri->map_to_flush != map)) > xdp_do_flush_map(); > > err = __bpf_tx_xdp(fwd, map, xdp, index); > if (likely(!err)) > ri->map_to_flush = map; > > out: > trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); > return err; > } > > > The diff is: > > diff --git a/net/core/filter.c b/net/core/filter.c > index 4ca895d6ed51..c50a7ec2cdab 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -2483,26 +2483,25 @@ int xdp_do_redirect_map(struct net_device *dev, > struct xdp_buff *xdp, > struct bpf_map *map = ri->map; > u32 index = ri->ifindex; > struct net_device *fwd; > + int err = -EINVAL; > + > + ri->ifindex = 0; > + ri->map = NULL; > > fwd = __dev_map_lookup_elem(map, index); > if (!fwd) > goto out; > > - ri->ifindex = 0; > - > if (ri->map_to_flush && (ri->map_to_flush != map)) > xdp_do_flush_map(); > > - ri->map_to_flush = map; > - ri->map = NULL; > + err = __bpf_tx_xdp(fwd, map, xdp, index); > + if (likely(!err)) > + ri->map_to_flush = map; > > - trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); > - > - return __bpf_tx_xdp(fwd, map, xdp, index); > out: > - ri->ifindex = 0; > - ri->map = NULL; > - return -EINVAL; > + trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); > + return err; > } > > int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, >
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Thu, 13 Jul 2017 13:14:30 +0200 Jesper Dangaard Brouerwrote: > I'm still getting crashes (but much harder to provoke), but I figured > out why. We sort of missed one case, where map_to_flush gets set, when > the ndo_xdp_xmit() call starts to fail, and the ixgbe driver then > forgets to call xdp_do_flush_map, if all packets in that NAPI cycle > failed. We could blame the driver, but yhe clean solution is making > sure, that we don't set map_to_flush when the __bpf_tx_xdp() call > fails. It should also handle the other case I fixed I'll cleanup > my PoC-fix patch, test it and provide it here. I changed flow in the function to be: int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, struct bpf_prog *xdp_prog) { struct redirect_info *ri = this_cpu_ptr(_info); struct bpf_map *map = ri->map; u32 index = ri->ifindex; struct net_device *fwd; int err = -EINVAL; ri->ifindex = 0; ri->map = NULL; fwd = __dev_map_lookup_elem(map, index); if (!fwd) goto out; if (ri->map_to_flush && (ri->map_to_flush != map)) xdp_do_flush_map(); err = __bpf_tx_xdp(fwd, map, xdp, index); if (likely(!err)) ri->map_to_flush = map; out: trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); return err; } The diff is: diff --git a/net/core/filter.c b/net/core/filter.c index 4ca895d6ed51..c50a7ec2cdab 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2483,26 +2483,25 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, struct bpf_map *map = ri->map; u32 index = ri->ifindex; struct net_device *fwd; + int err = -EINVAL; + + ri->ifindex = 0; + ri->map = NULL; fwd = __dev_map_lookup_elem(map, index); if (!fwd) goto out; - ri->ifindex = 0; - if (ri->map_to_flush && (ri->map_to_flush != map)) xdp_do_flush_map(); - ri->map_to_flush = map; - ri->map = NULL; + err = __bpf_tx_xdp(fwd, map, xdp, index); + if (likely(!err)) + ri->map_to_flush = map; - trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); - - return __bpf_tx_xdp(fwd, map, xdp, index); out: - ri->ifindex = 0; - ri->map = NULL; - return -EINVAL; + trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); + return err; } int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp, -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Tue, 11 Jul 2017 11:26:54 -0700 John Fastabendwrote: > On 07/11/2017 07:23 AM, Jesper Dangaard Brouer wrote: > > On Mon, 10 Jul 2017 17:59:17 -0700 > > John Fastabend wrote: > > > >> On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote: > >>> On Sat, 8 Jul 2017 21:06:17 +0200 > >>> Jesper Dangaard Brouer wrote: > >>> > On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) > David Miller wrote: > > > From: John Fastabend > > Date: Fri, 07 Jul 2017 10:48:36 -0700 > > > >> On 07/07/2017 10:34 AM, John Fastabend wrote: > >>> This series adds two new XDP helper routines bpf_redirect() and > >>> bpf_redirect_map(). The first variant bpf_redirect() is meant > >>> to be used the same way it is currently being used by the cls_bpf > >>> classifier. An xdp packet will be redirected immediately when this > >>> is called. > >> > >> Also other than the typo in the title there ;) I'm going to CC > >> the driver maintainers working on XDP (makes for a long CC list but) > >> because we would want to try and get support in as many as possible in > >> the next merge window. > >> > >> For this rev I just implemented on ixgbe because I wrote the > >> original XDP support there. I'll volunteer to do virtio as well. > >> > > > > I went over this series a few times and it looks great to me. > > You didn't even give me some coding style issues to pick on :-) > > We (Daniel, Andy and I) have been reviewing and improving on this > patchset the last couple of weeks ;-). We had some stability issues, > which is why it wasn't published earlier. My plan is to test this > latest patchset again, Monday and Tuesday. I'll try to assess stability > and provide some performance numbers. > >>> > >>> > >>> Damn, I though it was stable, I have been running a lot of performance > >>> tests, and then this just happened :-( > >> > >> Thanks, I'll take a look through the code and see if I can come up with > >> why this might happen. I haven't hit it on my tests yet though. > > > > I've figured out why this happens, and I have a fix, see patch below > > with some comments with questions. > > > > Awesome! > > > The problem is that we can leak map_to_flush in an error path, the fix: > > > > diff --git a/net/core/filter.c b/net/core/filter.c > > index 2ccd6ff09493..7f1f48668dcf 100644 > > --- a/net/core/filter.c > > +++ b/net/core/filter.c > > @@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, > > struct xdp_buff *xdp, > > ri->map = NULL; > > > > trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); > > - > > + // Q: Should we also trace "goto out" (failed lookup)? > > + //like bpf_warn_invalid_xdp_redirect(); > > Maybe another trace event? trace_xdp_redirect_failed() > > > return __bpf_tx_xdp(fwd, map, xdp, index); > > out: > > ri->ifindex = 0; > > - ri->map = NULL; > > + // XXX: here we could leak ri->map_to_flush, which could be > > + // picked up later by xdp_do_flush_map() > > + xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */ > > +1 > > ah map lookup failed and we need to do the flush nice catch. I'm still getting crashes (but much harder to provoke), but I figured out why. We sort of missed one case, where map_to_flush gets set, when the ndo_xdp_xmit() call starts to fail, and the ixgbe driver then forgets to call xdp_do_flush_map, if all packets in that NAPI cycle failed. We could blame the driver, but yhe clean solution is making sure, that we don't set map_to_flush when the __bpf_tx_xdp() call fails. It should also handle the other case I fixed I'll cleanup my PoC-fix patch, test it and provide it here. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/11/2017 12:19 PM, Jesper Dangaard Brouer wrote: > On Tue, 11 Jul 2017 11:56:21 -0700 > John Fastabendwrote: > >> On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote: >>> On Tue, 11 Jul 2017 20:01:36 +0200 >>> Jesper Dangaard Brouer wrote: >>> On Tue, 11 Jul 2017 10:48:29 -0700 John Fastabend wrote: > On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: >> On Sat, 8 Jul 2017 21:06:17 +0200 >> Jesper Dangaard Brouer wrote: >> >>> My plan is to test this latest patchset again, Monday and Tuesday. >>> I'll try to assess stability and provide some performance numbers. >> >> Performance numbers: >> >> 14378479 pkt/s = XDP_DROP without touching memory >> 9222401 pkt/s = xdp1: XDP_DROP with reading packet data >> 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) >> 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate >> XDP_TX) >> 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap >> >> The performance drop between xdp2 and xdp_redirect, was expected due >> to the HW-tailptr flush per packet, which is costly. >> >> (1/6344472-1/4595574)*10^9 = -59.98 ns >> >> The performance drop between xdp2 and xdp_redirect_map, is higher than >> I expected, which is not good! The avoidance of the tailptr flush per >> packet was expected to give a higher boost. The cost increased with >> 40 ns, which is too high compared to the code added (on a 4GHz machine >> approx 160 cycles). >> >> (1/6344472-1/5066243)*10^9 = -39.77 ns >> >> This system doesn't have DDIO, thus we are stalling on cache-misses, >> but I was actually expecting that the added code could "hide" behind >> these cache-misses. >> >> I'm somewhat surprised to see this large a performance drop. >> > > Yep, although there is room for optimizations in the code path for sure. > And > 5mpps is not horrible my preference is to get this series in plus any > small optimization we come up with while the merge window is closed. Then > follow up patches can do optimizations. IMHO 5Mpps is a very bad number for XDP. > One easy optimization is to get rid of the atomic bitops. They are not > needed > here we have a per cpu unsigned long. Another easy one would be to move > some of the checks out of the hotpath. For example checking for > ndo_xdp_xmit > and flush ops on the net device in the hotpath really should be done in > the > slow path. I'm already running with a similar patch as below, but it (surprisingly) only gave my 3 ns improvement. I also tried a prefetchw() on xdp.data that gave me 10 ns (which is quite good). I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which have DDIO ... I have high hopes for this, as the major bottleneck on this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. Something is definitely wrong on this CPU, as perf stats shows, a very bad utilization of the CPU pipeline with 0.89 insn per cycle. >>> >>> Wow, getting DDIO working and avoiding the cache-miss, was really >>> _the_ issue. On this CPU E5-1650 v4 @ 3.60GHz things look really >>> really good for XDP_REDIRECT with maps. (p.s. with __set_bit() >>> optimization) >>> >> >> Very nice :) this was with the prefecthw() removed right? > > Yes, prefetchw removed. Great, I wanted to avoid playing prefetch games for as long as possible. At least in this initial submission. > >>> 13,939,674 pkt/s = XDP_DROP without touching memory >>> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data >>> 13,221,812 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) >>> 7,596,576 pkt/s = xdp_redirect:XDP_REDIRECT with swap mac (like XDP_TX) >>> 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap >>> >>> Surprisingly, on this DDIO capable CPU it is slightly slower NOT to >>> read packet memory. >>> >>> The large performance gap to xdp_redirect is due to the tailptr flush, >>> which really show up on this system. The CPU efficiency is 1.36 insn >>> per cycle, which for map variant is 2.15 insn per cycle. >>> >>> Gap (1/13221812-1/7596576)*10^9 = -56 ns >>> >>> The xdp_redirect_map performance is really really good, almost 10G >>> wirespeed on a single CPU!!! This is amazing, and we know that this >>> code is not even optimal yet. The performance difference to xdp2 is >>> only around 1 ns. >>> >> >> Great, yeah there are some more likely()/unlikely() hints we could add and >> also remove some of the checks in the hotpath, etc. > > Yes, plus inlining some function call. > Yep. >> Thanks for doing this! > > I have a really strange observation... if I change the
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Tue, 11 Jul 2017 11:56:21 -0700 John Fastabendwrote: > On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote: > > On Tue, 11 Jul 2017 20:01:36 +0200 > > Jesper Dangaard Brouer wrote: > > > >> On Tue, 11 Jul 2017 10:48:29 -0700 > >> John Fastabend wrote: > >> > >>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: > On Sat, 8 Jul 2017 21:06:17 +0200 > Jesper Dangaard Brouer wrote: > > > My plan is to test this latest patchset again, Monday and Tuesday. > > I'll try to assess stability and provide some performance numbers. > > Performance numbers: > > 14378479 pkt/s = XDP_DROP without touching memory > 9222401 pkt/s = xdp1: XDP_DROP with reading packet data > 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate > XDP_TX) > 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap > > The performance drop between xdp2 and xdp_redirect, was expected due > to the HW-tailptr flush per packet, which is costly. > > (1/6344472-1/4595574)*10^9 = -59.98 ns > > The performance drop between xdp2 and xdp_redirect_map, is higher than > I expected, which is not good! The avoidance of the tailptr flush per > packet was expected to give a higher boost. The cost increased with > 40 ns, which is too high compared to the code added (on a 4GHz machine > approx 160 cycles). > > (1/6344472-1/5066243)*10^9 = -39.77 ns > > This system doesn't have DDIO, thus we are stalling on cache-misses, > but I was actually expecting that the added code could "hide" behind > these cache-misses. > > I'm somewhat surprised to see this large a performance drop. > > >>> > >>> Yep, although there is room for optimizations in the code path for sure. > >>> And > >>> 5mpps is not horrible my preference is to get this series in plus any > >>> small optimization we come up with while the merge window is closed. Then > >>> follow up patches can do optimizations. > >> > >> IMHO 5Mpps is a very bad number for XDP. > >> > >>> One easy optimization is to get rid of the atomic bitops. They are not > >>> needed > >>> here we have a per cpu unsigned long. Another easy one would be to move > >>> some of the checks out of the hotpath. For example checking for > >>> ndo_xdp_xmit > >>> and flush ops on the net device in the hotpath really should be done in > >>> the > >>> slow path. > >> > >> I'm already running with a similar patch as below, but it > >> (surprisingly) only gave my 3 ns improvement. I also tried a > >> prefetchw() on xdp.data that gave me 10 ns (which is quite good). > >> > >> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which > >> have DDIO ... I have high hopes for this, as the major bottleneck on > >> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. > >> > >> Something is definitely wrong on this CPU, as perf stats shows, a very > >> bad utilization of the CPU pipeline with 0.89 insn per cycle. > > > > Wow, getting DDIO working and avoiding the cache-miss, was really > > _the_ issue. On this CPU E5-1650 v4 @ 3.60GHz things look really > > really good for XDP_REDIRECT with maps. (p.s. with __set_bit() > > optimization) > > > > Very nice :) this was with the prefecthw() removed right? Yes, prefetchw removed. > > 13,939,674 pkt/s = XDP_DROP without touching memory > > 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data > > 13,221,812 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > > 7,596,576 pkt/s = xdp_redirect:XDP_REDIRECT with swap mac (like XDP_TX) > > 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap > > > > Surprisingly, on this DDIO capable CPU it is slightly slower NOT to > > read packet memory. > > > > The large performance gap to xdp_redirect is due to the tailptr flush, > > which really show up on this system. The CPU efficiency is 1.36 insn > > per cycle, which for map variant is 2.15 insn per cycle. > > > > Gap (1/13221812-1/7596576)*10^9 = -56 ns > > > > The xdp_redirect_map performance is really really good, almost 10G > > wirespeed on a single CPU!!! This is amazing, and we know that this > > code is not even optimal yet. The performance difference to xdp2 is > > only around 1 ns. > > > > Great, yeah there are some more likely()/unlikely() hints we could add and > also remove some of the checks in the hotpath, etc. Yes, plus inlining some function call. > Thanks for doing this! I have a really strange observation... if I change the CPU powersave settings, then the xdp_redirect_map performance drops in half! Above was with "tuned-adm profile powersave" (because, this is a really noisy server, and I'm sitting next to it). I
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote: > On Tue, 11 Jul 2017 20:01:36 +0200 > Jesper Dangaard Brouerwrote: > >> On Tue, 11 Jul 2017 10:48:29 -0700 >> John Fastabend wrote: >> >>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: On Sat, 8 Jul 2017 21:06:17 +0200 Jesper Dangaard Brouer wrote: > My plan is to test this latest patchset again, Monday and Tuesday. > I'll try to assess stability and provide some performance numbers. Performance numbers: 14378479 pkt/s = XDP_DROP without touching memory 9222401 pkt/s = xdp1: XDP_DROP with reading packet data 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate XDP_TX) 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap The performance drop between xdp2 and xdp_redirect, was expected due to the HW-tailptr flush per packet, which is costly. (1/6344472-1/4595574)*10^9 = -59.98 ns The performance drop between xdp2 and xdp_redirect_map, is higher than I expected, which is not good! The avoidance of the tailptr flush per packet was expected to give a higher boost. The cost increased with 40 ns, which is too high compared to the code added (on a 4GHz machine approx 160 cycles). (1/6344472-1/5066243)*10^9 = -39.77 ns This system doesn't have DDIO, thus we are stalling on cache-misses, but I was actually expecting that the added code could "hide" behind these cache-misses. I'm somewhat surprised to see this large a performance drop. >>> >>> Yep, although there is room for optimizations in the code path for sure. And >>> 5mpps is not horrible my preference is to get this series in plus any >>> small optimization we come up with while the merge window is closed. Then >>> follow up patches can do optimizations. >> >> IMHO 5Mpps is a very bad number for XDP. >> >>> One easy optimization is to get rid of the atomic bitops. They are not >>> needed >>> here we have a per cpu unsigned long. Another easy one would be to move >>> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit >>> and flush ops on the net device in the hotpath really should be done in the >>> slow path. >> >> I'm already running with a similar patch as below, but it >> (surprisingly) only gave my 3 ns improvement. I also tried a >> prefetchw() on xdp.data that gave me 10 ns (which is quite good). >> >> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which >> have DDIO ... I have high hopes for this, as the major bottleneck on >> this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. >> >> Something is definitely wrong on this CPU, as perf stats shows, a very >> bad utilization of the CPU pipeline with 0.89 insn per cycle. > > Wow, getting DDIO working and avoiding the cache-miss, was really > _the_ issue. On this CPU E5-1650 v4 @ 3.60GHz things look really > really good for XDP_REDIRECT with maps. (p.s. with __set_bit() > optimization) > Very nice :) this was with the prefecthw() removed right? > 13,939,674 pkt/s = XDP_DROP without touching memory > 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data > 13,221,812 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > 7,596,576 pkt/s = xdp_redirect:XDP_REDIRECT with swap mac (like XDP_TX) > 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap > > Surprisingly, on this DDIO capable CPU it is slightly slower NOT to > read packet memory. > > The large performance gap to xdp_redirect is due to the tailptr flush, > which really show up on this system. The CPU efficiency is 1.36 insn > per cycle, which for map variant is 2.15 insn per cycle. > > Gap (1/13221812-1/7596576)*10^9 = -56 ns > > The xdp_redirect_map performance is really really good, almost 10G > wirespeed on a single CPU!!! This is amazing, and we know that this > code is not even optimal yet. The performance difference to xdp2 is > only around 1 ns. > Great, yeah there are some more likely()/unlikely() hints we could add and also remove some of the checks in the hotpath, etc. Thanks for doing this!
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Tue, 11 Jul 2017 20:01:36 +0200 Jesper Dangaard Brouerwrote: > On Tue, 11 Jul 2017 10:48:29 -0700 > John Fastabend wrote: > > > On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: > > > On Sat, 8 Jul 2017 21:06:17 +0200 > > > Jesper Dangaard Brouer wrote: > > > > > >> My plan is to test this latest patchset again, Monday and Tuesday. > > >> I'll try to assess stability and provide some performance numbers. > > > > > > Performance numbers: > > > > > > 14378479 pkt/s = XDP_DROP without touching memory > > > 9222401 pkt/s = xdp1: XDP_DROP with reading packet data > > > 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > > > 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate > > > XDP_TX) > > > 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap > > > > > > The performance drop between xdp2 and xdp_redirect, was expected due > > > to the HW-tailptr flush per packet, which is costly. > > > > > > (1/6344472-1/4595574)*10^9 = -59.98 ns > > > > > > The performance drop between xdp2 and xdp_redirect_map, is higher than > > > I expected, which is not good! The avoidance of the tailptr flush per > > > packet was expected to give a higher boost. The cost increased with > > > 40 ns, which is too high compared to the code added (on a 4GHz machine > > > approx 160 cycles). > > > > > > (1/6344472-1/5066243)*10^9 = -39.77 ns > > > > > > This system doesn't have DDIO, thus we are stalling on cache-misses, > > > but I was actually expecting that the added code could "hide" behind > > > these cache-misses. > > > > > > I'm somewhat surprised to see this large a performance drop. > > > > > > > Yep, although there is room for optimizations in the code path for sure. And > > 5mpps is not horrible my preference is to get this series in plus any > > small optimization we come up with while the merge window is closed. Then > > follow up patches can do optimizations. > > IMHO 5Mpps is a very bad number for XDP. > > > One easy optimization is to get rid of the atomic bitops. They are not > > needed > > here we have a per cpu unsigned long. Another easy one would be to move > > some of the checks out of the hotpath. For example checking for ndo_xdp_xmit > > and flush ops on the net device in the hotpath really should be done in the > > slow path. > > I'm already running with a similar patch as below, but it > (surprisingly) only gave my 3 ns improvement. I also tried a > prefetchw() on xdp.data that gave me 10 ns (which is quite good). > > I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which > have DDIO ... I have high hopes for this, as the major bottleneck on > this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. > > Something is definitely wrong on this CPU, as perf stats shows, a very > bad utilization of the CPU pipeline with 0.89 insn per cycle. Wow, getting DDIO working and avoiding the cache-miss, was really _the_ issue. On this CPU E5-1650 v4 @ 3.60GHz things look really really good for XDP_REDIRECT with maps. (p.s. with __set_bit() optimization) 13,939,674 pkt/s = XDP_DROP without touching memory 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data 13,221,812 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) 7,596,576 pkt/s = xdp_redirect:XDP_REDIRECT with swap mac (like XDP_TX) 13,058,435 pkt/s = xdp_redirect_map:XDP_REDIRECT with swap mac + devmap Surprisingly, on this DDIO capable CPU it is slightly slower NOT to read packet memory. The large performance gap to xdp_redirect is due to the tailptr flush, which really show up on this system. The CPU efficiency is 1.36 insn per cycle, which for map variant is 2.15 insn per cycle. Gap (1/13221812-1/7596576)*10^9 = -56 ns The xdp_redirect_map performance is really really good, almost 10G wirespeed on a single CPU!!! This is amazing, and we know that this code is not even optimal yet. The performance difference to xdp2 is only around 1 ns. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer [jbrouer@E5-1650v4 bpf]$ sudo ./xdp1 6 proto 17: 11919302 pkt/s proto 17: 14281659 pkt/s proto 17: 14290650 pkt/s proto 17: 14283120 pkt/s proto 17: 14303023 pkt/s proto 17: 14299496 pkt/s [jbrouer@E5-1650v4 bpf]$ sudo ./xdp2 6 proto 0: 1 pkt/s proto 17:3225455 pkt/s proto 17: 13266772 pkt/s proto 17: 13221812 pkt/s proto 17: 1300 pkt/s proto 17: 13225508 pkt/s proto 17: 13226274 pkt/s [jbrouer@E5-1650v4 bpf]$ sudo ./xdp_redirect 6 6 ifindex 6: 66040 pkt/s ifindex 6:7029143 pkt/s ifindex 6:7596576 pkt/s ifindex 6:7598499 pkt/s ifindex 6:7597025 pkt/s ifindex 6:7598462 pkt/s [jbrouer@E5-1650v4 bpf]$ sudo ./xdp_redirect_map 6 6 map[0] (vports) = 4, map[1] (map) = 5, map[2] (count) = 0 ifindex 6:
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/11/2017 11:01 AM, Jesper Dangaard Brouer wrote: > On Tue, 11 Jul 2017 10:48:29 -0700 > John Fastabendwrote: > >> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: >>> On Sat, 8 Jul 2017 21:06:17 +0200 >>> Jesper Dangaard Brouer wrote: >>> My plan is to test this latest patchset again, Monday and Tuesday. I'll try to assess stability and provide some performance numbers. >>> >>> Performance numbers: >>> >>> 14378479 pkt/s = XDP_DROP without touching memory >>> 9222401 pkt/s = xdp1: XDP_DROP with reading packet data >>> 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) >>> 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate >>> XDP_TX) >>> 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap >>> >>> The performance drop between xdp2 and xdp_redirect, was expected due >>> to the HW-tailptr flush per packet, which is costly. >>> >>> (1/6344472-1/4595574)*10^9 = -59.98 ns >>> >>> The performance drop between xdp2 and xdp_redirect_map, is higher than >>> I expected, which is not good! The avoidance of the tailptr flush per >>> packet was expected to give a higher boost. The cost increased with >>> 40 ns, which is too high compared to the code added (on a 4GHz machine >>> approx 160 cycles). >>> >>> (1/6344472-1/5066243)*10^9 = -39.77 ns >>> >>> This system doesn't have DDIO, thus we are stalling on cache-misses, >>> but I was actually expecting that the added code could "hide" behind >>> these cache-misses. >>> >>> I'm somewhat surprised to see this large a performance drop. >>> >> >> Yep, although there is room for optimizations in the code path for sure. And >> 5mpps is not horrible my preference is to get this series in plus any >> small optimization we come up with while the merge window is closed. Then >> follow up patches can do optimizations. > > IMHO 5Mpps is a very bad number for XDP. > >> One easy optimization is to get rid of the atomic bitops. They are not needed >> here we have a per cpu unsigned long. Another easy one would be to move >> some of the checks out of the hotpath. For example checking for ndo_xdp_xmit >> and flush ops on the net device in the hotpath really should be done in the >> slow path. > > I'm already running with a similar patch as below, but it > (surprisingly) only gave my 3 ns improvement. I also tried a > prefetchw() on xdp.data that gave me 10 ns (which is quite good). > Ah OK good, do the above numbers use the both the bitops changes and the prefechw? > I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which > have DDIO ... I have high hopes for this, as the major bottleneck on > this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. > > Something is definitely wrong on this CPU, as perf stats shows, a very > bad utilization of the CPU pipeline with 0.89 insn per cycle. > Interesting, the E5-1650 numbers will be good to know. If you have the perf trace to posting might help track down some hot spots. .John
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/11/2017 07:23 AM, Jesper Dangaard Brouer wrote: > On Mon, 10 Jul 2017 17:59:17 -0700 > John Fastabendwrote: > >> On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote: >>> On Sat, 8 Jul 2017 21:06:17 +0200 >>> Jesper Dangaard Brouer wrote: >>> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) David Miller wrote: > From: John Fastabend > Date: Fri, 07 Jul 2017 10:48:36 -0700 > >> On 07/07/2017 10:34 AM, John Fastabend wrote: >>> This series adds two new XDP helper routines bpf_redirect() and >>> bpf_redirect_map(). The first variant bpf_redirect() is meant >>> to be used the same way it is currently being used by the cls_bpf >>> classifier. An xdp packet will be redirected immediately when this >>> is called. >> >> Also other than the typo in the title there ;) I'm going to CC >> the driver maintainers working on XDP (makes for a long CC list but) >> because we would want to try and get support in as many as possible in >> the next merge window. >> >> For this rev I just implemented on ixgbe because I wrote the >> original XDP support there. I'll volunteer to do virtio as well. > > I went over this series a few times and it looks great to me. > You didn't even give me some coding style issues to pick on :-) We (Daniel, Andy and I) have been reviewing and improving on this patchset the last couple of weeks ;-). We had some stability issues, which is why it wasn't published earlier. My plan is to test this latest patchset again, Monday and Tuesday. I'll try to assess stability and provide some performance numbers. >>> >>> >>> Damn, I though it was stable, I have been running a lot of performance >>> tests, and then this just happened :-( >> >> Thanks, I'll take a look through the code and see if I can come up with >> why this might happen. I haven't hit it on my tests yet though. > > I've figured out why this happens, and I have a fix, see patch below > with some comments with questions. > Awesome! > The problem is that we can leak map_to_flush in an error path, the fix: > > diff --git a/net/core/filter.c b/net/core/filter.c > index 2ccd6ff09493..7f1f48668dcf 100644 > --- a/net/core/filter.c > +++ b/net/core/filter.c > @@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, > struct xdp_buff *xdp, > ri->map = NULL; > > trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); > - > + // Q: Should we also trace "goto out" (failed lookup)? > + //like bpf_warn_invalid_xdp_redirect(); Maybe another trace event? trace_xdp_redirect_failed() > return __bpf_tx_xdp(fwd, map, xdp, index); > out: > ri->ifindex = 0; > - ri->map = NULL; > + // XXX: here we could leak ri->map_to_flush, which could be > + // picked up later by xdp_do_flush_map() > + xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */ +1 ah map lookup failed and we need to do the flush nice catch. > return -EINVAL; > > > While debugging this, I noticed that we can have packets in-flight, > while the XDP RX rings are being reconfigured. I wonder if this is a > ixgbe driver XDP-bug? I think it would be best to add some > RCU-barrier, after ixgbe_setup_tc(). > Actually I think a synchronize_sched() is needed, after the IXGBE_DOWN bit is set but before the xdp_tx queues are cleaned up. In practice the ixgbe_down/up sequence has so many msleep() operations for napi cleanup and hardware sync I would be really surprised we ever hit this. But for correctness we should likely add it. > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > index ed97aa81a850..4872fbb54ecd 100644 > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c > @@ -9801,7 +9804,18 @@ static int ixgbe_xdp_setup(struct net_device *dev, > struct bpf_prog *prog) > > /* If transitioning XDP modes reconfigure rings */ > if (!!prog != !!old_prog) { > - int err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev)); > + // XXX: Warn pkts can be in-flight in old_prog > + // while ixgbe_setup_tc() calls ixgbe_close(dev). > + // > + // Should we avoid these in-flight packets? > + // Would it be enough to add an synchronize_rcu() > + // or rcu_barrier()? > + // or do we need an napi_synchronize() call here? > + // > + int err; > + netdev_info(dev, > + "Calling ixgbe_setup_tc() to reconfig XDP > rings\n"); > + err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev)); > > if (err) { >
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Tue, 11 Jul 2017 10:48:29 -0700 John Fastabendwrote: > On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: > > On Sat, 8 Jul 2017 21:06:17 +0200 > > Jesper Dangaard Brouer wrote: > > > >> My plan is to test this latest patchset again, Monday and Tuesday. > >> I'll try to assess stability and provide some performance numbers. > > > > Performance numbers: > > > > 14378479 pkt/s = XDP_DROP without touching memory > > 9222401 pkt/s = xdp1: XDP_DROP with reading packet data > > 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > > 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate > > XDP_TX) > > 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap > > > > The performance drop between xdp2 and xdp_redirect, was expected due > > to the HW-tailptr flush per packet, which is costly. > > > > (1/6344472-1/4595574)*10^9 = -59.98 ns > > > > The performance drop between xdp2 and xdp_redirect_map, is higher than > > I expected, which is not good! The avoidance of the tailptr flush per > > packet was expected to give a higher boost. The cost increased with > > 40 ns, which is too high compared to the code added (on a 4GHz machine > > approx 160 cycles). > > > > (1/6344472-1/5066243)*10^9 = -39.77 ns > > > > This system doesn't have DDIO, thus we are stalling on cache-misses, > > but I was actually expecting that the added code could "hide" behind > > these cache-misses. > > > > I'm somewhat surprised to see this large a performance drop. > > > > Yep, although there is room for optimizations in the code path for sure. And > 5mpps is not horrible my preference is to get this series in plus any > small optimization we come up with while the merge window is closed. Then > follow up patches can do optimizations. IMHO 5Mpps is a very bad number for XDP. > One easy optimization is to get rid of the atomic bitops. They are not needed > here we have a per cpu unsigned long. Another easy one would be to move > some of the checks out of the hotpath. For example checking for ndo_xdp_xmit > and flush ops on the net device in the hotpath really should be done in the > slow path. I'm already running with a similar patch as below, but it (surprisingly) only gave my 3 ns improvement. I also tried a prefetchw() on xdp.data that gave me 10 ns (which is quite good). I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which have DDIO ... I have high hopes for this, as the major bottleneck on this CPU i7-4790K CPU @ 4.00GHz is clearly cache-misses. Something is definitely wrong on this CPU, as perf stats shows, a very bad utilization of the CPU pipeline with 0.89 insn per cycle. > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index 1191060..899364d 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -213,7 +213,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 key) > struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); > unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed); > > - set_bit(key, bitmap); > + __set_bit(key, bitmap); > } > > struct net_device *__dev_map_lookup_elem(struct bpf_map *map, u32 key) > @@ -253,7 +253,7 @@ void __dev_map_flush(struct bpf_map *map) > > netdev = dev->dev; > > - clear_bit(bit, bitmap); > + __clear_bit(bit, bitmap); > if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush)) > continue; > > @@ -287,7 +287,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev > *old_dev) > > for_each_online_cpu(cpu) { > bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, > cpu); > - clear_bit(old_dev->key, bitmap); > + __clear_bit(old_dev->key, bitmap); > > fl->netdev_ops->ndo_xdp_flush(old_dev->dev); > } > -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote: > On Sat, 8 Jul 2017 21:06:17 +0200 > Jesper Dangaard Brouerwrote: > >> My plan is to test this latest patchset again, Monday and Tuesday. >> I'll try to assess stability and provide some performance numbers. > > Performance numbers: > > 14378479 pkt/s = XDP_DROP without touching memory > 9222401 pkt/s = xdp1: XDP_DROP with reading packet data > 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) > 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate > XDP_TX) > 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap > > The performance drop between xdp2 and xdp_redirect, was expected due > to the HW-tailptr flush per packet, which is costly. > > (1/6344472-1/4595574)*10^9 = -59.98 ns > > The performance drop between xdp2 and xdp_redirect_map, is higher than > I expected, which is not good! The avoidance of the tailptr flush per > packet was expected to give a higher boost. The cost increased with > 40 ns, which is too high compared to the code added (on a 4GHz machine > approx 160 cycles). > > (1/6344472-1/5066243)*10^9 = -39.77 ns > > This system doesn't have DDIO, thus we are stalling on cache-misses, > but I was actually expecting that the added code could "hide" behind > these cache-misses. > > I'm somewhat surprised to see this large a performance drop. > Yep, although there is room for optimizations in the code path for sure. And 5mpps is not horrible my preference is to get this series in plus any small optimization we come up with while the merge window is closed. Then follow up patches can do optimizations. One easy optimization is to get rid of the atomic bitops. They are not needed here we have a per cpu unsigned long. Another easy one would be to move some of the checks out of the hotpath. For example checking for ndo_xdp_xmit and flush ops on the net device in the hotpath really should be done in the slow path. diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 1191060..899364d 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -213,7 +213,7 @@ void __dev_map_insert_ctx(struct bpf_map *map, u32 key) struct bpf_dtab *dtab = container_of(map, struct bpf_dtab, map); unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed); - set_bit(key, bitmap); + __set_bit(key, bitmap); } struct net_device *__dev_map_lookup_elem(struct bpf_map *map, u32 key) @@ -253,7 +253,7 @@ void __dev_map_flush(struct bpf_map *map) netdev = dev->dev; - clear_bit(bit, bitmap); + __clear_bit(bit, bitmap); if (unlikely(!netdev || !netdev->netdev_ops->ndo_xdp_flush)) continue; @@ -287,7 +287,7 @@ static void dev_map_flush_old(struct bpf_dtab_netdev *old_dev) for_each_online_cpu(cpu) { bitmap = per_cpu_ptr(old_dev->dtab->flush_needed, cpu); - clear_bit(old_dev->key, bitmap); + __clear_bit(old_dev->key, bitmap); fl->netdev_ops->ndo_xdp_flush(old_dev->dev); }
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Sat, 8 Jul 2017 21:06:17 +0200 Jesper Dangaard Brouerwrote: > My plan is to test this latest patchset again, Monday and Tuesday. > I'll try to assess stability and provide some performance numbers. Performance numbers: 14378479 pkt/s = XDP_DROP without touching memory 9222401 pkt/s = xdp1: XDP_DROP with reading packet data 6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt) 4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate XDP_TX) 5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap The performance drop between xdp2 and xdp_redirect, was expected due to the HW-tailptr flush per packet, which is costly. (1/6344472-1/4595574)*10^9 = -59.98 ns The performance drop between xdp2 and xdp_redirect_map, is higher than I expected, which is not good! The avoidance of the tailptr flush per packet was expected to give a higher boost. The cost increased with 40 ns, which is too high compared to the code added (on a 4GHz machine approx 160 cycles). (1/6344472-1/5066243)*10^9 = -39.77 ns This system doesn't have DDIO, thus we are stalling on cache-misses, but I was actually expecting that the added code could "hide" behind these cache-misses. I'm somewhat surprised to see this large a performance drop. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer Results:: # XDP_DROP with reading packet data [jbrouer@canyon bpf]$ sudo ./xdp1 3 proto 17:6449727 pkt/s proto 17:9222639 pkt/s proto 17:9222401 pkt/s proto 17:9223083 pkt/s proto 17:9223515 pkt/s proto 17:9222477 pkt/s ^C # XDP_TX with swap mac [jbrouer@canyon bpf]$ sudo ./xdp2 3 proto 17: 934682 pkt/s proto 17:6344845 pkt/s proto 17:6344472 pkt/s proto 17:6345265 pkt/s proto 17:6345238 pkt/s proto 17:6345338 pkt/s ^C # XDP_REDIRECT with swap mac (simulate XDP_TX via same ifindex) [jbrouer@canyon bpf]$ sudo ./xdp_redirect 3 3 ifindex 3: 749567 pkt/s ifindex 3:4595025 pkt/s ifindex 3:4595574 pkt/s ifindex 3:4595429 pkt/s ifindex 3:4595340 pkt/s ifindex 3:4595352 pkt/s ifindex 3:4595364 pkt/s ^C # XDP_REDIRECT with swap mac + devmap (still simulate XDP_TX) [jbrouer@canyon bpf]$ sudo ./xdp_redirect_map 3 3 map[0] (vports) = 4, map[1] (map) = 5, map[2] (count) = 0 ifindex 3:3076506 pkt/s ifindex 3:5066282 pkt/s ifindex 3:5066243 pkt/s ifindex 3:5067376 pkt/s ifindex 3:5067226 pkt/s ifindex 3:5067622 pkt/s My own tools:: [jbrouer@canyon prototype-kernel]$ sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \ --action XDP_DROP XDP_action ppspps-human-readable mem XDP_DROP 0 0 no_touch XDP_DROP 98944019,894,401 no_touch XDP_DROP 14377459 14,377,459 no_touch XDP_DROP 14378228 14,378,228 no_touch XDP_DROP 14378400 14,378,400 no_touch XDP_DROP 14378319 14,378,319 no_touch XDP_DROP 14378479 14,378,479 no_touch XDP_DROP 14377332 14,377,332 no_touch XDP_DROP 14378411 14,378,411 no_touch XDP_DROP 14378095 14,378,095 no_touch ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1 [jbrouer@canyon prototype-kernel]$ sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \ --action XDP_DROP --read XDP_action ppspps-human-readable mem XDP_DROP 0 0 read XDP_DROP 69941146,994,114 read XDP_DROP 89794148,979,414 read XDP_DROP 89796368,979,636 read XDP_DROP 89800878,980,087 read XDP_DROP 89790978,979,097 read XDP_DROP 89789708,978,970 read ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1 [jbrouer@canyon prototype-kernel]$ sudo ./xdp_bench01_mem_access_cost --dev ixgbe1 --sec 2 \ --action XDP_TX --swap --read XDP_action ppspps-human-readable mem XDP_TX 0 0 swap_mac XDP_TX 21415562,141,556 swap_mac XDP_TX 61719846,171,984 swap_mac XDP_TX 61719556,171,955 swap_mac XDP_TX 61717676,171,767 swap_mac XDP_TX 61716806,171,680 swap_mac XDP_TX 61722016,172,201 swap_mac ^CInterrupted: Removing XDP program on ifindex:3 device:ixgbe1 Setting tuned-adm network-latency :: $ sudo tuned-adm list [...] Current active profile: network-latency
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Mon, 10 Jul 2017 17:59:17 -0700 John Fastabendwrote: > On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote: > > On Sat, 8 Jul 2017 21:06:17 +0200 > > Jesper Dangaard Brouer wrote: > > > >> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) > >> David Miller wrote: > >> > >>> From: John Fastabend > >>> Date: Fri, 07 Jul 2017 10:48:36 -0700 > >>> > On 07/07/2017 10:34 AM, John Fastabend wrote: > > This series adds two new XDP helper routines bpf_redirect() and > > bpf_redirect_map(). The first variant bpf_redirect() is meant > > to be used the same way it is currently being used by the cls_bpf > > classifier. An xdp packet will be redirected immediately when this > > is called. > > Also other than the typo in the title there ;) I'm going to CC > the driver maintainers working on XDP (makes for a long CC list but) > because we would want to try and get support in as many as possible in > the next merge window. > > For this rev I just implemented on ixgbe because I wrote the > original XDP support there. I'll volunteer to do virtio as well. > >>> > >>> I went over this series a few times and it looks great to me. > >>> You didn't even give me some coding style issues to pick on :-) > >> > >> We (Daniel, Andy and I) have been reviewing and improving on this > >> patchset the last couple of weeks ;-). We had some stability issues, > >> which is why it wasn't published earlier. My plan is to test this > >> latest patchset again, Monday and Tuesday. I'll try to assess stability > >> and provide some performance numbers. > > > > > > Damn, I though it was stable, I have been running a lot of performance > > tests, and then this just happened :-( > > Thanks, I'll take a look through the code and see if I can come up with > why this might happen. I haven't hit it on my tests yet though. I've figured out why this happens, and I have a fix, see patch below with some comments with questions. The problem is that we can leak map_to_flush in an error path, the fix: diff --git a/net/core/filter.c b/net/core/filter.c index 2ccd6ff09493..7f1f48668dcf 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2497,11 +2497,14 @@ int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp, ri->map = NULL; trace_xdp_redirect(dev, fwd, xdp_prog, XDP_REDIRECT); - + // Q: Should we also trace "goto out" (failed lookup)? + //like bpf_warn_invalid_xdp_redirect(); return __bpf_tx_xdp(fwd, map, xdp, index); out: ri->ifindex = 0; - ri->map = NULL; + // XXX: here we could leak ri->map_to_flush, which could be + // picked up later by xdp_do_flush_map() + xdp_do_flush_map(); /* Clears ri->map_to_flush + ri->map */ return -EINVAL; While debugging this, I noticed that we can have packets in-flight, while the XDP RX rings are being reconfigured. I wonder if this is a ixgbe driver XDP-bug? I think it would be best to add some RCU-barrier, after ixgbe_setup_tc(). diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index ed97aa81a850..4872fbb54ecd 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -9801,7 +9804,18 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog) /* If transitioning XDP modes reconfigure rings */ if (!!prog != !!old_prog) { - int err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev)); + // XXX: Warn pkts can be in-flight in old_prog + // while ixgbe_setup_tc() calls ixgbe_close(dev). + // + // Should we avoid these in-flight packets? + // Would it be enough to add an synchronize_rcu() + // or rcu_barrier()? + // or do we need an napi_synchronize() call here? + // + int err; + netdev_info(dev, + "Calling ixgbe_setup_tc() to reconfig XDP rings\n"); + err = ixgbe_setup_tc(dev, netdev_get_num_tc(dev)); if (err) { rcu_assign_pointer(adapter->xdp_prog, old_prog); -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/10/2017 11:30 AM, Jesper Dangaard Brouer wrote: > On Sat, 8 Jul 2017 21:06:17 +0200 > Jesper Dangaard Brouerwrote: > >> On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) >> David Miller wrote: >> >>> From: John Fastabend >>> Date: Fri, 07 Jul 2017 10:48:36 -0700 >>> On 07/07/2017 10:34 AM, John Fastabend wrote: > This series adds two new XDP helper routines bpf_redirect() and > bpf_redirect_map(). The first variant bpf_redirect() is meant > to be used the same way it is currently being used by the cls_bpf > classifier. An xdp packet will be redirected immediately when this > is called. Also other than the typo in the title there ;) I'm going to CC the driver maintainers working on XDP (makes for a long CC list but) because we would want to try and get support in as many as possible in the next merge window. For this rev I just implemented on ixgbe because I wrote the original XDP support there. I'll volunteer to do virtio as well. >>> >>> I went over this series a few times and it looks great to me. >>> You didn't even give me some coding style issues to pick on :-) >> >> We (Daniel, Andy and I) have been reviewing and improving on this >> patchset the last couple of weeks ;-). We had some stability issues, >> which is why it wasn't published earlier. My plan is to test this >> latest patchset again, Monday and Tuesday. I'll try to assess stability >> and provide some performance numbers. > > > Damn, I though it was stable, I have been running a lot of performance > tests, and then this just happened :-( Thanks, I'll take a look through the code and see if I can come up with why this might happen. I haven't hit it on my tests yet though. .John
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Sat, 8 Jul 2017 21:06:17 +0200 Jesper Dangaard Brouerwrote: > On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) > David Miller wrote: > > > From: John Fastabend > > Date: Fri, 07 Jul 2017 10:48:36 -0700 > > > > > On 07/07/2017 10:34 AM, John Fastabend wrote: > > >> This series adds two new XDP helper routines bpf_redirect() and > > >> bpf_redirect_map(). The first variant bpf_redirect() is meant > > >> to be used the same way it is currently being used by the cls_bpf > > >> classifier. An xdp packet will be redirected immediately when this > > >> is called. > > > > > > Also other than the typo in the title there ;) I'm going to CC > > > the driver maintainers working on XDP (makes for a long CC list but) > > > because we would want to try and get support in as many as possible in > > > the next merge window. > > > > > > For this rev I just implemented on ixgbe because I wrote the > > > original XDP support there. I'll volunteer to do virtio as well. > > > > I went over this series a few times and it looks great to me. > > You didn't even give me some coding style issues to pick on :-) > > We (Daniel, Andy and I) have been reviewing and improving on this > patchset the last couple of weeks ;-). We had some stability issues, > which is why it wasn't published earlier. My plan is to test this > latest patchset again, Monday and Tuesday. I'll try to assess stability > and provide some performance numbers. Damn, I though it was stable, I have been running a lot of performance tests, and then this just happened :-( [11357.149486] BUG: unable to handle kernel NULL pointer dereference at 0210 [11357.157393] IP: __dev_map_flush+0x58/0x90 [11357.161446] PGD 3ff685067 [11357.161446] P4D 3ff685067 [11357.164199] PUD 3ff684067 [11357.166947] PMD 0 [11357.170396] [11357.173981] Oops: [#1] PREEMPT SMP [11357.177859] Modules linked in: coretemp kvm_intel kvm irqbypass intel_cstate intel_uncore intel_rapl_perf mxm_wmi i2c_i801 pcspkr sg i2c_co] [11357.203021] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-net-next-xdp_redirect02-rfc+ #135 [11357.211706] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015 [11357.221606] task: 81c0e480 task.stack: 81c0 [11357.227568] RIP: 0010:__dev_map_flush+0x58/0x90 [11357.232138] RSP: 0018:88041fa03de0 EFLAGS: 00010286 [11357.237409] RAX: RBX: 8803fc996600 RCX: 0003 [11357.244589] RDX: 88040d0bf480 RSI: RDI: 81d901d8 [11357.251767] RBP: 88041fa03df8 R08: fffc R09: 00070008 [11357.258940] R10: 813f11d0 R11: 0040 R12: e8c014a0 [11357.266119] R13: 0003 R14: 003c R15: 8803fc9c26c0 [11357.273294] FS: () GS:88041fa0() knlGS: [11357.281454] CS: 0010 DS: ES: CR0: 80050033 [11357.287244] CR2: 0210 CR3: 0003fc41e000 CR4: 001406f0 [11357.294423] Call Trace: [11357.296912] [11357.298967] xdp_do_flush_map+0x36/0x40 [11357.302847] ixgbe_poll+0x7ea/0x1370 [ixgbe] [11357.307160] net_rx_action+0x247/0x3e0 [11357.310957] __do_softirq+0x106/0x2cb [11357.314664] irq_exit+0xbe/0xd0 [11357.317851] do_IRQ+0x4f/0xd0 [11357.320858] common_interrupt+0x86/0x86 [11357.324733] RIP: 0010:poll_idle+0x2f/0x5a [11357.328781] RSP: 0018:81c03dd0 EFLAGS: 0246 ORIG_RAX: ff8e [11357.336426] RAX: RBX: 81d689e0 RCX: 0020 [11357.343605] RDX: RSI: 81c0e480 RDI: 88041fa22800 [11357.350783] RBP: 81c03dd0 R08: 03c5 R09: 0018 [11357.357958] R10: 0327 R11: 0390 R12: 88041fa22800 [11357.365135] R13: 81d689f8 R14: R15: 81d689e0 [11357.372311] [11357.374453] cpuidle_enter_state+0xf2/0x2e0 [11357.378678] cpuidle_enter+0x17/0x20 [11357.382299] call_cpuidle+0x23/0x40 [11357.385834] do_idle+0xe8/0x190 [11357.389024] cpu_startup_entry+0x1d/0x20 [11357.392993] rest_init+0xd5/0xe0 [11357.396268] start_kernel+0x3d7/0x3e4 [11357.399979] x86_64_start_reservations+0x2a/0x2c [11357.404641] x86_64_start_kernel+0x178/0x18b [11357.408959] secondary_startup_64+0x9f/0x9f [11357.413186] ? secondary_startup_64+0x9f/0x9f [11357.417589] Code: 41 89 c5 48 8b 53 60 44 89 e8 48 8d 14 c2 48 8b 12 48 85 d2 74 23 48 8b 3a f0 49 0f b3 04 24 48 85 ff 74 15 48 8b 87 e0 0 [11357.436565] RIP: __dev_map_flush+0x58/0x90 RSP: 88041fa03de0 [11357.442613] CR2: 0210 [11357.445972] ---[ end trace f7ed232095169a98 ]--- [11357.450637] Kernel panic - not syncing: Fatal exception in interrupt [11357.457038] Kernel Offset: disabled [11357.460566] ---[ end Kernel panic - not syncing: Fatal exception in interrupt (gdb) list
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On Sat, 08 Jul 2017 10:46:18 +0100 (WEST) David Millerwrote: > From: John Fastabend > Date: Fri, 07 Jul 2017 10:48:36 -0700 > > > On 07/07/2017 10:34 AM, John Fastabend wrote: > >> This series adds two new XDP helper routines bpf_redirect() and > >> bpf_redirect_map(). The first variant bpf_redirect() is meant > >> to be used the same way it is currently being used by the cls_bpf > >> classifier. An xdp packet will be redirected immediately when this > >> is called. > > > > Also other than the typo in the title there ;) I'm going to CC > > the driver maintainers working on XDP (makes for a long CC list but) > > because we would want to try and get support in as many as possible in > > the next merge window. > > > > For this rev I just implemented on ixgbe because I wrote the > > original XDP support there. I'll volunteer to do virtio as well. > > I went over this series a few times and it looks great to me. > You didn't even give me some coding style issues to pick on :-) We (Daniel, Andy and I) have been reviewing and improving on this patchset the last couple of weeks ;-). We had some stability issues, which is why it wasn't published earlier. My plan is to test this latest patchset again, Monday and Tuesday. I'll try to assess stability and provide some performance numbers. I've complained/warned about the danger of redirecting with XDP, without providing (1) a way to debug/see XDP redirects, (2) a way interfaces opt-in they can be redirected. (1) is solved by patch-07/12 via a tracepoint. (2) is currently done by only forwarding to interfaces with an XDP program loaded itself, this also comes from a practical need for NIC drivers to allocate XDP-TX HW queues. I'm not satisfied with the (UAPI) name for the new map "BPF_MAP_TYPE_DEVMAP" and the filename this is placed in "kernel/bpf/devmap.c", as we want to take advantage of compiler inlining for the next redirect map types. (1) because the name doesn't tell the user that this map is connected to the redirect_map call. (2) we want to introduce other kinds of redirect maps (like redirect to CPUs and sockets), and it would be good if they shared a common "text" string. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
From: John FastabendDate: Fri, 07 Jul 2017 10:48:36 -0700 > On 07/07/2017 10:34 AM, John Fastabend wrote: >> This series adds two new XDP helper routines bpf_redirect() and >> bpf_redirect_map(). The first variant bpf_redirect() is meant >> to be used the same way it is currently being used by the cls_bpf >> classifier. An xdp packet will be redirected immediately when this >> is called. > > Also other than the typo in the title there ;) I'm going to CC > the driver maintainers working on XDP (makes for a long CC list but) > because we would want to try and get support in as many as possible in > the next merge window. > > For this rev I just implemented on ixgbe because I wrote the > original XDP support there. I'll volunteer to do virtio as well. I went over this series a few times and it looks great to me. You didn't even give me some coding style issues to pick on :-)
Re: [RFC PATCH 00/12] Implement XDP bpf_redirect vairants
On 07/07/2017 10:34 AM, John Fastabend wrote: > This series adds two new XDP helper routines bpf_redirect() and > bpf_redirect_map(). The first variant bpf_redirect() is meant > to be used the same way it is currently being used by the cls_bpf > classifier. An xdp packet will be redirected immediately when this > is called. Also other than the typo in the title there ;) I'm going to CC the driver maintainers working on XDP (makes for a long CC list but) because we would want to try and get support in as many as possible in the next merge window. For this rev I just implemented on ixgbe because I wrote the original XDP support there. I'll volunteer to do virtio as well. Thanks, John
[RFC PATCH 00/12] Implement XDP bpf_redirect vairants
This series adds two new XDP helper routines bpf_redirect() and bpf_redirect_map(). The first variant bpf_redirect() is meant to be used the same way it is currently being used by the cls_bpf classifier. An xdp packet will be redirected immediately when this is called. The other variant bpf_redirect_map(map, key, flags) uses a new map type called devmap. A devmap uses integers as keys and net_devices as values. The user provies key/ifindex pairs to update the map with new net_devices. This provides two benefits over the normal variant 'bpf_redirect()'. First the datapath bpf program is abstracted away from using hard-coded ifindex values. Allowing a single bpf program to be run any many different environments. Second, and perhaps more important, the map enables batching packet transmits. The map plus small driver changes allows for batching all send requests across a NAPI poll loop. This allows driver writers to optimize the driver xmit path and only call expensive operations once for a batch of xdp_buffs. The devmap was designed with possible future work to support multicast and broadcast as following patches. To see, in more detail, how to leverage the new helpers and map from the userspace side please review these two patches, xdp: sample program for new bpf_redirect helper xdp: bpf redirect with map sample program I'm sending this as an RFC now because (a) the merge window is closed so it seems like a good time to get feedback and (b) I'm currently doing a last round of testing to ensure all the features are still working after latest revisions as well as doing a final review. Any feedback would be welcome. Thanks to Jesper, Andy, and Daniel for all their input, patches, fixes, testing, review, etc. so far it is very much appreciated! Thanks, John --- John Fastabend (12): ixgbe: NULL xdp_tx rings on resource cleanup net: xdp: support xdp generic on virtual devices xdp: add bpf_redirect helper function xdp: sample program for new bpf_redirect helper net: implement XDP_REDIRECT for xdp generic ixgbe: add initial support for xdp redirect xdp: add trace event for xdp redirect bpf: add devmap, a map for storing net device references bpf: add bpf_redirect_map helper routine xdp: Add batching support to redirect map net: add notifier hooks for devmap bpf map xdp: bpf redirect with map sample program drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c |8 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 60 +++ include/linux/bpf.h |5 include/linux/bpf_types.h |3 include/linux/filter.h| 14 + include/linux/netdevice.h | 11 + include/trace/events/xdp.h| 31 ++ include/uapi/linux/bpf.h | 10 + kernel/bpf/Makefile |3 kernel/bpf/devmap.c | 431 + kernel/bpf/verifier.c | 14 + net/core/dev.c| 226 - net/core/filter.c | 172 ++ samples/bpf/Makefile |8 samples/bpf/bpf_helpers.h |2 samples/bpf/xdp_redirect_kern.c | 81 + samples/bpf/xdp_redirect_map_kern.c | 83 + samples/bpf/xdp_redirect_map_user.c | 105 ++ samples/bpf/xdp_redirect_user.c | 102 ++ tools/testing/selftests/bpf/test_maps.c | 15 + 20 files changed, 1283 insertions(+), 101 deletions(-) create mode 100644 kernel/bpf/devmap.c create mode 100644 samples/bpf/xdp_redirect_kern.c create mode 100644 samples/bpf/xdp_redirect_map_kern.c create mode 100644 samples/bpf/xdp_redirect_map_user.c create mode 100644 samples/bpf/xdp_redirect_user.c -- Signature