On 07/11/2017 11:44 AM, Jesper Dangaard Brouer wrote:
> On Tue, 11 Jul 2017 20:01:36 +0200
> Jesper Dangaard Brouer <bro...@redhat.com> wrote:
> 
>> On Tue, 11 Jul 2017 10:48:29 -0700
>> John Fastabend <john.fastab...@gmail.com> wrote:
>> 
>>> On 07/11/2017 08:36 AM, Jesper Dangaard Brouer wrote:
>>>> On Sat, 8 Jul 2017 21:06:17 +0200
>>>> Jesper Dangaard Brouer <bro...@redhat.com> wrote:
>>>> 
>>>>> My plan is to test this latest patchset again, Monday and Tuesday.
>>>>> I'll try to assess stability and provide some performance numbers.
>>>> 
>>>> Performance numbers:
>>>> 
>>>> 14378479 pkt/s = XDP_DROP without touching memory
>>>>  9222401 pkt/s = xdp1: XDP_DROP with reading packet data
>>>>  6344472 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt)
>>>>  4595574 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (simulate XDP_TX)
>>>>  5066243 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
>>>> 
>>>> The performance drop between xdp2 and xdp_redirect was expected due
>>>> to the HW-tailptr flush per packet, which is costly.
>>>> 
>>>>  (1/6344472-1/4595574)*10^9 = -59.98 ns
>>>> 
>>>> The performance drop between xdp2 and xdp_redirect_map is higher than
>>>> I expected, which is not good!  The avoidance of the tailptr flush per
>>>> packet was expected to give a higher boost.  The cost increased by
>>>> 40 ns, which is too high compared to the code added (on a 4GHz machine
>>>> approx 160 cycles).
>>>> 
>>>>  (1/6344472-1/5066243)*10^9 = -39.77 ns
>>>> 
>>>> This system doesn't have DDIO, thus we are stalling on cache-misses,
>>>> but I was actually expecting that the added code could "hide" behind
>>>> these cache-misses.
>>>> 
>>>> I'm somewhat surprised to see this large a performance drop.
>>> 
>>> Yep, although there is room for optimizations in the code path for
>>> sure.  And 5Mpps is not horrible; my preference is to get this series
>>> in, plus any small optimizations we come up with while the merge
>>> window is closed.  Then follow-up patches can do optimizations.
>> 
>> IMHO 5Mpps is a very bad number for XDP.
>> 
>>> One easy optimization is to get rid of the atomic bitops.  They are
>>> not needed here; we have a per-CPU unsigned long.  Another easy one
>>> would be to move some of the checks out of the hotpath.  For example,
>>> checking for the ndo_xdp_xmit and flush ops on the net device in the
>>> hotpath really should be done in the slow path.
>> 
>> I'm already running with a similar patch as below, but it
>> (surprisingly) only gave me a 3 ns improvement.  I also tried a
>> prefetchw() on xdp.data that gave me 10 ns (which is quite good).
>> 
>> I'm booting up another system with a CPU E5-1650 v4 @ 3.60GHz, which
>> has DDIO ... I have high hopes for this, as the major bottleneck on
>> this i7-4790K CPU @ 4.00GHz is clearly cache-misses.
>> 
>> Something is definitely wrong on this CPU, as perf stats show a very
>> bad utilization of the CPU pipeline, with 0.89 insn per cycle.
> 
> Wow, getting DDIO working and avoiding the cache-miss was really
> _the_ issue.  On this CPU E5-1650 v4 @ 3.60GHz things look really
> really good for XDP_REDIRECT with maps. (p.s. with __set_bit()
> optimization)
> 
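For anyone following along: I'm assuming the __set_bit() optimization in
the p.s. is the non-atomic bitop change I suggested above.  Roughly
something like the sketch below -- the struct and function names are
illustrative, not the actual devmap patch.  Since the flush bitmap is
percpu and only ever touched from the NAPI/softirq context that owns it,
the locked set_bit()/clear_bit() can be swapped for the plain variants.

#include <linux/bitops.h>
#include <linux/percpu.h>
#include <linux/types.h>

/* Sketch only: illustrative names, not the actual patch.  The flush
 * bitmap is percpu and only accessed by the CPU that owns it (from
 * the XDP/NAPI softirq), so no other CPU can race with us and the
 * atomic (locked) bitops are unnecessary.
 */
struct my_dtab {
        unsigned long __percpu *flush_needed;   /* one bitmap per CPU */
};

static void my_dtab_mark_for_flush(struct my_dtab *dtab, u32 bit)
{
        unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);

        __set_bit(bit, bitmap);         /* was: set_bit(bit, bitmap) */
}

static void my_dtab_flush_done(struct my_dtab *dtab, u32 bit)
{
        unsigned long *bitmap = this_cpu_ptr(dtab->flush_needed);

        __clear_bit(bit, bitmap);       /* was: clear_bit(bit, bitmap) */
}

Dropping the lock prefix should make the bit marking essentially free,
which lines up with the small (~3 ns) delta you saw from a similar change.
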
Very nice :) This was with the prefetchw() removed, right?

> 13,939,674 pkt/s = XDP_DROP without touching memory
> 14,290,650 pkt/s = xdp1: XDP_DROP with reading packet data
> 13,221,812 pkt/s = xdp2: XDP_TX with swap mac (writes into pkt)
>  7,596,576 pkt/s = xdp_redirect: XDP_REDIRECT with swap mac (like XDP_TX)
> 13,058,435 pkt/s = xdp_redirect_map: XDP_REDIRECT with swap mac + devmap
> 
> Surprisingly, on this DDIO-capable CPU it is slightly slower NOT to
> read packet memory.
> 
> The large performance gap to xdp_redirect is due to the tailptr flush,
> which really shows up on this system.  The CPU efficiency is 1.36 insn
> per cycle, while for the map variant it is 2.15 insn per cycle.
> 
>  Gap (1/13221812-1/7596576)*10^9 = -56 ns
> 
> The xdp_redirect_map performance is really really good, almost 10G
> wirespeed on a single CPU!!!  This is amazing, and we know that this
> code is not even optimal yet.  The performance difference to xdp2 is
> only around 1 ns.

Great, yeah there are some more likely()/unlikely() hints we could add,
and we can also remove some of the checks in the hotpath, etc. (rough
sketches of what I mean below).  Thanks for doing this!
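To make the hotpath-checks point concrete, here is a rough sketch of what
I have in mind -- hypothetical helper names, not a patch against the
series.  The ndo_xdp_xmit/ndo_xdp_flush capability check moves to
map-update time (slow path), so the per-packet enqueue path only keeps
the branch it really needs, annotated with unlikely():

#include <linux/errno.h>
#include <linux/filter.h>
#include <linux/netdevice.h>

/* Slow path: called when a netdev is inserted into the devmap.  Reject
 * devices that cannot do XDP xmit here, once, instead of re-checking
 * the ndo pointers for every redirected packet.
 */
static int my_devmap_check_dev(struct net_device *dev)
{
        if (!dev->netdev_ops->ndo_xdp_xmit || !dev->netdev_ops->ndo_xdp_flush)
                return -EOPNOTSUPP;
        return 0;
}

/* Hot path: per packet.  Only the "empty map slot" branch is left, and
 * it is marked unlikely() since a populated slot is the common case.
 */
static int my_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp)
{
        if (unlikely(!dev))
                return -EINVAL;

        /* no ndo_xdp_xmit NULL-check needed here anymore */
        return dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
}
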
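And for reference, the tailptr-flush gap measured above (-56 ns for plain
xdp_redirect vs. the map variant) comes down to where the doorbell write
happens.  A rough sketch of the difference -- the wrapper functions below
are made up, only ndo_xdp_xmit()/ndo_xdp_flush() are the real hooks:
without a map every packet ends with a flush, while the devmap only marks
the destination in its percpu bitmap and flushes each device once at the
end of the NAPI poll.

#include <linux/filter.h>
#include <linux/netdevice.h>

/* xdp_redirect (no map): the tailptr/doorbell is written per packet. */
static void sketch_redirect_one(struct net_device *to, struct xdp_buff *xdp)
{
        to->netdev_ops->ndo_xdp_xmit(to, xdp);
        to->netdev_ops->ndo_xdp_flush(to);      /* tailptr write every packet */
}

/* xdp_redirect_map: packets are only queued during the RX loop ... */
static void sketch_redirect_map_one(struct net_device *to, struct xdp_buff *xdp)
{
        to->netdev_ops->ndo_xdp_xmit(to, xdp);  /* no doorbell here */
        /* the devmap marks 'to' in its percpu flush bitmap instead */
}

/* ... and each destination gets a single tailptr write when the driver
 * flushes at the end of its NAPI poll.
 */
static void sketch_flush_at_napi_end(struct net_device **devs, int n)
{
        int i;

        for (i = 0; i < n; i++)
                devs[i]->netdev_ops->ndo_xdp_flush(devs[i]);
}
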