On Tue, 15 May 2018 21:06:03 +0200 Björn Töpel <bjorn.to...@gmail.com> wrote:
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TX/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputting packets at full 40 Gbit/s line rate. The results are without
> retpoline so that we can compare against previous numbers.
>
> AF_XDP performance 64 byte packets. Results from the AF_XDP V3 patch
> set are also reported for ease of reference.
>
> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> rxdrop      2.9*      9.6*      21.5
> txpush      2.6*      -         21.6
> l2fwd       1.9*      2.5*      15.0

These performance numbers are actually amazing. At these amazing/crazy
speeds we are approaching the speed of light (light travels about 30 cm
in 1 nanosec), so we have to view the numbers differently, because we
are actually working on a nanosec scale.

21.5 Mpps is 46.5 nanosec per packet. If we want to optimize for
+1 Mpps (1/22.5*10^3 = 44.44 ns), you actually only have to shave
2 nanosec off the code. On this 2.0 GHz CPU that should in theory be
only 4 cycles, but we likely retire more than one instruction per cycle
(I see around 2.5 ins per cycle), so we are looking at (2*2*2.5)
needing to find 10 instructions for +1 Mpps.

Comparing XDP_DROP at 32.3 Mpps to ZC-rxdrop at 21.5 Mpps, this is
actually only a "slowdown" of 15.55 ns for having the frame travel
through xdp_do_redirect, do the map lookup etc., get queued into
userspace, and have the frame returned back to the kernel. That is
rather amazingly fast.
 (1/21.5*10^3)-(1/32.3*10^3) = 15.55 ns

Another amazing performance number is your l2fwd figure of 15 Mpps,
because it is faster than xdp_redirect_map on i40e NICs on my system,
which runs at 12.2 Mpps (2.8 Mpps slower). Again looking at the nanosec
scale instead, this corresponds to 15.3 ns. I expect this improvement
comes from avoiding page_frag_free and avoiding the TX dma_map call
(as you premap the pages' DMA for TX). Reverse calculating based on
perf percentages, I find that these should only cost 7.18 ns. Maybe the
rest is because you are running TX and TX-DMA completion on another CPU.

I notice you are also using the XDP return-API, which still does a
rhashtable_lookup per frame. I plan to optimize this to do bulking, to
get away from the per-frame lookup, so this should get even faster.

> * From AF_XDP V3 patch set and cover letter.
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> rxdrop      2.1       3.3       3.3
> l2fwd       1.4       1.8       3.1
>
> So why do we not get higher values for RX similar to the 34 Mpps we
> had in AF_PACKET V4? We made an experiment running the rxdrop
> benchmark without using the xdp_do_redirect/flush infrastructure nor
> using an XDP program (all traffic on a queue goes to one
> socket). Instead the driver acts directly on the AF_XDP socket. With
> this we got 36.9 Mpps, a significant improvement without any change to
> the uapi. So not forcing users to have an XDP program if they do not
> need it, might be a good idea. This measurement is actually higher
> than what we got with AF_PACKET V4.
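(Side note: to make the per-packet budget arithmetic in this mail easy
to redo, here is a small stand-alone sketch. It is my own hypothetical
helper, not part of the patch set; the Mpps figures are the ones quoted
in this thread, and the 2.0 GHz clock and ~2.5 instructions/cycle are
the values mentioned above.)

  /*
   * Hypothetical stand-alone helper (my own sketch, not from the patch
   * set) to redo the per-packet time budget arithmetic in this thread.
   * Build: gcc -O2 -o nsbudget nsbudget.c
   */
  #include <stdio.h>

  /* Nanoseconds available per packet at a given rate in Mpps */
  static double ns_per_pkt(double mpps)
  {
          return 1000.0 / mpps;   /* 10^9 ns / (mpps * 10^6 pps) */
  }

  int main(void)
  {
          const double ghz = 2.0; /* CPU clock from the setup above */
          const double ipc = 2.5; /* instructions/cycle I typically see */
          double save;

          printf("ZC rxdrop 21.5 Mpps  => %.2f ns/pkt\n", ns_per_pkt(21.5));

          /* Cost of gaining +1 Mpps on top of 21.5 Mpps */
          save = ns_per_pkt(21.5) - ns_per_pkt(22.5);
          printf("+1 Mpps => save %.2f ns = %.1f cycles = ~%.0f insns\n",
                 save, save * ghz, save * ghz * ipc);

          /* XDP_DROP vs ZC rxdrop: redirect infra + userspace round trip */
          printf("32.3 -> 21.5 Mpps    => %.2f ns slowdown\n",
                 ns_per_pkt(21.5) - ns_per_pkt(32.3));

          /* xdp_redirect_map vs ZC l2fwd */
          printf("12.2 -> 15.0 Mpps    => %.2f ns improvement\n",
                 ns_per_pkt(12.2) - ns_per_pkt(15.0));

          return 0;
  }

Running it reproduces the 46.5 ns, ~10 instruction, 15.55 ns and
15.3 ns figures above, and the same ns_per_pkt() deltas apply to the
36.9 Mpps direct-socket number discussed next.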
So, what are you telling me with your number of 36.9 Mpps for
direct-socket-rxdrop...

Compared to XDP_DROP at 32.3 Mpps, are you saying that it only costs
3.86 nanosec to call the XDP bpf_prog which returns XDP_DROP? That is
very impressive actually.
 (1/32.3*10^3)-(1/36.9*10^3) = 3.86 ns

Compared to ZC-AF_XDP rxdrop at 21.5 Mpps, are you saying that the XDP
redirect infrastructure, map lookups etc. (incl. the per-frame
return-API) cost 19.41 nanosec (1/21.5*10^3)-(1/36.9*10^3)? That is
approx 40 clock cycles or 100 (speculative) instructions. Not too bad,
and we are still optimizing this stuff.

> XDP performance on our system as a base line:
>
> 64 byte packets:
>  XDP stats       CPU     pps         issue-pps
>  XDP-RX CPU      16      32.3M       0
>
> 1500 byte packets:
>  XDP stats       CPU     pps         issue-pps
>  XDP-RX CPU      16      3.3M        0

Overall I'm *very* impressed by the performance of ZC AF_XDP. Just
remember that measuring improvements in +N Mpps is actually misleading
when operating at these (light) speeds.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer