On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.d...@oracle.com> wrote:
>
> On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:
>>
>> On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012...@gmail.com> wrote:
>>
>>> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.to...@gmail.com> wrote:
>>>>
>>>> From: Björn Töpel <bjorn.to...@intel.com>
>>>>
>>>> This RFC introduces a new address family called AF_XDP that is optimized for high performance packet processing and zero-copy semantics. Throughput improvements can be up to 20x compared to V2 and V3 for the micro benchmarks included. Would be great to get your feedback on it. Note that this is the follow-up RFC to AF_PACKET V4 from November last year. The feedback from that RFC submission and the presentation at NetdevConf in Seoul was to create a new address family instead of building on top of AF_PACKET. AF_XDP is this new address family.
>>>>
>>>> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor level is that TX and RX descriptors are separated from packet buffers. An RX or TX descriptor points to a data buffer in a packet buffer area. RX and TX can share the same packet buffer so that a packet does not have to be copied between RX and TX. Moreover, if a packet needs to be kept for a while due to a possible retransmit, then the descriptor that points to that packet buffer can be changed to point to another buffer and reused right away. This again avoids copying data.
>>>>
>>>> The RX and TX descriptor rings are registered with the setsockopts XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer area is allocated by user space and registered with the kernel using the new XDP_MEM_REG setsockopt. All these three areas are shared between user space and kernel space. The socket is then bound with a bind() call to a device and a specific queue id on that device, and it is not until bind is completed that traffic starts to flow.
>>>>
>>>> An XDP program can be loaded to direct part of the traffic on that device and queue id to user space through a new redirect action in an XDP program called bpf_xdpsk_redirect that redirects a packet up to the socket in user space. All the other XDP actions work just as before. Note that the current RFC requires the user to load an XDP program to get any traffic to user space (for example all traffic to user space with the one-liner program "return bpf_xdpsk_redirect();"). We plan on introducing a patch that removes this requirement and sends all traffic from a queue to user space if an AF_XDP socket is bound to it.
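(As a concrete sketch of that one-liner: a complete XDP program under this proposal might look roughly as follows. Only bpf_xdpsk_redirect() comes from the cover letter; the includes, section name and license boilerplate are ordinary XDP program scaffolding, and the helper's declaration is assumed to be provided by the RFC's version of bpf_helpers.h.)

#include <linux/bpf.h>
#include "bpf_helpers.h"        /* assumed to declare bpf_xdpsk_redirect() in the RFC's samples */

SEC("xdp_sock")
int xdp_sock_prog(struct xdp_md *ctx)
{
        /* Send every packet arriving on this queue up to the bound AF_XDP socket. */
        return bpf_xdpsk_redirect();
}

char _license[] SEC("license") = "GPL";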
>>>>
>>>> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator, as there is no specific mode called XDP_DRV_ZC). If the driver does not have support for XDP, or XDP_SKB is explicitly chosen when loading the XDP program, XDP_SKB mode is employed. It uses SKBs together with the generic XDP support and copies out the data to user space; it is a fallback mode that works for any network device. On the other hand, if the driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and ndo_xdp_flush), these NDOs, without any modifications, will be used by the AF_XDP code to provide better performance, but there is still a copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP driver support with the zero-copy user space allocator that provides even better performance. In this mode, the networking HW (or SW driver if it is a virtual driver like veth) DMAs/puts packets straight into the packet buffer that is shared between user space and kernel space. The RX and TX descriptor queues of the networking HW are NOT shared to user space. Only the kernel can read and write these, and it is the kernel driver's responsibility to translate these HW specific descriptors to the HW agnostic ones in the virtual descriptor rings that user space sees. This way, a malicious user space program cannot mess with the networking HW. This mode, though, requires some extensions to XDP.
>>>>
>>>> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a buffer pool concept so that the same XDP driver code can be used for buffers allocated using the page allocator (XDP_DRV), the user-space zero-copy allocator (XDP_DRV_ZC), or some internal driver specific allocator/cache/recycling mechanism. The ndo_bpf call has also been extended with two commands for registering and unregistering an XSK socket and is in the RX case mainly used to communicate some information about the user-space buffer pool to the driver.
>>>>
>>>> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush, but we ran into problems with this (further discussion in the challenges section) and had to introduce a new NDO called ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice and an explicit queue id that packets should be sent out on. In contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be sent from the XDP socket (associated with the dev and queue combination that was provided with the NDO call) using a callback (get_tx_packet), and when they have been transmitted it uses another callback (tx_completion) to signal completion of packets. These callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code and thus does not clash with the XDP_REDIRECT use of ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode (without ZC) is currently not supported by TX. Please have a look at the challenges section for further discussions.
>>>>
>>>> The AF_XDP bind call acts on a queue pair (channel in ethtool speak), so the user needs to steer the traffic to the zero-copy enabled queue pair. Which queue to use is up to the user.
>>>>
>>>> For an untrusted application, HW packet steering to a specific queue pair (the one associated with the application) is a requirement, as the application would otherwise be able to see other user space processes' packets. If the HW cannot support the required packet steering, XDP_DRV or XDP_SKB mode has to be used, as these do not expose the NIC's packet buffer to user space; the packets are copied into user space from the NIC's packet buffer in the kernel.
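(For reference, the per-socket setup that an application such as the xdpsock sample below would go through under this proposal might look roughly like the sketch that follows. All struct layouts, field names and numeric constants here are illustrative placeholders for whatever the RFC's if_xdp.h actually defines; only the sequence, i.e. register the packet buffer area, size the RX/TX rings, then bind to a device and queue id, is taken from the description above. Error handling is omitted.)

#include <stddef.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <net/if.h>

/* Placeholder definitions standing in for the RFC's <linux/if_xdp.h>;
 * the real names and values are whatever the patch set installs. */
#ifndef AF_XDP
#define AF_XDP          44      /* illustrative value only */
#endif
#define SOL_XDP         283     /* illustrative value only */
#define XDP_MEM_REG     1       /* illustrative value only */
#define XDP_RX_RING     2       /* illustrative value only */
#define XDP_TX_RING     3       /* illustrative value only */

struct xdp_mr_req {             /* illustrative layout */
        void *addr;             /* start of the packet buffer area */
        size_t len;             /* its size in bytes               */
        unsigned int frame_size;/* size of each packet buffer      */
};

struct sockaddr_xdp {           /* illustrative layout */
        unsigned short sxdp_family;
        unsigned int   sxdp_ifindex;
        unsigned int   sxdp_queue_id;
};

int xsk_setup(const char *ifname, unsigned int queue_id)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);
        int ndescs = 1024;

        /* 1. Register the user-space packet buffer area with the kernel. */
        struct xdp_mr_req mr = {
                .addr = mmap(NULL, 1 << 22, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0),
                .len = 1 << 22,
                .frame_size = 2048,
        };
        setsockopt(fd, SOL_XDP, XDP_MEM_REG, &mr, sizeof(mr));

        /* 2. Create the RX and TX descriptor rings shared with the kernel. */
        setsockopt(fd, SOL_XDP, XDP_RX_RING, &ndescs, sizeof(ndescs));
        setsockopt(fd, SOL_XDP, XDP_TX_RING, &ndescs, sizeof(ndescs));

        /* 3. Bind to a specific device and queue id; traffic only starts
         *    to flow once the bind has completed. */
        struct sockaddr_xdp addr = {
                .sxdp_family = AF_XDP,
                .sxdp_ifindex = if_nametoindex(ifname),
                .sxdp_queue_id = queue_id,
        };
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        return fd;
}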
>>>>
>>>> There is a xdpsock benchmarking/test application included. Say that you would like your UDP traffic from port 4242 to end up in queue 16, which we will enable AF_XDP on. Here, we use ethtool for this:
>>>>
>>>>   ethtool -N p3p2 rx-flow-hash udp4 fn
>>>>   ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>>>>           action 16
>>>>
>>>> Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:
>>>>
>>>>   samples/bpf/xdpsock -i p3p2 -q 16 -l -N
>>>>
>>>> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options can be displayed with "-h", as usual.
>>>>
>>>> We have run some benchmarks on a dual socket system with two Broadwell E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 cores, which gives a total of 28, but only two cores are used in these experiments: one for TX/RX and one for the user space application. The memory is DDR4 @ 2133 MT/s (1067 MHz), the size of each DIMM is 8192 MB, and with 8 of those DIMMs in the system we have 64 GB of total memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an Intel I40E 40Gbit/s using the i40e driver.
>>>>
>>>> Below are the results in Mpps of the I40E NIC benchmark runs for 64 byte packets, generated by commercial packet generator HW that is generating packets at full 40 Gbit/s line rate.
>>>>
>>>> XDP baseline numbers without this RFC:
>>>> xdp_rxq_info --action XDP_DROP   31.3 Mpps
>>>> xdp_rxq_info --action XDP_TX     16.7 Mpps
>>>>
>>>> XDP performance with this RFC, i.e. with the buffer allocator:
>>>> XDP_DROP   21.0 Mpps
>>>> XDP_TX     11.9 Mpps
>>>>
>>>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>>>> Benchmark   V2     V3     V4     V4+ZC
>>>> rxdrop      0.67   0.73   0.74   33.7
>>>> txpush      0.98   0.98   0.91   19.6
>>>> l2fwd       0.66   0.71   0.67   15.5
>>>>
>>>> AF_XDP performance (all in Mpps):
>>>> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV_ZC
>>>> rxdrop      3.3       11.6      16.9
>>>> txpush      2.2       NA*       21.8
>>>> l2fwd       1.7       NA*       10.4
>>>>
>>>
>>> Hi,
>>> I also did an evaluation of AF_XDP; however, the performance isn't as good as above. I'd like to share the result and see if there are some tuning suggestions.
>>>
>>> System:
>>> 16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
>>> Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode
>>
>> Hmmm, why is X540-AT2 not able to use XDP natively?
>>
>>> AF_XDP performance:
>>> Benchmark   XDP_SKB
>>> rxdrop      1.27 Mpps
>>> txpush      0.99 Mpps
>>> l2fwd       0.85 Mpps
>>
>> Definitely too low...
>>
>> What is the performance if you drop packets via iptables?
>>
>> Command:
>>  $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP
>>
>>> NIC configuration:
>>> The command "ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16" doesn't work on my ixgbe driver, so I use ntuple:
>>>
>>>  ethtool -K enp10s0f0 ntuple on
>>>  ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
>>> then
>>>  echo 1 > /proc/sys/net/core/bpf_jit_enable
>>>  ./xdpsock -i enp10s0f0 -r -S --queue=1
>>>
>>> I also took a look at the perf results.
>>>
>>> For rxdrop:
>>>  86.56%  xdpsock  xdpsock            [.] main
>>>   9.22%  xdpsock  [kernel.vmlinux]   [k] nmi
>>>   4.23%  xdpsock  xdpsock            [.] xq_enq
>>
>> It looks very strange that you see non-maskable interrupts (NMIs) being this high...
>>
>>> For l2fwd:
>>>  20.81%  xdpsock  xdpsock            [.] main
>>>  10.64%  xdpsock  [kernel.vmlinux]   [k] clflush_cache_range
>>
>> Oh, clflush_cache_range is being called!
>> Does your system use an IOMMU?
>
> What's the implication here? Should the IOMMU be disabled?
> I'm asking because I do see a huge difference while running pktgen tests for my performance benchmarks, with and without intel_iommu.
>
> -Tushar
For the Intel parts the IOMMU can be expensive primarily for Tx, since it should have minimal impact if the Rx pages are pinned/recycled. I am assuming the same is true here for AF_XDP; Bjorn can correct me if I am wrong.

Basically the IOMMU can make creating/destroying a DMA mapping really expensive. The easiest way to work around it in the case of the Intel IOMMU is to boot with "iommu=pt", which will create an identity mapping for the host. The downside, though, is that you then have the entire system accessible to the device unless a new mapping is created for it by assigning it to a new IOMMU domain.

Thanks.

- Alex
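(To illustrate the cost difference described above, here is a rough driver-side sketch, not taken from any of these patches, of the two Rx buffer strategies: with per-packet map/unmap, every dma_map_page()/dma_unmap_page() pair creates and destroys an IOMMU mapping, while a page that is mapped once and recycled only needs a per-packet cache sync, so the IOMMU setup cost is paid a single time.)

#include <linux/dma-mapping.h>

/* Per-packet map/unmap: with an IOMMU enabled, every map and unmap has to
 * set up and tear down an IOMMU entry, which is the expensive part. */
static void rx_one_buffer_slow(struct device *dev, struct page *page, size_t len)
{
        dma_addr_t dma = dma_map_page(dev, page, 0, len, DMA_FROM_DEVICE);

        /* ... post 'dma' to the NIC, receive the frame, process it ... */

        dma_unmap_page(dev, dma, len, DMA_FROM_DEVICE);
}

/* Map once and recycle: the page was mapped when the ring was set up and
 * the mapping stays valid for the lifetime of the buffer, so the
 * per-packet work is only a cache sync, which stays cheap under an IOMMU. */
static void rx_one_buffer_fast(struct device *dev, dma_addr_t dma, size_t len)
{
        /* Hand the buffer to the CPU to read the received frame... */
        dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);

        /* ... process the frame in place ... */

        /* ... then give the same, still mapped, buffer back to the NIC. */
        dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);
}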