On 03/26/2018 04:03 PM, Alexander Duyck wrote:
On Mon, Mar 26, 2018 at 3:54 PM, Tushar Dave <tushar.n.d...@oracle.com> wrote:


On 03/26/2018 09:38 AM, Jesper Dangaard Brouer wrote:


On Mon, 26 Mar 2018 09:06:54 -0700 William Tu <u9012...@gmail.com> wrote:

On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel <bjorn.to...@gmail.com>
wrote:

From: Björn Töpel <bjorn.to...@intel.com>

This RFC introduces a new address family called AF_XDP that is
optimized for high performance packet processing and zero-copy
semantics. Throughput improvements can be up to 20x compared to V2 and
V3 for the micro benchmarks included. Would be great to get your
feedback on it. Note that this is the follow up RFC to AF_PACKET V4
from November last year. The feedback from that RFC submission and the
presentation at NetdevConf in Seoul was to create a new address family
instead of building on top of AF_PACKET. AF_XDP is this new address
family.

The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
level is that TX and RX descriptors are separated from packet
buffers. An RX or TX descriptor points to a data buffer in a packet
buffer area. RX and TX can share the same packet buffer so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, then
the descriptor that points to that packet buffer can be changed to
point to another buffer and reused right away. This again avoids
copying data.
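To make this concrete, below is a minimal sketch of such a HW-agnostic
descriptor; the field names and layout are illustrative assumptions, not
the exact definitions from the patches.

/* Illustrative sketch only: the descriptor refers to a frame in the
 * shared packet buffer area by index/offset instead of carrying the
 * packet data itself, so RX and TX can hand the same buffer back and
 * forth, or repoint a descriptor to another buffer, without copying. */
#include <linux/types.h>

struct xdp_desc_sketch {
        __u32 idx;      /* which frame in the registered packet buffer area */
        __u32 len;      /* length of the packet data in that frame */
        __u16 offset;   /* start of the packet data within the frame */
};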

The RX and TX descriptor rings are registered with the setsockopts
XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
area is allocated by user space and registered with the kernel using
the new XDP_MEM_REG setsockopt. All these three areas are shared
between user space and kernel space. The socket is then bound with a
bind() call to a device and a specific queue id on that device, and it
is not until bind is completed that traffic starts to flow.
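As a rough user-space sketch of that sequence (error handling omitted;
SOL_XDP and the exact request structures for XDP_MEM_REG and the rings
come from the patched uapi headers and are treated as opaque here):

#include <sys/socket.h>

/* Sketch of the setup flow: create the socket, register the packet
 * buffer area and the two descriptor rings, then bind to a device and
 * queue id. Traffic only starts to flow once bind() has completed. */
static int xsk_setup_sketch(void *mem_req, socklen_t mem_len,
                            void *ring_req, socklen_t ring_len,
                            struct sockaddr *dev_queue, socklen_t addr_len)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);   /* the new address family */

        /* Register the user-space packet buffer area. */
        setsockopt(fd, SOL_XDP, XDP_MEM_REG, mem_req, mem_len);

        /* Register the RX and TX descriptor rings (same size used here). */
        setsockopt(fd, SOL_XDP, XDP_RX_RING, ring_req, ring_len);
        setsockopt(fd, SOL_XDP, XDP_TX_RING, ring_req, ring_len);

        /* Bind to a device and a specific queue id on that device. */
        return bind(fd, dev_queue, addr_len);
}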

An XDP program can be loaded to direct part of the traffic on that
device and queue id to user space through a new redirect action in an
XDP program called bpf_xdpsk_redirect that redirects a packet up to
the socket in user space. All the other XDP actions work just as
before. Note that the current RFC requires the user to load an XDP
program to get any traffic to user space (for example all traffic to
user space with the one-liner program "return
bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
this requirement and sends all traffic from a queue to user space if
an AF_XDP socket is bound to it.
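For illustration, such a one-liner program could look roughly like this
(bpf_xdpsk_redirect() is the helper introduced by this RFC; the
SEC()/license boilerplate follows the usual samples/bpf style and the
header paths are assumptions):

/* Minimal XDP program: redirect all traffic on the bound queue up to
 * the AF_XDP socket in user space. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
        return bpf_xdpsk_redirect();
}

char _license[] SEC("license") = "GPL";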

AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
is no specific mode called XDP_DRV_ZC). If the driver does not have
support for XDP, or if XDP_SKB is explicitly chosen when loading the XDP
program, XDP_SKB mode is employed. It uses SKBs together with the
generic XDP support and copies the data out to user space; this is a
fallback mode that works for any network device. On the other hand, if the
driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
ndo_xdp_flush), these NDOs, without any modifications, will be used by
the AF_XDP code to provide better performance, but there is still a
copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
driver support with the zero-copy user space allocator that provides
even better performance. In this mode, the networking HW (or SW driver
if it is a virtual driver like veth) DMAs/puts packets straight into
the packet buffer that is shared between user space and kernel
space. The RX and TX descriptor queues of the networking HW are NOT
shared to user space. Only the kernel can read and write these and it
is the kernel driver's responsibility to translate these HW specific
descriptors to the HW agnostic ones in the virtual descriptor rings
that user space sees. This way, a malicious user space program cannot
mess with the networking HW. This mode, though, requires some extensions
to XDP.
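Which of XDP_SKB and XDP_DRV you end up in is decided when the XDP
program is attached. A small sketch, assuming libbpf's
bpf_set_link_xdp_fd() and the existing XDP_FLAGS_* values from
<linux/if_link.h>:

/* Force generic (SKB-based) XDP, or insist on native driver XDP. */
#include <stdbool.h>
#include <linux/if_link.h>
#include <bpf/bpf.h>

static int attach_xdp_prog(int ifindex, int prog_fd, bool skb_mode)
{
        __u32 flags = skb_mode ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;

        return bpf_set_link_xdp_fd(ifindex, prog_fd, flags);
}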

To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
buffer pool concept so that the same XDP driver code can be used for
buffers allocated using the page allocator (XDP_DRV), the user-space
zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
allocator/cache/recycling mechanism. The ndo_bpf call has also been
extended with two commands for registering and unregistering an XSK
socket and is in the RX case mainly used to communicate some
information about the user-space buffer pool to the driver.
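A hypothetical sketch of what the driver needs to learn through that
ndo_bpf extension; only XDP_REGISTER_XSK is actually named in this cover
letter (further down), so the unregister command and the field layout
here are assumptions:

#include <linux/types.h>

/* Sketch only: registration info handed to the driver via ndo_bpf so
 * that it can serve RX from the user-space zero-copy buffer pool. */
enum xsk_ndo_command_sketch {
        XDP_REGISTER_XSK_SKETCH,
        XDP_UNREGISTER_XSK_SKETCH,
};

struct xsk_ndo_req_sketch {
        enum xsk_ndo_command_sketch command;
        void *buff_pool;        /* user-space buffer pool for this queue */
        __u16 queue_id;         /* queue the XDP socket is bound to */
};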

For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
but we ran into problems with this (further discussion in the
challenges section) and had to introduce a new NDO called
ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
and an explicit queue id that packets should be sent out on. In
contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
sent from the xdp socket (associated with the dev and queue
combination that was provided with the NDO call) using a callback
(get_tx_packet), and when they have been transmitted it uses another
callback (tx_completion) to signal completion of packets. These
callbacks are set via ndo_bpf in the new XDP_REGISTER_XSK
command. ndo_xdp_xmit_xsk is exclusively used by the XDP socket code
and thus does not clash with the XDP_REDIRECT use of
ndo_xdp_xmit. This is one of the reasons that the XDP_DRV mode
(without ZC) is currently not supported for TX. Please have a look at
the challenges section for further discussions.
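As a sketch of the shape this takes (the names ndo_xdp_xmit_xsk,
get_tx_packet and tx_completion are from the text above, but these exact
prototypes are assumptions):

struct net_device;

/* Sketch only: asynchronous TX for XDP sockets. */
struct xsk_tx_hooks_sketch {
        /* New NDO: kick TX for the XDP socket bound to (dev, queue_id). */
        int (*ndo_xdp_xmit_xsk)(struct net_device *dev, unsigned int queue_id);

        /* Set via ndo_bpf / XDP_REGISTER_XSK: the driver pulls the next
         * frame to send, and later signals how many frames completed. */
        void *(*get_tx_packet)(void *xsk, unsigned int *len);
        void (*tx_completion)(void *xsk, unsigned int npackets);
};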

The AF_XDP bind call acts on a queue pair (channel in ethtool speak),
so the user needs to steer the traffic to the zero-copy enabled queue
pair. Which queue to use is up to the user.

For an untrusted application, HW packet steering to a specific queue
pair (the one associated with the application) is a requirement, as
the application would otherwise be able to see other user space
processes' packets. If the HW cannot support the required packet
steering, XDP_DRV or XDP_SKB mode has to be used, as these modes do not
expose the NIC's packet buffer into user space as the packets are
copied into user space from the NIC's packet buffer in the kernel.

There is an xdpsock benchmarking/test application included. Say that
you would like your UDP traffic from port 4242 to end up in queue 16,
which we will enable AF_XDP on. Here, we use ethtool for this:

        ethtool -N p3p2 rx-flow-hash udp4 fn
        ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
            action 16

Running the l2fwd benchmark in XDP_DRV_ZC mode can then be done using:

        samples/bpf/xdpsock -i p3p2 -q 16 -l -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores, which gives a total of 28, but only two cores are used in these
experiments: one for TX/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192 MB; with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
byte packets, generated by commercial packet generator HW that is
generating packets at full 40 Gbit/s line rate.

XDP baseline numbers without this RFC:
xdp_rxq_info --action XDP_DROP 31.3 Mpps
xdp_rxq_info --action XDP_TX   16.7 Mpps

XDP performance with this RFC i.e. with the buffer allocator:
XDP_DROP 21.0 Mpps
XDP_TX   11.9 Mpps

AF_PACKET V4 performance from previous RFC on 4.14-rc7:
Benchmark   V2     V3     V4     V4+ZC
rxdrop      0.67   0.73   0.74   33.7
txpush      0.98   0.98   0.91   19.6
l2fwd       0.66   0.71   0.67   15.5

AF_XDP performance:
Benchmark   XDP_SKB   XDP_DRV   XDP_DRV_ZC   (all in Mpps)
rxdrop      3.3       11.6      16.9
txpush      2.2       NA*       21.8
l2fwd       1.7       NA*       10.4



Hi,
I also did an evaluation of AF_XDP; however, the performance isn't as
good as above.
I'd like to share the results and see if there are any tuning
suggestions.

System:
16 core, Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz
Intel 10G X540-AT2 ---> so I can only run XDP_SKB mode


Hmmm, why is X540-AT2 not able to use XDP natively?

AF_XDP performance:
Benchmark   XDP_SKB
rxdrop      1.27 Mpps
txpush      0.99 Mpps
l2fwd       0.85 Mpps


Definitely too low...

What is the performance if you drop packets via iptables?

Command:
   $ iptables -t raw -I PREROUTING -p udp --dport 9 --j DROP

NIC configuration:
the command
"ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 action 16"
doesn't work on my ixgbe driver, so I use ntuple:

ethtool -K enp10s0f0 ntuple on
ethtool -U enp10s0f0 flow-type udp4 src-ip 10.1.1.100 action 1
then
echo 1 > /proc/sys/net/core/bpf_jit_enable
./xdpsock -i enp10s0f0 -r -S --queue=1

I also took a look at the perf results.
For rxdrop:
   86.56%  xdpsock  xdpsock           [.] main
    9.22%  xdpsock  [kernel.vmlinux]  [k] nmi
    4.23%  xdpsock  xdpsock           [.] xq_enq


It looks very strange that you see non-maskable interrupts (NMI) being
this high...



For l2fwd:
   20.81%  xdpsock  xdpsock             [.] main
   10.64%  xdpsock  [kernel.vmlinux]    [k] clflush_cache_range


Oh, clflush_cache_range is being called!
Does your system use an IOMMU?


What's the implication here? Should the IOMMU be disabled?
I'm asking because I do see a huge difference while running pktgen tests for
my performance benchmarks, with and without intel_iommu.


-Tushar

For the Intel parts the IOMMU can be expensive primarily for Tx, since
it should have minimal impact if the Rx pages are pinned/recycled. I
am assuming the same is true here for AF_XDP; Bjorn can correct me if
I am wrong.

Indeed. The Intel IOMMU has the least effect on RX because of premap/recycle,
but TX DMA map and unmap is really expensive!


Basically the IOMMU can make creating/destroying a DMA mapping really
expensive. The easiest way to work around it in the case of the Intel
IOMMU is to boot with "iommu=pt", which will create an identity mapping
for the host. The downside, though, is that you then have the entire
system accessible to the device unless a new mapping is created for it
by assigning it to a new IOMMU domain.
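For reference, that is just a kernel boot parameter; on a GRUB-based
system, for example:

        # /etc/default/grub (regenerate the GRUB config and reboot afterwards)
        GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"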

Yeah, that's what I would say: if you really want to use the Intel IOMMU
and don't want to take a performance hit, use 'iommu=pt'.

Good to have confirmation from you Alex. Thanks.

BTW, I don't want to distract this thread with an IOMMU discussion; however,
even using 'pt' doesn't give you the same performance numbers that you
get with the Intel IOMMU disabled!

-Tushar


Thanks.

- Alex
