On 08/09/2016 12:31 PM, Or Gerlitz wrote:
On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
<bro...@redhat.com> wrote:
On Wed, 7 Sep 2016 23:55:42 +0300
Or Gerlitz <gerlitz...@gmail.com> wrote:

On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <sae...@mellanox.com> wrote:
From: Rana Shahout <ra...@mellanox.com>

Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.

When XDP is on, we change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
ensure one page per packet.

On XDP set, we fail if HW LRO is enabled and ask the user to turn it
off.  Since HW LRO is on by default on ConnectX4-LX, this will be
annoying, but we prefer not to force LRO off from the XDP set function.

Full channels reset (close/open) is required only when setting XDP
on/off.
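
As an illustration only, the set path described above might look
roughly like the sketch below; every identifier here (mlx5e_xdp_set,
priv->params, the reset/swap helpers) is an assumption made for the
sketch, not necessarily the patch's actual code.

/*
 * Rough sketch of the XDP set path described above.  All names are
 * assumptions for illustration, not necessarily the patch's code.
 */
static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
{
	struct mlx5e_priv *priv = netdev_priv(netdev);
	bool was_on = !!priv->xdp_prog;
	bool is_on = !!prog;

	/* Refuse XDP while HW LRO is on; the user must turn LRO off first. */
	if (is_on && priv->params.lro_en) {
		netdev_warn(netdev, "LRO is on, turn it off to enable XDP\n");
		return -EINVAL;
	}

	/* "Page per packet": linked-list RQ rather than striding RQ. */
	priv->params.rq_wq_type = is_on ? MLX5_WQ_TYPE_LINKED_LIST :
					  MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ;

	/* Full channel reset (close/open) only when XDP is toggled on/off. */
	if (was_on != is_on)
		return mlx5e_reset_channels(priv, prog);	/* hypothetical helper */

	/* Otherwise just exchange programs on the fly (sketched further below). */
	return mlx5e_xdp_swap_prog(priv, prog);			/* hypothetical helper */
}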

When XDP set is called just to exchange programs, we update each RQ's
xdp program on the fly.  To synchronize with the current RX data-path
activity of that RQ, we temporarily disable the RQ and ensure the RX
path is not running, then quickly update and re-enable it.  For that we
do (see the sketch right after this list):
         - rq.state = disabled
         - napi_synchronize
         - xchg(rq->xdp_prog)
         - rq.state = enabled
         - napi_schedule // Just in case we've missed an IRQ
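
A schematic sketch of that sequence; the flag and field names
(MLX5E_RQ_STATE_ENABLED, priv->channel[i], rq->xdp_prog) are
assumptions for illustration, and bpf_prog reference counting is
omitted for brevity:

static int mlx5e_xdp_swap_prog(struct mlx5e_priv *priv, struct bpf_prog *prog)
{
	int i;

	for (i = 0; i < priv->params.num_channels; i++) {
		struct mlx5e_channel *c = priv->channel[i];
		struct mlx5e_rq *rq = &c->rq;

		/* Stop the RX path of this RQ and wait for any in-flight
		 * NAPI poll to finish before touching rq->xdp_prog.
		 */
		clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);
		napi_synchronize(&c->napi);

		/* Atomically publish the new program to the RX path. */
		xchg(&rq->xdp_prog, prog);

		/* Re-enable the RQ and kick NAPI in case an IRQ was missed
		 * while the RQ was disabled.
		 */
		set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);
		napi_schedule(&c->napi);
	}

	return 0;
}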

Packet rate performance testing was done with pktgen sending 64B
packets on the TX side, comparing a TC drop action on the RX side
against XDP fast drop.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
         1. Baseline, Before this patch with TC drop action
         2. This patch with TC drop action
         3. This patch with XDP RX fast drop

Streams    Baseline (TC drop)   TC drop (this patch)   XDP fast drop
--------------------------------------------------------------
1           5.51Mpps            5.14Mpps     13.5Mpps

This (13.5 Mpps) is less than 50% of the result we presented at the
XDP summit, which was obtained by Rana.  Please see if/how much this
grows if you use more sender threads, but have all of them xmit the
same stream/flows, so we're on one ring.  That (XDP with a single RX
ring getting packets from N remote TX rings) would be your canonical
baseline for any further numbers.

Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
that you should be able to reach 23Mpps on a single CPU.  This is
an XDP-drop simulation with order-0 pages being recycled through my
page_pool code, plus avoiding the cache misses (notice you are using a
CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).

So this takes us from 13M to 23M, good.

Could you explain why the move from order-3 to order-0 hurts
performance so much (a drop from 32M to 23M)?  Is there any way we can
overcome that?

The issue is not moving from high-order to order-0.  It's moving from
Striding RQ to non-Striding RQ without a page-reuse mechanism (not a
cache).  In the current memory scheme, each 64B packet consumes a whole
4K page, including its allocation/release (from the cache in this case,
but still...).  I believe that once we implement page-reuse for the
non-Striding RQ we'll hit 32M PPS again.
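
For illustration, the kind of per-RQ page-reuse this refers to could
look roughly like the sketch below; all names here (mlx5e_page_cache,
mlx5e_rx_cache_put/get, MLX5E_RX_CACHE_SIZE) are hypothetical.

/* Illustrative per-RQ page recycle ring: reuse pages instead of
 * allocating/freeing one 4K page per 64B packet.
 */
#define MLX5E_RX_CACHE_SIZE 128		/* arbitrary for this sketch */

struct mlx5e_page_cache {
	struct page *pages[MLX5E_RX_CACHE_SIZE];
	unsigned int head;
};

static bool mlx5e_rx_cache_put(struct mlx5e_page_cache *c, struct page *page)
{
	if (c->head == MLX5E_RX_CACHE_SIZE)
		return false;		/* cache full, caller frees the page */

	/* Only recycle pages nobody else still references. */
	if (page_ref_count(page) != 1)
		return false;

	c->pages[c->head++] = page;
	return true;
}

static struct page *mlx5e_rx_cache_get(struct mlx5e_page_cache *c)
{
	if (!c->head)
		return NULL;		/* empty, caller falls back to alloc_page() */

	return c->pages[--c->head];
}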

The 23Mpps number looks like some HW limitation, as the increase is
not proportional to the page-allocator overhead I removed (and CPU freq
starts to decrease).  I also did scaling tests to more CPUs, which
showed it scaled up to 40Mpps (you reported 45M).  And at the Phy RX
level I see 60Mpps (50G max is 74Mpps).

Not HW, I think.  As I said, Rana got 32M with striding RQ when she was
using order-3 (or did we use order-5?)

order-5.

_______________________________________________
iovisor-dev mailing list
iovisor-dev@lists.iovisor.org
https://lists.iovisor.org/mailman/listinfo/iovisor-dev
