On Thu, Sep 8, 2016 at 12:53 AM, Saeed Mahameed <sae...@dev.mellanox.co.il> wrote:
> On Wed, Sep 7, 2016 at 11:55 PM, Or Gerlitz via iovisor-dev
> <iovisor-...@lists.iovisor.org> wrote:
>> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <sae...@mellanox.com> wrote:
>>> From: Rana Shahout <ra...@mellanox.com>
>>>
>>> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in the mlx5e driver.
>>>
>>> When XDP is on, we make sure to change the channels' RQ type to
>>> MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
>>> ensure "page per packet".
>>>
>>> On XDP set, we fail if HW LRO is on and request the user to turn it
>>> off. Since on ConnectX4-LX HW LRO is on by default, this will be
>>> annoying, but we prefer not to force LRO off from the XDP set function.
>>>
>>> A full channels reset (close/open) is required only when setting XDP
>>> on/off.
>>>
>>> When XDP set is called just to exchange programs, we update each
>>> RQ's xdp program on the fly. To synchronize with the current RX
>>> data path activity of that RQ, we temporarily disable the RQ and
>>> ensure the RX path is not running, then quickly update and
>>> re-enable the RQ. For that we do:
>>> - rq.state = disabled
>>> - napi_synchronize
>>> - xchg(rq->xdp_prg)
>>> - rq.state = enabled
>>> - napi_schedule // Just in case we've missed an IRQ
>>>
>>> Packet rate performance testing was done with pktgen, 64B packets on
>>> the TX side, and TC drop action on the RX side compared to XDP fast
>>> drop.
>>>
>>> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>>>
>>> Comparison is done between:
>>> 1. Baseline, before this patch, with TC drop action
>>> 2. This patch with TC drop action
>>> 3. This patch with XDP RX fast drop
>>>
>>> Streams    Baseline(TC drop)    TC drop     XDP fast Drop
>>> --------------------------------------------------------------
>>> 1          5.51Mpps             5.14Mpps    13.5Mpps
>>
>> This (13.5M PPS) is less than 50% of the result we presented @ the
>> XDP summit, which was obtained by Rana. Please see if/how much this
>> grows if you use more sender threads, but have all of them xmit the
>> same stream/flows, so we're on one ring. That (XDP with a single RX
>> ring getting packets from N remote TX rings) would be your canonical
>> baseline for any further numbers.
>>
>
> I used N TX senders sending 48Mpps to a single RX core.
> The single RX core could handle only 13.5Mpps.
>
> The implementation here is different from the one we presented at the
> summit; before, it was with striding RQ, now it is a regular
> linked-list RQ. (A striding RQ ring can handle 32K 64B packets while
> a regular RQ ring handles only 1K.)
>
> In striding RQ we register only 16 HW descriptors for every 32K
> packets, i.e. for every 32K packets we access the HW only 16 times.
> On the other hand, a regular RQ will access the HW (register
> descriptors) once per packet, i.e. we write to the HW 1K times for 1K
> packets (roughly one HW write per 2K packets vs. one per packet). I
> think this explains the difference. The catch here is that we can't
> use striding RQ for XDP, bummer!

yep, sounds like a bum bum bum (we went from >30M PPS to 13.5M PPS).
We used striding RQ for XDP in the previous implementation, and I
don't see a real deep reason not to do so here as well, now that
striding RQ doesn't use compound pages any more. I guess there are
more details I need to catch up with here, but the bottom-line result
is not good and we need to re-think.
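BTW, just to make sure I read the per-RQ program exchange described in
the commit message correctly, below is roughly the sequence as I
understand it. This is only a minimal sketch, not the actual driver
code; the struct/field names (mlx5e_rq, rq->xdp_prog, rq->channel->napi,
MLX5E_RQ_STATE_ENABLED) are my guesses for illustration:

#include <linux/bpf.h>        /* struct bpf_prog, bpf_prog_put() */
#include <linux/netdevice.h>  /* napi_synchronize(), napi_schedule() */

/* Sketch of the swap sequence from the commit message; rq layout and
 * the state bit name are assumptions, not the real mlx5e definitions. */
static void mlx5e_rq_swap_xdp_prog(struct mlx5e_rq *rq, struct bpf_prog *prog)
{
	struct bpf_prog *old_prog;

	/* rq.state = disabled: keep the RX path from polling this RQ */
	clear_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);

	/* wait for any in-flight NAPI poll on this RQ to finish */
	napi_synchronize(&rq->channel->napi);

	/* exchange programs and drop the reference on the old one */
	old_prog = xchg(&rq->xdp_prog, prog);
	if (old_prog)
		bpf_prog_put(old_prog);

	/* rq.state = enabled, then reschedule in case we missed an IRQ */
	set_bit(MLX5E_RQ_STATE_ENABLED, &rq->state);
	napi_schedule(&rq->channel->napi);
}

If that's the idea, it looks sane to me; the napi_schedule at the end
covers the window where an IRQ fired while the RQ was disabled.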
> As I said, we will have the full and final performance results in V1.
> This is just an RFC with only quick and dirty testing.

Yep, understood. But in parallel, you need to reconsider how to get
along without that drop in the numbers.

Or.