On 26/08/2025 01.00, Jacob Keller wrote:
> XDP_DROP performance has been tested for this version, thanks to work from
> Michal Kubiak. The results are quite promising, with 3 versions being
> compared:
>
>  * baseline net-next tree
>  * v1 applied
>  * v2 applied
>
> Michal said:
>
>   I ran the XDP_DROP performance comparison tests on my setup in the way I
>   usually do. I didn't have pktgen configured on my link partner, but I
>   used 6 instances of xdpsock running in Tx-only mode to generate
>   high-bandwidth traffic. Also, I tried to replicate the conditions
>   according to Jesper's description, making sure that all the traffic is
>   directed to a single Rx queue and one CPU is 100% loaded.
Thank you for replicating the test setup. Using xdpsock as a traffic
generator is fine, as long as we make sure that the generator's TX speed
exceeds the Device Under Test's RX XDP_DROP speed. It is also important
for the test that packets hit a single RX queue and that we verify one CPU
is 100% loaded, as you describe.

As a reminder, the pktgen kernel module comes with ready-to-use sample
shell scripts [1].

[1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen
> The performance hit from v1 is replicated, and shown to be gone in v2, with
> our results even showing an increase over baseline instead of a drop. I've
> included the relative packets-per-second deltas compared against a baseline
> test with neither v1 nor v2 applied.
Thanks for also replicating the performance hit from v1, as I reported in
[2].

To Michal: what CPU did you use?
 - I used an AMD EPYC 9684X (with SRSO=IBPB).

One of the reasons I saw a larger percentage drop is that this CPU doesn't
have DDIO/DCA, which delivers the packet directly into the L3 cache (so an
L2 cache miss that hits in L3 obviously takes less time than a miss that
has to go all the way to main memory). (Details: newer AMD CPUs will get
something called PCIe TLP Processing Hints (TPH), which resembles DDIO.)

The point is that I see some opportunities in the driver to move some of
the prefetches earlier. But we want to make sure this benefits both CPU
types, and I can test on the AMD platform. (This CPU makes up a large part
of our fleet, so it makes sense for us to optimize for it.)

[2] https://lore.kernel.org/netdev/[email protected]/
> baseline to v1, no-touch:  -8,387,677 packets per second (17%) decrease.
> baseline to v2, no-touch:  +4,057,000 packets per second (8%) increase!
> baseline to v1, read data: -411,709 packets per second (1%) decrease.
> baseline to v2, read data: +4,331,857 packets per second (11%) increase!
Thanks for providing these numbers.

I would also like to know the absolute throughput numbers (packets per
second) before and after, as this allows me to calculate the nanosecond
difference. Percentages are usually useful, but they can be misleading when
dealing with XDP_DROP speeds, because a small nanosecond change gets
"magnified" too much.
> ---
> Changes in v2:
> - Only access shared info for fragmented frames
> - Link to v1:
>   https://lore.kernel.org/netdev/[email protected]/
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------
>  2 files changed, 34 insertions(+), 47 deletions(-)
Acked-by: Jesper Dangaard Brouer <[email protected]>