Hello,

Recently I tried to use the bulk RX function to reduce the CPU usage of rte_eth_rx_burst. However, application performance with i40e_recv_pkts_bulk_alloc was significantly worse than with i40e_recv_pkts (3M fewer PPS, 0.5 IPC on the receiving core).

A quick investigation revealed two problems:
- The first payload cache line is prefetched in i40e_recv_pkts but not in i40e_recv_pkts_bulk_alloc.
- Only the first cache line of the next mbuf is prefetched during mbuf init in i40e_rx_alloc_bufs. This causes a cache miss when the 'next' field, which sits in mbuf cacheline1, is set to NULL.

Fixing these two small issues significantly reduced the CPU time spent in rte_eth_rx_burst and improved PPS compared to both the original i40e_recv_pkts_bulk_alloc and i40e_recv_pkts.

Regards,

Vladyslav Buslov (1):
  net/i40e: add additional prefetch instructions for bulk rx

 drivers/net/i40e/i40e_rxtx.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

-- 
2.8.3
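
For readers without the driver source at hand, the two prefetch patterns described above can be sketched outside DPDK roughly as follows. This is only an illustration, not the patch itself: struct demo_mbuf, DEMO_CACHE_LINE, DEMO_HEADROOM and all demo_* functions are hypothetical stand-ins (the real code uses struct rte_mbuf and rte_prefetch0), a 64-byte cache line is assumed, and __builtin_prefetch assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64   /* assumed cache line size */
#define DEMO_HEADROOM   128  /* assumed offset of packet data in the buffer */

/* Hypothetical two-cache-line buffer header; NOT the real struct rte_mbuf. */
struct demo_mbuf {
    /* cache line 0: fields written on every receive */
    void *buf_addr;
    uint16_t data_off;
    uint16_t nb_segs;
    uint32_t pkt_len;
    /* cache line 1: the chain pointer lives here */
    struct demo_mbuf *next __attribute__((aligned(DEMO_CACHE_LINE)));
} __attribute__((aligned(DEMO_CACHE_LINE)));

_Static_assert(offsetof(struct demo_mbuf, next) == DEMO_CACHE_LINE,
               "next must sit in the second cache line");

/* Stand-in for rte_prefetch0: a hint to pull the line into all cache levels. */
static inline void demo_prefetch(const void *p)
{
    __builtin_prefetch(p, 0, 3);
}

/* Address of the first payload byte of a buffer. */
static inline void *demo_payload(const struct demo_mbuf *m)
{
    return (char *)m->buf_addr + m->data_off;
}

/*
 * Problem 2: while initializing buffer i, prefetch BOTH cache lines of
 * buffer i + 1, so that the later store to 'next' does not miss.
 */
static void demo_init_bufs(struct demo_mbuf **bufs, size_t n)
{
    assert(bufs != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n) {
            const char *nxt = (const char *)bufs[i + 1];
            demo_prefetch(nxt);                    /* cache line 0 */
            demo_prefetch(nxt + DEMO_CACHE_LINE);  /* cache line 1 */
        }
        struct demo_mbuf *m = bufs[i];
        m->data_off = DEMO_HEADROOM;
        m->nb_segs = 1;
        m->pkt_len = 0;
        m->next = NULL;  /* this store touches cache line 1 */
    }
}

/*
 * Problem 1: when handing a received buffer to the application, prefetch
 * the first payload cache line so that header parsing finds it in cache.
 */
static void demo_prefetch_payload(const struct demo_mbuf *m)
{
    demo_prefetch(demo_payload(m));
}
```

A prefetch is only a hint and never changes program results, so whether it helps has to be confirmed by measurement (PPS, IPC, cache-miss counters) on the target hardware.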