> -----Original Message----- > From: Intel-wired-lan <[email protected]> On Behalf Of Jacob > Keller > Sent: 26 August 2025 04:30 > To: Kubiak, Michal <[email protected]>; Nguyen, Anthony L > <[email protected]>; Intel Wired LAN > <[email protected]>; [email protected] > Cc: Christoph Petrausch <[email protected]>; Jesper Dangaard > Brouer <[email protected]>; Jaroslav Pulchart <[email protected]>; > Keller, Jacob E <[email protected]> > Subject: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on > multi-buffer frames > > The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each > buffer in the current frame. This function was introduced as part of handling > multi-buffer XDP support in the ice driver. > > It works by iterating over the buffers from first_desc up to 1 plus the total > number of fragments in the frame, cached from before the XDP program was > executed. > > If the hardware posts a descriptor with a size of 0, the logic used in > ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added as > fragments in ice_add_xdp_frag. Since the buffer isn't counted as a fragment, > we do not iterate over it in ice_put_rx_mbuf(), and thus we don't call > ice_put_rx_buf(). > > Because we don't call ice_put_rx_buf(), we don't attempt to re-use the page > or free it. This leaves a stale page in the ring, as we don't increment > next_to_alloc. > > The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented > properly, and that it always points to a buffer with a NULL page. Since this > function doesn't check, it will happily recycle a page over the top of the > next_to_alloc buffer, losing track of the old page. > > Note that this leak only occurs for multi-buffer frames. The > ice_put_rx_mbuf() function always handles at least one buffer, so a > single-buffer frame will always get handled correctly. It is not clear > precisely why the hardware hands us descriptors with a size of 0 sometimes, > but it happens somewhat regularly with "jumbo frames" used by 9K MTU. > > To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on > all buffers between first_desc and next_to_clean. Borrow the logic of a > similar function in i40e used for this same purpose. Use the same logic also > in ice_get_pgcnts(). > > Instead of iterating over just the number of fragments, use a loop which > iterates until the current index reaches to the next_to_clean element just > past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does call > ice_put_rx_buf() on the last buffer of the frame indicating the end of packet. > > For non-linear (multi-buffer) frames, we need to take care when adjusting the > pagecnt_bias. An XDP program might release fragments from the tail of the > frame, in which case that fragment page is already released. Only update the > pagecnt_bias for the first descriptor and fragments still remaining post-XDP > program. Take care to only access the shared info for fragmented buffers, as > this avoids a significant cache miss. > > The xdp_xmit value only needs to be updated if an XDP program is run, and > only once per packet. Drop the xdp_xmit pointer argument from > ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function > directly. This avoids needing to pass the argument and avoids an extra > bit-wise OR for each buffer in the frame. > > Move the increment of the ntc local variable to ensure its updated *before* > all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic > requires the index of the element just after the current frame. > > Now that we use an index pointer in the ring to identify the packet, we no > longer need to track or cache the number of fragments in the rx_ring. > > Cc: Christoph Petrausch <[email protected]> > Cc: Jesper Dangaard Brouer <[email protected]> > Reported-by: Jaroslav Pulchart <[email protected]> > Closes: > https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/ > Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current > frame") > Tested-by: Michal Kubiak <[email protected]> > Signed-off-by: Jacob Keller <[email protected]> > --- > I've tested this in a setup with MTU 9000, using a combination of iperf3 and > wrk generated traffic. > > I tested this in a couple of ways. First, I check memory allocations using > /proc/allocinfo: > > awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo > | numfmt --to=iec > > Second, I ported some stats from i40e written by Joe Damato to track the page > allocation and busy counts. I consistently saw that the allocate stat > increased without the busy or waive stats increasing. I also added a stat to > track directly when we overwrote a page pointer that was non-NULL in > ice_reuse_rx_page(), and saw it increment consistently. > > With this fix, all of these indicators are fixed. I've tested both 1500 byte > and 9000 byte MTU and no longer see the leak. With the counters I was able to > immediately see a leak within a few minutes of iperf3, so I am confident that > I've resolved the leak with this fix. > > I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR > work properly with the fix for updating xdp_xmit. > > This version (v2) avoids the cache miss regression reported by Jesper. I > refactored a bit to only check the shared info if the XDP buffer is > fragmented. I considered adding a helper function to do this to the XDP > header file. However, I scanned several drivers and noticed that only ice and > i40e access the nr_frags in this way. The ice variation I believe will be > removed with the conversion to page pool, so I don't think the addition of a > helper is warranted. > > XDP_DROP performance has been tested for this version, thanks to work from > Michal Kubiak. The results are quite promising, with 3 versions being > compared: > > * baseline net-next tree > * v1 applied > * v2 applied > > Michal said: > > I run the XDP_DROP performance comparison tests on my setup in the way I > usually do. I didn't have the pktgen configured on my link partner, but I > used 6 instances of the xdpsock running in Tx-only mode to generate > high-bandwith traffic. Also, I tried to replicate the conditions according > to Jesper's description, making sure that all the traffic is directed to a > single Rx queue and one CPU is 100% loaded. > > The performance hit from v1 is replicated, and shown to be gone in v2, with > our results showing even an increase compared to baseline instead of a drop. > I've included the relative packet per second deltas compared against a > baseline test with neither v1 or v2. > > baseline to v1, no-touch: > -8,387,677 packets per second (17%) decrease. > > baseline to v2, no-touch: > +4,057,000 packets per second (8%) increase! > > baseline to v1, read data: > -411,709 packets per second (1%) decrease. > > baseline to v2, read data: > +4,331,857 packets per second (11%) increase! > --- > Changes in v2: > - Only access shared info for fragmented frames > - Link to v1: > https://lore.kernel.org/netdev/[email protected]/ > --- > drivers/net/ethernet/intel/ice/ice_txrx.h | 1 - > drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +++++++++++++------------------ > 2 files changed, 34 insertions(+), 47 deletions(-) >
Tested-by: Rinitha S <[email protected]> (A Contingent worker at Intel)
