On Fri, Jul 11, 2025 at 05:23:49PM -0700, Jacob Keller wrote: > The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each > buffer in the current frame. This function was introduced as part of > handling multi-buffer XDP support in the ice driver. > > It works by iterating over the buffers from first_desc up to 1 plus the > total number of fragments in the frame, cached from before the XDP program > was executed. > > If the hardware posts a descriptor with a size of 0, the logic used in > ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added > as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a > fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't > call ice_put_rx_buf(). > > Because we don't call ice_put_rx_buf(), we don't attempt to re-use the > page or free it. This leaves a stale page in the ring, as we don't > increment next_to_alloc. > > The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented > properly, and that it always points to a buffer with a NULL page. Since > this function doesn't check, it will happily recycle a page over the top > of the next_to_alloc buffer, losing track of the old page. > > Note that this leak only occurs for multi-buffer frames. The > ice_put_rx_mbuf() function always handles at least one buffer, so a > single-buffer frame will always get handled correctly. It is not clear > precisely why the hardware hands us descriptors with a size of 0 sometimes, > but it happens somewhat regularly with "jumbo frames" used by 9K MTU. > > To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on > all buffers between first_desc and next_to_clean. Borrow the logic of a > similar function in i40e used for this same purpose. Use the same logic > also in ice_get_pgcnts(). > > Instead of iterating over just the number of fragments, use a loop which > iterates until the current index reaches to the next_to_clean element just > past the current frame. Check the current number of fragments (post XDP > program). For all buffers up 1 more than the number of fragments, we'll > update the pagecnt_bias. For any buffers past this, pagecnt_bias is left > as-is. This ensures that fragments released by the XDP program, as well as > any buffers with zero-size won't have their pagecnt_bias updated > incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call > ice_put_rx_buf() on the last buffer of the frame indicating end of packet. > > The xdp_xmit value only needs to be updated if an XDP program is run, and > only once per packet. Drop the xdp_xmit pointer argument from > ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function > directly. This avoids needing to pass the argument and avoids an extra > bit-wise OR for each buffer in the frame. > > Move the increment of the ntc local variable to ensure its updated *before* > all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic > requires the index of the element just after the current frame. > > This has the advantage that we also no longer need to track or cache the > number of fragments in the rx_ring, which saves a few bytes in the ring. > > Cc: Christoph Petrausch <[email protected]> > Reported-by: Jaroslav Pulchart <[email protected]> > Closes: > https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/ > Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current > frame") > Signed-off-by: Jacob Keller <[email protected]>
I have finally reviewed this and tested with xdpsock in copy mode running custom bpf program that consumes 4500 bytes via bpf_xdp_adjust_tail so that the frag consumption by bpf prog is triggered. I don't see any alarming issues that usually popped up in /proc/meminfo. Reviewed-by: Maciej Fijalkowski <[email protected]> Tested-by: Maciej Fijalkowski <[email protected]> once again thank you Jake! > --- > I've tested this in a setup with MTU 9000, using a combination of iperf3 > and wrk generated traffic. > > I tested this in a couple of ways. First, I check memory allocations using > /proc/allocinfo: > > awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo > | numfmt --to=iec > > Second, I ported some stats from i40e written by Joe Damato to track the > page allocation and busy counts. I consistently saw that the allocate stat > increased without the busy or waive stats increasing. I also added a stat > to track directly when we overwrote a page pointer that was non-NULL in > ice_reuse_rx_page(), and saw it increment consistently. > > With this fix, all of these indicators are fixed. I've tested both 1500 > byte and 9000 byte MTU and no longer see the leak. With the counters I was > able to immediately see a leak within a few minutes of iperf3, so I am > confident that I've resolved the leak with this fix. > > I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIR work > properly with the fix for updating xdp_xmit. > --- > Changes in v2: > - Fix XDP Tx/Redirect (Thanks Maciej!) > - Link to v1: > https://lore.kernel.org/r/[email protected] > --- > drivers/net/ethernet/intel/ice/ice_txrx.h | 1 - > drivers/net/ethernet/intel/ice/ice_txrx.c | 81 > +++++++++++++------------------ > 2 files changed, 33 insertions(+), 49 deletions(-) > > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h > b/drivers/net/ethernet/intel/ice/ice_txrx.h > index a4b1e9514632..07155e615f75 100644 > --- a/drivers/net/ethernet/intel/ice/ice_txrx.h > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h > @@ -358,7 +358,6 @@ struct ice_rx_ring { > struct ice_tx_ring *xdp_ring; > struct ice_rx_ring *next; /* pointer to next ring in q_vector */ > struct xsk_buff_pool *xsk_pool; > - u32 nr_frags; > u16 max_frame; > u16 rx_buf_len; > dma_addr_t dma; /* physical address of ring */ > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c > b/drivers/net/ethernet/intel/ice/ice_txrx.c > index 0e5107fe62ad..1c5a2fa7ea86 100644 > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c > @@ -865,10 +865,6 @@ ice_add_xdp_frag(struct ice_rx_ring *rx_ring, struct > xdp_buff *xdp, > __skb_fill_page_desc_noacc(sinfo, sinfo->nr_frags++, rx_buf->page, > rx_buf->page_offset, size); > sinfo->xdp_frags_size += size; > - /* remember frag count before XDP prog execution; bpf_xdp_adjust_tail() > - * can pop off frags but driver has to handle it on its own > - */ > - rx_ring->nr_frags = sinfo->nr_frags; > > if (page_is_pfmemalloc(rx_buf->page)) > xdp_buff_set_frag_pfmemalloc(xdp); > @@ -939,20 +935,20 @@ ice_get_rx_buf(struct ice_rx_ring *rx_ring, const > unsigned int size, > /** > * ice_get_pgcnts - grab page_count() for gathered fragments > * @rx_ring: Rx descriptor ring to store the page counts on > + * @ntc: the next to clean element (not included in this frame!) > * > * This function is intended to be called right before running XDP > * program so that the page recycling mechanism will be able to take > * a correct decision regarding underlying pages; this is done in such > * way as XDP program can change the refcount of page > */ > -static void ice_get_pgcnts(struct ice_rx_ring *rx_ring) > +static void ice_get_pgcnts(struct ice_rx_ring *rx_ring, unsigned int ntc) > { > - u32 nr_frags = rx_ring->nr_frags + 1; > u32 idx = rx_ring->first_desc; > struct ice_rx_buf *rx_buf; > u32 cnt = rx_ring->count; > > - for (int i = 0; i < nr_frags; i++) { > + while (idx != ntc) { > rx_buf = &rx_ring->rx_buf[idx]; > rx_buf->pgcnt = page_count(rx_buf->page); > > @@ -1125,62 +1121,48 @@ ice_put_rx_buf(struct ice_rx_ring *rx_ring, struct > ice_rx_buf *rx_buf) > } > > /** > - * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all frame frags > + * ice_put_rx_mbuf - ice_put_rx_buf() caller, for all buffers in frame > * @rx_ring: Rx ring with all the auxiliary data > * @xdp: XDP buffer carrying linear + frags part > - * @xdp_xmit: XDP_TX/XDP_REDIRECT verdict storage > - * @ntc: a current next_to_clean value to be stored at rx_ring > + * @ntc: the next to clean element (not included in this frame!) > * @verdict: return code from XDP program execution > * > - * Walk through gathered fragments and satisfy internal page > - * recycle mechanism; we take here an action related to verdict > - * returned by XDP program; > + * Called after XDP program is completed, or on error with verdict set to > + * ICE_XDP_CONSUMED. > + * > + * Walk through buffers from first_desc to the end of the frame, releasing > + * buffers and satisfying internal page recycle mechanism. The action depends > + * on verdict from XDP program. > */ > static void ice_put_rx_mbuf(struct ice_rx_ring *rx_ring, struct xdp_buff > *xdp, > - u32 *xdp_xmit, u32 ntc, u32 verdict) > + u32 ntc, u32 verdict) > { > - u32 nr_frags = rx_ring->nr_frags + 1; > + u32 nr_frags = xdp_get_shared_info_from_buff(xdp)->nr_frags; > u32 idx = rx_ring->first_desc; > u32 cnt = rx_ring->count; > - u32 post_xdp_frags = 1; > struct ice_rx_buf *buf; > - int i; > + int i = 0; > > - if (unlikely(xdp_buff_has_frags(xdp))) > - post_xdp_frags += xdp_get_shared_info_from_buff(xdp)->nr_frags; > - > - for (i = 0; i < post_xdp_frags; i++) { > + while (idx != ntc) { > buf = &rx_ring->rx_buf[idx]; > + if (++idx == cnt) > + idx = 0; > > - if (verdict & (ICE_XDP_TX | ICE_XDP_REDIR)) { > + /* An XDP program could release fragments from the end of the > + * buffer. For these, we need to keep the pagecnt_bias as-is. > + * To do this, only adjust pagecnt_bias for fragments up to > + * the total remaining after the XDP program has run. > + */ > + if (verdict != ICE_XDP_CONSUMED) > ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz); > - *xdp_xmit |= verdict; > - } else if (verdict & ICE_XDP_CONSUMED) { > + else if (i++ <= nr_frags) > buf->pagecnt_bias++; > - } else if (verdict == ICE_XDP_PASS) { > - ice_rx_buf_adjust_pg_offset(buf, xdp->frame_sz); > - } > > ice_put_rx_buf(rx_ring, buf); > - > - if (++idx == cnt) > - idx = 0; > - } > - /* handle buffers that represented frags released by XDP prog; > - * for these we keep pagecnt_bias as-is; refcount from struct page > - * has been decremented within XDP prog and we do not have to increase > - * the biased refcnt > - */ > - for (; i < nr_frags; i++) { > - buf = &rx_ring->rx_buf[idx]; > - ice_put_rx_buf(rx_ring, buf); > - if (++idx == cnt) > - idx = 0; > } > > xdp->data = NULL; > rx_ring->first_desc = ntc; > - rx_ring->nr_frags = 0; > } > > /** > @@ -1260,6 +1242,10 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int > budget) > /* retrieve a buffer from the ring */ > rx_buf = ice_get_rx_buf(rx_ring, size, ntc); > > + /* Increment ntc before calls to ice_put_rx_mbuf() */ > + if (++ntc == cnt) > + ntc = 0; > + > if (!xdp->data) { > void *hard_start; > > @@ -1268,24 +1254,23 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int > budget) > xdp_prepare_buff(xdp, hard_start, offset, size, > !!offset); > xdp_buff_clear_frags_flag(xdp); > } else if (ice_add_xdp_frag(rx_ring, xdp, rx_buf, size)) { > - ice_put_rx_mbuf(rx_ring, xdp, NULL, ntc, > ICE_XDP_CONSUMED); > + ice_put_rx_mbuf(rx_ring, xdp, ntc, ICE_XDP_CONSUMED); > break; > } > - if (++ntc == cnt) > - ntc = 0; > > /* skip if it is NOP desc */ > if (ice_is_non_eop(rx_ring, rx_desc)) > continue; > > - ice_get_pgcnts(rx_ring); > + ice_get_pgcnts(rx_ring, ntc); > xdp_verdict = ice_run_xdp(rx_ring, xdp, xdp_prog, xdp_ring, > rx_desc); > if (xdp_verdict == ICE_XDP_PASS) > goto construct_skb; > total_rx_bytes += xdp_get_buff_len(xdp); > total_rx_pkts++; > > - ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict); > + ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict); > + xdp_xmit |= xdp_verdict & (ICE_XDP_TX | ICE_XDP_REDIR); > > continue; > construct_skb: > @@ -1298,7 +1283,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int > budget) > rx_ring->ring_stats->rx_stats.alloc_page_failed++; > xdp_verdict = ICE_XDP_CONSUMED; > } > - ice_put_rx_mbuf(rx_ring, xdp, &xdp_xmit, ntc, xdp_verdict); > + ice_put_rx_mbuf(rx_ring, xdp, ntc, xdp_verdict); > > if (!skb) > break; > > --- > base-commit: 31ec70afaaad11fb08970bd1b0dc9ebae3501e16 > change-id: 20250708-jk-ice-fix-rx-mem-leak-3a328edc4797 > > Best regards, > -- > Jacob Keller <[email protected]> >
