Re: [Intel-wired-lan] [PATCH iwl-net] ice: fix Rx page leak on multi-buffer frames

2025-07-11 Thread Jacob Keller


On 7/11/2025 8:33 AM, Maciej Fijalkowski wrote:
> On Wed, Jul 09, 2025 at 12:07:30PM -0700, Jacob Keller wrote:
>> The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
>> buffer in the current frame. This function was introduced as part of
>> handling multi-buffer XDP support in the ice driver.
>>
>> It works by iterating over the buffers from first_desc up to 1 plus the
>> total number of fragments in the frame, cached from before the XDP program
>> was executed.
>>
>> If the hardware posts a descriptor with a size of 0, the logic used in
>> ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
>> as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
>> fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
>> call ice_put_rx_buf().
>>
>> Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
>> page or free it. This leaves a stale page in the ring, as we don't
>> increment next_to_alloc.
>>
>> The ice_reuse_rx_page() function assumes that next_to_alloc has been
>> incremented
>> properly, and that it always points to a buffer with a NULL page. Since
>> this function doesn't check, it will happily recycle a page over the top
>> of the next_to_alloc buffer, losing track of the old page.
>>
>> Note that this leak only occurs for multi-buffer frames. The
>> ice_put_rx_mbuf() function always handles at least one buffer, so a
>> single-buffer frame will always get handled correctly. It is not clear
>> precisely why the hardware hands us descriptors with a size of 0 sometimes,
>> but it happens somewhat regularly with the "jumbo frames" used at 9K MTU.
>>
>> To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
>> all buffers between first_desc and next_to_clean. Borrow the logic of a
>> similar function in i40e used for this same purpose. Use the same logic
>> also in ice_get_pgcnts().
>>
>> Instead of iterating over just the number of fragments, use a loop which
>> iterates until the current index reaches the next_to_clean element just
>> past the current frame. Check the current number of fragments (post XDP
>> program). For all buffers up to 1 more than the number of fragments, we'll
>> update the pagecnt_bias. For any buffers past this, pagecnt_bias is left
>> as-is. This ensures that fragments released by the XDP program, as well as
>> any buffers with zero-size won't have their pagecnt_bias updated
>> incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
>> ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
>>
>> Move the increment of the ntc local variable to ensure it is updated *before*
>> all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
>> requires the index of the element just after the current frame.
>>
>> This has the advantage that we also no longer need to track or cache the
>> number of fragments in the rx_ring, which saves a few bytes in the ring.
>>
>> Cc: Christoph Petrausch 
>> Reported-by: Jaroslav Pulchart 
>> Closes: https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/
>> Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
>> Signed-off-by: Jacob Keller 
>> ---
>> I've tested this in a setup with MTU 9000, using a combination of iperf3
>> and wrk generated traffic.
>>
>> I tested this in a couple of ways. First, I checked memory allocations using
>> /proc/allocinfo:
>>
>>   awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo | numfmt --to=iec
>>
>> Second, I ported some stats from i40e written by Joe Damato to track the
>> page allocation and busy counts. I consistently saw that the allocate stat
>> increased without the busy or waive stats increasing. I also added a stat
>> to track directly when we overwrote a page pointer that was non-NULL in
>> ice_reuse_rx_page(), and saw it increment consistently.
>>
>> With this fix, all of these indicators are fixed. I've tested both 1500
>> byte and 9000 byte MTU and no longer see the leak. With the counters I was
>> able to immediately see a leak within a few minutes of iperf3, so I am
>> confident that I've resolved the leak with this fix.
>> ---
>>  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>>  drivers/net/ethernet/intel/ice/ice_txrx.c | 71 ++++++++++++++++++++++++++++-------------------------------------------
>>  2 files changed, 28 insertions(+), 44 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
>> index a4b1e9514632..07155e615f75 100644
>> --- a/drivers/net/ethernet/intel/ice/ice_txrx.h
>> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
>> @@ -358,7 +358,6 @@ struct ice_rx_ring {
>>  struct ice_tx_ring *xdp_ring;
>>  struct ice_rx_ring *next;   /* pointer to next ring in q_vector */
>>  struct xsk_buff_pool *xsk_pool;
>> -u32 nr_frags;
>>  u16 max_frame;
>>  u16 rx_buf_len;
>>  dma_addr_t dma; /* physical address of ring */
>> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/i
