Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
On Tue, Aug 26, 2025 at 10:35:30AM +0200, Jesper Dangaard Brouer wrote:
>
> On 26/08/2025 01.00, Jacob Keller wrote:
> > XDP_DROP performance has been tested for this version, thanks to work from
> > Michal Kubiak. The results are quite promising, with 3 versions being
> > compared:
> >
> > * baseline net-next tree
> > * v1 applied
> > * v2 applied
> >
> > Michal said:
> >
> >   I run the XDP_DROP performance comparison tests on my setup in the way I
> >   usually do. I didn't have the pktgen configured on my link partner, but I
> >   used 6 instances of the xdpsock running in Tx-only mode to generate
> >   high-bandwidth traffic. Also, I tried to replicate the conditions
> >   according to Jesper's description, making sure that all the traffic is
> >   directed to a single Rx queue and one CPU is 100% loaded.
>
> Thank you for replicating the test setup. Using xdpsock as a traffic
> generator is fine, as long as we make sure that the generator TX speed
> exceeds the Device Under Test RX XDP_DROP speed. It is also important
> for the test that packets hit a single RX queue and we verify one CPU
> is 100% loaded, as you describe.
>
> As a reminder, the pktgen kernel module comes with ready-to-use sample
> shell scripts [1].
>
> [1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen

Thank you! I am aware of that and also use those scripts. The xdpsock
solution was just the quickest option for that specific moment, so I
decided not to change my link partner setup (since I successfully
reproduced the performance drop in v1).

> > The performance hit from v1 is replicated, and shown to be gone in v2,
> > with our results showing even an increase compared to baseline instead
> > of a drop. I've included the relative packet per second deltas compared
> > against a baseline test with neither v1 nor v2.
>
> Thanks for also replicating the performance hit from v1 as I did in [2].
>
> To Michal: What CPU did you use?
> - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)

In my test I used: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz

> One of the reasons that I saw a larger percentage drop is that this CPU
> doesn't have DDIO/DCA, which delivers the packet to L3 cache (and an L2
> cache miss will obviously take less time than a full main-memory cache
> miss). (Details: newer AMD CPUs will get something called PCIe TLP
> Processing Hints (TPH), which resembles DDIO.)
>
> Point is that I see some opportunities in the driver to move some of the
> prefetches earlier. But we want to make sure it benefits both CPU types,
> and I can test on the AMD platform. (This CPU is a large part of our
> fleet, so it makes sense for us to optimize this.)
>
> > baseline to v1, no-touch:
> >   -8,387,677 packets per second (17%) decrease.
> >
> > baseline to v2, no-touch:
> >   +4,057,000 packets per second (8%) increase!
> >
> > baseline to v1, read data:
> >   -411,709 packets per second (1%) decrease.
> >
> > baseline to v2, read data:
> >   +4,331,857 packets per second (11%) increase!
>
> Thanks for providing these numbers.
>
> I would also like to know the throughput PPS packet numbers before and
> after, as this allows me to calculate the nanosec difference. Percentages
> are usually useful, but they can be misleading when dealing with XDP_DROP
> speeds, because a small nanosec change will get "magnified" too much.

I was usually told to share the percentage data, because absolute numbers
may depend on various circumstances. However, I understand your point
regarding XDP_DROP; in such a case it may be justified.

Please see my raw results (from xdp-bench summary) below:

net-next (main) (drop, no touch)
  Duration          : 105.7s
  Packets received  : 4,960,778,583
  Average packets/s : 46,951,873
  Rx dropped        : 4,960,778,583

net-next (main) (drop, read data)
  Duration          : 94.5s
  Packets received  : 3,524,346,352
  Average packets/s : 37,295,056
  Rx dropped        : 3,524,346,352

net-next (main+v1) (drop, no touch)
  Duration          : 122.5s
  Packets received  : 4,722,510,839
  Average packets/s : 38,564,196
  Rx dropped        : 4,722,510,839

net-next (main+v1) (drop, read data)
  Duration          : 115.7s
  Packets received  : 4,265,991,147
  Average packets/s : 36,883,347
  Rx dropped        : 4,265,991,147

net-next (main+v2) (drop, no touch)
  Duration          : 130.6s
  Packets received  : 6,664,104,907
  Average packets/s : 51,008,873
  Rx dropped        : 6,664,104,907

net-next (main+v2) (drop, read data)
  Duration          : 143.6s
  Packets received  : 5,975,991,044
  Average packets/s : 41,626,913
  Rx dropped        : 5,975,991,044

Thanks,
Michal
Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
On 26/08/2025 01.00, Jacob Keller wrote:
> XDP_DROP performance has been tested for this version, thanks to work from
> Michal Kubiak. The results are quite promising, with 3 versions being
> compared:
>
> * baseline net-next tree
> * v1 applied
> * v2 applied
>
> Michal said:
>
>   I run the XDP_DROP performance comparison tests on my setup in the way I
>   usually do. I didn't have the pktgen configured on my link partner, but I
>   used 6 instances of the xdpsock running in Tx-only mode to generate
>   high-bandwidth traffic. Also, I tried to replicate the conditions
>   according to Jesper's description, making sure that all the traffic is
>   directed to a single Rx queue and one CPU is 100% loaded.

Thank you for replicating the test setup. Using xdpsock as a traffic
generator is fine, as long as we make sure that the generator TX speed
exceeds the Device Under Test RX XDP_DROP speed. It is also important
for the test that packets hit a single RX queue and we verify one CPU
is 100% loaded, as you describe.

As a reminder, the pktgen kernel module comes with ready-to-use sample
shell scripts [1].

[1] https://elixir.bootlin.com/linux/v6.16.3/source/samples/pktgen

> The performance hit from v1 is replicated, and shown to be gone in v2,
> with our results showing even an increase compared to baseline instead
> of a drop. I've included the relative packet per second deltas compared
> against a baseline test with neither v1 nor v2.

Thanks for also replicating the performance hit from v1 as I did in [2].

To Michal: What CPU did you use?
 - I used CPU: AMD EPYC 9684X (with SRSO=IBPB)

One of the reasons that I saw a larger percentage drop is that this CPU
doesn't have DDIO/DCA, which delivers the packet to L3 cache (and an L2
cache miss will obviously take less time than a full main-memory cache
miss). (Details: newer AMD CPUs will get something called PCIe TLP
Processing Hints (TPH), which resembles DDIO.)

Point is that I see some opportunities in the driver to move some of the
prefetches earlier. But we want to make sure it benefits both CPU types,
and I can test on the AMD platform. (This CPU is a large part of our
fleet, so it makes sense for us to optimize this.)

> baseline to v1, no-touch:
>   -8,387,677 packets per second (17%) decrease.
>
> baseline to v2, no-touch:
>   +4,057,000 packets per second (8%) increase!
>
> baseline to v1, read data:
>   -411,709 packets per second (1%) decrease.
>
> baseline to v2, read data:
>   +4,331,857 packets per second (11%) increase!

Thanks for providing these numbers.

I would also like to know the throughput PPS packet numbers before and
after, as this allows me to calculate the nanosec difference. Percentages
are usually useful, but they can be misleading when dealing with XDP_DROP
speeds, because a small nanosec change will get "magnified" too much.

> ---
> Changes in v2:
> - Only access shared info for fragmented frames
> - Link to v1: https://lore.kernel.org/netdev/[email protected]/

[2] https://lore.kernel.org/netdev/[email protected]/

> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.h |  1 -
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 80 +--
>  2 files changed, 34 insertions(+), 47 deletions(-)

Acked-by: Jesper Dangaard Brouer
Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
Dear Jacob,
Thank you for your patch.
On 26.08.25 at 01:00, Jacob Keller wrote:
[Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.
It works by iterating over the buffers from first_desc up to 1 plus the
total number of fragments in the frame, cached from before the XDP program
was executed.
If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag(). Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().
Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.
The ice_reuse_rx_page() function assumes that next_to_alloc has been
incremented properly, and that it always points to a buffer with a NULL
page. Since this function doesn't check, it will happily recycle a page
over the top of the next_to_alloc buffer, losing track of the old page.
Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware hands us descriptors with a size of 0 sometimes,
but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().
Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches the next_to_clean element just
past the current frame. Unlike i40e, the ice_put_rx_mbuf() function does
call ice_put_rx_buf() on the last buffer of the frame indicating the end of
packet.
For non-linear (multi-buffer) frames, we need to take care when adjusting
the pagecnt_bias. An XDP program might release fragments from the tail of
the frame, in which case that fragment page is already released. Only
update the pagecnt_bias for the first descriptor and fragments still
remaining post-XDP program. Take care to only access the shared info for
fragmented buffers, as this avoids a significant cache miss.
The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.
Move the increment of the ntc local variable to ensure it is updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.
Now that we use an index pointer in the ring to identify the packet, we no
longer need to track or cache the number of fragments in the rx_ring.
Cc: Christoph Petrausch
Cc: Jesper Dangaard Brouer
Reported-by: Jaroslav Pulchart
Closes:
https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Tested-by: Michal Kubiak
Signed-off-by: Jacob Keller
---
I've tested this in a setup with MTU 9000, using a combination of iperf3
and wrk generated traffic.
I tested this in a couple of ways. First, I check memory allocations using
/proc/allocinfo:
awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo |
numfmt --to=iec
Second, I ported some stats from i40e written by Joe Damato to track the
page allocation and busy counts. I consistently saw that the allocate stat
increased without the busy or waive stats increasing. I also added a stat
to track directly when we overwrote a page pointer that was non-NULL in
ice_reuse_rx_page(), and saw it increment consistently.
With this fix, all of these indicators are fixed. I've tested both 1500
byte and 9000 byte MTU and no longer see the leak. With the counters I was
able to immediately see a leak within a few minutes of iperf3, so I am
confident that I've resolved the leak with this fix.
I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIRECT work
properly with the fix for updating xdp_xmit.
This version (v2) avoids the cache miss regression reported by Jesper. I
refactored a bit to only check the shared info if the XDP buffer is
fragmented. I considered adding a helper function to do this to the XDP
header file. However, I scanned several drivers and noticed that only ice
and i40e access the nr_frags in this way. I believe the ice variant will
be removed with the conversion to page pool, so I don't think adding a
helper is warranted.
XDP_DROP performance has been tested for this version, thanks to work
from Michal Kubiak.
> > To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on > all buffers between first_desc and next_to_clean. Borrow the logic of a > similar function in i40e used for this same purpose. Use the same logic also > in ice_get_pgcnts(). > > Instead of iterating over just the number of fragments, use a loop which > iterates until the current index reaches to the next_to_clean element just > past the current frame. Check the current number of fragments (post XDP > program). For all buffers up 1 more than the number of fragments, we'll > update the pagecnt_bias. For any buffers past this, pagecnt_bias is left > as-is. > This ensures that fragments released by the XDP program, as well as any > buffers with zero-size won't have their pagecnt_bias updated incorrectly. > Unlike i40e, the ice_put_rx_mbuf() function does call > ice_put_rx_buf() on the last buffer of the frame indicating end of packet. > > The xdp_xmit value only needs to be updated if an XDP program is run, and > only once per packet. Drop the xdp_xmit pointer argument from > ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function > directly. This avoids needing to pass the argument and avoids an extra bit- > wise OR for each buffer in the frame. > > Move the increment of the ntc local variable to ensure its updated *before* > all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic requires > the index of the element just after the current frame. > > This has the advantage that we also no longer need to track or cache the > number of fragments in the rx_ring, which saves a few bytes in the ring. 
> > Cc: Christoph Petrausch > Reported-by: Jaroslav Pulchart > Closes: > https://lore.kernel.org/netdev/CAK8fFZ4hY6GUJNENz3wY9jaYLZXGfpr7dnZxz > [email protected]/ > Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current > frame") > Signed-off-by: Jacob Keller > --- > I've tested this in a setup with MTU 9000, using a combination of iperf3 and > wrk generated traffic. > > I tested this in a couple of ways. First, I check memory allocations using > /proc/allocinfo: > > awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo > | numfmt --to=iec > > Second, I ported some stats from i40e written by Joe Damato to track the > page allocation and busy counts. I consistently saw that the allocate stat > increased without the busy or waive stats increasing. I also added a stat to > track directly when we overwrote a page pointer that was non-NULL in > ice_reuse_rx_page(), and saw it increment consistently. > > With this fix, a
Re: [Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
On Fri, Jul 11, 2025 at 05:23:49PM -0700, Jacob Keller wrote:
> [commit message quoted in full; trimmed]
I have finally reviewed this and tested it with xdpsock in copy mode, running a
custom bpf program that consumes 4500 bytes via bpf_xdp_adjust_tail so
that frag consumption by the bpf program is triggered. I don't see any of the
alarming issues that usually popped up in /proc/meminfo.
Reviewed-by: Maciej Fijalkowski
Tested-by: Maciej Fijalkowski
Once again, thank you Jake!
> [testing notes and v2 changelog quoted; trimmed]
[Intel-wired-lan] [PATCH iwl-net v2] ice: fix Rx page leak on multi-buffer frames
The ice_put_rx_mbuf() function handles calling ice_put_rx_buf() for each
buffer in the current frame. This function was introduced as part of
handling multi-buffer XDP support in the ice driver.
It works by iterating over the buffers from first_desc up to one past the
total number of fragments in the frame, a count cached from before the XDP
program was executed.
If the hardware posts a descriptor with a size of 0, the logic used in
ice_put_rx_mbuf() breaks. Such descriptors get skipped and don't get added
as fragments in ice_add_xdp_frag. Since the buffer isn't counted as a
fragment, we do not iterate over it in ice_put_rx_mbuf(), and thus we don't
call ice_put_rx_buf().
Because we don't call ice_put_rx_buf(), we don't attempt to re-use the
page or free it. This leaves a stale page in the ring, as we don't
increment next_to_alloc.
The ice_reuse_rx_page() assumes that the next_to_alloc has been incremented
properly, and that it always points to a buffer with a NULL page. Since
this function doesn't check, it will happily recycle a page over the top
of the next_to_alloc buffer, losing track of the old page.
Note that this leak only occurs for multi-buffer frames. The
ice_put_rx_mbuf() function always handles at least one buffer, so a
single-buffer frame will always get handled correctly. It is not clear
precisely why the hardware hands us descriptors with a size of 0 sometimes,
but it happens somewhat regularly with "jumbo frames" used by 9K MTU.
To fix ice_put_rx_mbuf(), we need to make sure to call ice_put_rx_buf() on
all buffers between first_desc and next_to_clean. Borrow the logic of a
similar function in i40e used for this same purpose. Use the same logic
also in ice_get_pgcnts().
Instead of iterating over just the number of fragments, use a loop which
iterates until the current index reaches the next_to_clean element just
past the current frame. Check the current number of fragments (post XDP
program). For all buffers up to 1 more than the number of fragments, update
the pagecnt_bias. For any buffers past this, leave pagecnt_bias as-is.
This ensures that fragments released by the XDP program, as well as
any buffers with zero size, won't have their pagecnt_bias updated
incorrectly. Unlike i40e, the ice_put_rx_mbuf() function does call
ice_put_rx_buf() on the last buffer of the frame indicating end of packet.
The xdp_xmit value only needs to be updated if an XDP program is run, and
only once per packet. Drop the xdp_xmit pointer argument from
ice_put_rx_mbuf(). Instead, set xdp_xmit in the ice_clean_rx_irq() function
directly. This avoids needing to pass the argument and avoids an extra
bit-wise OR for each buffer in the frame.
Move the increment of the ntc local variable to ensure it is updated *before*
all calls to ice_get_pgcnts() or ice_put_rx_mbuf(), as the loop logic
requires the index of the element just after the current frame.
This has the advantage that we also no longer need to track or cache the
number of fragments in the rx_ring, which saves a few bytes in the ring.
Cc: Christoph Petrausch
Reported-by: Jaroslav Pulchart
Closes: https://lore.kernel.org/netdev/cak8ffz4hy6gujnenz3wy9jaylzxgfpr7dnzxzgmyoe44car...@mail.gmail.com/
Fixes: 743bbd93cf29 ("ice: put Rx buffers after being done with current frame")
Signed-off-by: Jacob Keller
---
I've tested this in a setup with MTU 9000, using a combination of iperf3
and wrk generated traffic.
I tested this in a couple of ways. First, I checked memory allocations using
/proc/allocinfo:
awk '/ice_alloc_mapped_page/ { printf("%s %s\n", $1, $2) }' /proc/allocinfo |
numfmt --to=iec
Second, I ported some stats from i40e written by Joe Damato to track the
page allocation and busy counts. I consistently saw that the allocate stat
increased without the busy or waive stats increasing. I also added a stat
to track directly when we overwrote a page pointer that was non-NULL in
ice_reuse_rx_page(), and saw it increment consistently.
With this fix, all of these indicators are fixed. I've tested both 1500
byte and 9000 byte MTU and no longer see the leak. With the counters I was
able to immediately see a leak within a few minutes of iperf3, so I am
confident that I've resolved the leak with this fix.
I've now also tested with xdp-bench and confirm that XDP_TX and XDP_REDIRECT work
properly with the fix for updating xdp_xmit.
---
Changes in v2:
- Fix XDP Tx/Redirect (Thanks Maciej!)
- Link to v1:
https://lore.kernel.org/r/[email protected]
---
drivers/net/ethernet/intel/ice/ice_txrx.h | 1 -
drivers/net/ethernet/intel/ice/ice_txrx.c | 81 +--
2 files changed, 33 insertions(+), 49 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index a4b1e9514632..07155e615f75 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -358,7 +358,6 @@ struct ice_rx_ring {
struct
