On 9/18/25 5:12 PM, Konstantin Ananyev wrote:


Subject: RE: Fixing MBUF_FAST_FREE TX offload requirements?

From: Bruce Richardson [mailto:[email protected]]
Sent: Thursday, 18 September 2025 11.09

On Thu, Sep 18, 2025 at 10:50:11AM +0200, Morten Brørup wrote:
Dear NIC driver maintainers (CC: DPDK Tech Board),

The DPDK Tech Board has discussed how patch [1] (included in DPDK
25.07) extended the documented requirements of the
RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload.
These changes put additional limitations on applications' use of the
MBUF_FAST_FREE TX offload, and made MBUF_FAST_FREE mutually exclusive
with MULTI_SEGS (which is typically used for jumbo frame support).
The Tech Board considers that these changes do not reflect the
intention of the MBUF_FAST_FREE TX offload, and wants to fix this.
Mainly, MBUF_FAST_FREE and MULTI_SEGS should not be mutually
exclusive.

The original RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE requirements were:
When set, the application must guarantee that
1) per queue, all mbufs come from the same mempool, and
2) mbufs have refcnt = 1.

The patch added the following requirements to the MBUF_FAST_FREE
offload, reflecting rte_pktmbuf_prefree_seg() postconditions:
3) mbufs are direct,
4) mbufs have next = NULL and nb_segs = 1.

Now, the key question is:
Can we roll back to the original two requirements?
Or do the drivers also depend on the third and/or fourth
requirements?
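
For illustration, here is a minimal sketch of the kind of TX completion
path FAST_FREE is meant to enable (the function names and the "done"
array are hypothetical; the mempool and mbuf calls are the real APIs):

#include <rte_mbuf.h>
#include <rte_mempool.h>

/* Hypothetical FAST_FREE completion path: all mbufs on the queue come
 * from one mempool and satisfy the requirements above, so they can be
 * returned in one bulk call, with no per-mbuf refcnt/indirect/chain
 * checks and no rte_pktmbuf_prefree_seg().
 */
static inline void
txq_fast_free(struct rte_mempool *mp, struct rte_mbuf **done, unsigned int n)
{
	rte_mempool_put_bulk(mp, (void **)done, n);
}

/* Without FAST_FREE, each segment goes through the generic path: */
static inline void
txq_generic_free(struct rte_mbuf **done, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++)
		rte_pktmbuf_free_seg(done[i]);
}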

<advertisement>
Drivers freeing mbufs directly to a mempool should use the new
rte_mbuf_raw_free_bulk() instead of rte_mempool_put_bulk(), so the
preconditions for freeing mbufs directly into a mempool are validated
in mbuf debug mode (with RTE_LIBRTE_MBUF_DEBUG enabled).
Similarly, rte_mbuf_raw_alloc_bulk() should be used instead of
rte_mempool_get_bulk().
</advertisement>
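
For example, assuming the new raw bulk functions mirror the
rte_mempool_{get,put}_bulk() signatures, the substitution in a
driver's free path would be a one-liner:

/* Before: bypasses all mbuf sanity checks, even in debug builds. */
rte_mempool_put_bulk(mp, (void **)done, n);

/* After: same fast path in release builds, but the refcnt == 1,
 * direct, next == NULL and nb_segs == 1 preconditions are asserted
 * when RTE_LIBRTE_MBUF_DEBUG is enabled. */
rte_mbuf_raw_free_bulk(mp, done, n);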

PS: The feature documentation [2] still reflects the original
requirements.

[1]: https://github.com/DPDK/dpdk/commit/55624173bacb2becaa67793b71391884876673c1
[2]: https://elixir.bootlin.com/dpdk/v25.07/source/doc/guides/nics/features.rst#L125


Venlig hilsen / Kind regards,
-Morten Brørup

I'm a little torn on this question, because I can see benefits for both
approaches. Firstly, it would be nice if we made FAST_FREE as accessible
for driver use as it was originally, with minimal requirements. However,
on looking at the code, I believe that many drivers actually took it to
mean that scattered packets couldn't occur in that case either, so the
use was incorrect.

I primarily look at Intel drivers, and that's how I read the driver code too.

Similarly, and secondly, if we do have the extra requirements for
FAST_FREE, it does mean that any use of it can be very, very minimal
and efficient, since we don't need to check anything before freeing the
buffers.

Given where we are now, I think keeping the more restrictive definition
of FAST_FREE is the way to go - keeping it exclusive with MULTI_SEGS -
because it means that we are less likely to have bugs. If we look to
change it back, I think we'd have to check all drivers to ensure they
are using the flag safely.

However, those driver bugs are not new.
If we haven't received bug reports from users affected by them, maybe we can
disregard them (in this discussion about pros and cons).
I prefer we register them as driver bugs, instead of changing the API to
accommodate bugs in the drivers.

From an application perspective, here's an idea for consideration:
Assuming that indirect mbufs are uncommon, we keep requirement #3.
To allow MULTI_SEGS (jumbo frames) with FAST_FREE, we get rid of
requirement #4.

Do we really need to enable FAST_FREE for jumbo frames?
Jumbo frames usually mean a much smaller PPS number, and the actual
RX/TX overhead becomes really tiny.

+1
Since the driver knows that refcnt == 1, the driver can set next = NULL
and nb_segs = 1 at any time, either when writing the TX descriptor
(when it reads the mbuf anyway), or when freeing the mbuf.
Regarding performance, this means that the driver's TX code path has to
write to the mbufs (i.e. adding the performance cost of memory store
operations) when segmented - but that is a universal requirement when
freeing segmented mbufs to the mempool.
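
A sketch of the idea (the function name is illustrative; the mbuf
fields and the mempool call are real):

/* Free a FAST_FREE segment chain: refcnt == 1 is guaranteed, so the
 * driver may restore the prefree postconditions itself while walking
 * the chain - one store per segment, as noted above - and then return
 * each segment to the queue's single mempool.
 */
static inline void
fast_free_seg_chain(struct rte_mempool *mp, struct rte_mbuf *m)
{
	struct rte_mbuf *next;

	while (m != NULL) {
		next = m->next;
		m->next = NULL;
		m->nb_segs = 1;
		rte_mempool_put(mp, m);
		m = next;
	}
}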

It might work, but I think it will become way too complicated.
Again, I don't know who is going to inspect/fix all the drivers.
Just not allowing FAST_FREE for multi-seg seems like a much simpler
and safer approach.
For even more optimized driver performance, as Bruce mentions...
If a port is configured for FAST_FREE and not MULTI_SEGS, the driver
can use a super lean transmit function.
Since the driver's transmit function pointer is per port (not per
queue), this would require the driver to provide the MULTI_SEGS
capability only per port, and not per queue. (Or we would have to add
a NOT_MULTI_SEGS offload flag, to ensure that no queue is configured
for MULTI_SEGS.)
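
Roughly like this, against the ethdev driver internals (the burst
function names are hypothetical):

/* In the driver's dev_start()/configure: pick the lean path only when
 * FAST_FREE is enabled and MULTI_SEGS is not, port-wide. */
uint64_t ol = dev->data->dev_conf.txmode.offloads;

if ((ol & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) != 0 &&
    (ol & RTE_ETH_TX_OFFLOAD_MULTI_SEGS) == 0)
	dev->tx_pkt_burst = xxx_xmit_pkts_fast_free; /* hypothetical */
else
	dev->tx_pkt_burst = xxx_xmit_pkts; /* hypothetical */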


FAST_FREE is not a real Tx offload, since there is no promise from the
driver to do something (unlike other Tx offloads, e.g. checksumming or
TSO). Is it a promise to ignore the refcount, or to only look at the
mempool of some packets? I guess not. If so, basically any driver may
advertise it and simply ignore it when the offload is requested; the
driver is free to do nothing with these limitations on the input data.

It is in fact a performance hint, and a promise from the application
to follow the specified limitations on Tx mbufs.

So, if the application specifies both FAST_FREE and MULTI_SEG, but the
driver code can't FAST_FREE with MULTI_SEG, it should just ignore
FAST_FREE. That's it. The performance hint is simply useless in this
case. There is no point in making FAST_FREE and MULTI_SEG mutually
exclusive. If some drivers can really support both - great. If not,
just ignore FAST_FREE and support MULTI_SEG.
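
In driver terms, that fallback could be as small as a per-queue flag
set at tx_queue_setup() time (the flag name is illustrative):

/* Honour FAST_FREE only when the queue's offloads allow the simple
 * free path; otherwise silently fall back to the generic
 * rte_pktmbuf_free_seg() path. */
txq->fast_free =
	(offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) != 0 &&
	(offloads & RTE_ETH_TX_OFFLOAD_MULTI_SEGS) == 0;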

"mbufs are direct" must be FAST_FREE requirement. Since otherwise
freeing is not simple. I guess is was simply lost in the original
definition of FAST_FREE.

I'm sorry for the late reply.
