Hi Ed,

If the two NICs sit on different NUMA nodes, then yes, it should indeed be
practical to allocate resources and run worker lcores accordingly.

For example, one can use API [1] on an initialised DPDK port to get its NUMA
socket ID. That value can then be used for mempool creation and queue setup.
API [2] can be used to find, among the available lcores, one that sits on the
matching NUMA node; that lcore can then be used to launch the Rx/Tx worker
with API [3]. If both ports of a port pair sit on the same NUMA node, and if
the traffic flows of separate pairs are independent of each other and not
intermixed, then this may in theory be a practical setup. Worth trying.
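
For illustration only, a minimal sketch of that flow (the pool name, the
queue/pool sizes and the rx_tx_worker() function are placeholders I made up,
not something from your application; the port is assumed to be configured
already):

#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

/* Placeholder worker; the real Rx/Tx burst loop is application-specific. */
static int
rx_tx_worker(void *arg)
{
        struct rte_mempool *mp = arg;

        (void)mp;
        return 0;
}

static int
launch_numa_aware(uint16_t port_id)
{
        /* [1] NUMA node of the (already configured) port. */
        int socket = rte_eth_dev_socket_id(port_id);
        unsigned int lcore;

        if (socket < 0)
                socket = SOCKET_ID_ANY;

        /* Mempool on the same node; sizes are illustrative. */
        struct rte_mempool *mp = rte_pktmbuf_pool_create("mp_port", 8192, 256,
                        0, RTE_MBUF_DEFAULT_BUF_SIZE, socket);
        if (mp == NULL)
                return -1;

        /* Rx/Tx queues on the same node as well. */
        if (rte_eth_rx_queue_setup(port_id, 0, 1024, socket, NULL, mp) != 0 ||
            rte_eth_tx_queue_setup(port_id, 0, 1024, socket, NULL) != 0)
                return -1;

        /* [2] Find a worker lcore on the matching NUMA node ... */
        RTE_LCORE_FOREACH_WORKER(lcore) {
                if (rte_lcore_to_socket_id(lcore) == (unsigned int)socket)
                        /* ... and [3] launch the Rx/Tx worker on it. */
                        return rte_eal_remote_launch(rx_tx_worker, mp, lcore);
        }

        return -1; /* No available lcore on that NUMA node. */
}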

Ideally, the business logic should also be run in the context of worker lcores.
The use of just one "housekeeping" lcore may not give the best performance.
I apologise in case I've got something wrong.
Thank you.

[1] 
https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#ad032e25f712e6ffeb0c19eab1ec1fd2e
[2] 
https://doc.dpdk.org/api-25.03/rte__lcore_8h.html#a023b4909f52c3cdf0351d71d2b5032bc
[3] 
https://doc.dpdk.org/api-25.03/rte__launch_8h.html#a2bf98eda211728b3dc69aa7694758c6d

On Wed, 9 Jul 2025, Lombardo, Ed wrote:

Hi Ivan,
Do you see any benefit to creating two mempools, one per NUMA node, versus both
on the same NUMA node as the NIC?

If I try creating hugepage memory on both NUMA nodes, plus the associated
mempools, do I need to have a DPDK lcore on each NUMA node, or can I get by
with one lcore (strictly for housekeeping)? We use POSIX threads in our
application, which DPDK knows nothing about.

Thanks,
Ed

-----Original Message-----
From: Lombardo, Ed
Sent: Tuesday, July 8, 2025 9:09 PM
To: Ivan Malov <ivan.ma...@arknetworks.am>
Cc: Stephen Hemminger <step...@networkplumber.org>; users <users@dpdk.org>
Subject: RE: dpdk Tx falling short

Hi Ivan,
I added the two mempools, one per port pair, as you suggested.
Tx performance improved, and turning on one port pair no longer affects the
second port pair. The Tx ring no longer fills up but drains to near empty.

Improved: Tx - 1.5 Mpps to 8.3 Mpps, 1.5 Mpps to 11.2 Mpps

I need to do the perf analysis again but wanted to provide you with the results.

I still need to improve Tx performance, but this is a much-needed breakthrough
(with your help).

Thanks,
Ed

-----Original Message-----
From: Ivan Malov <ivan.ma...@arknetworks.am>
Sent: Tuesday, July 8, 2025 12:53 PM
To: Lombardo, Ed <ed.lomba...@netscout.com>
Cc: Stephen Hemminger <step...@networkplumber.org>; users <users@dpdk.org>
Subject: RE: dpdk Tx falling short


Hi Ed,

On Tue, 8 Jul 2025, Lombardo, Ed wrote:

Hi Ivan,
Thanks, this clears up my confusion. Using API [2] to create one mempool for
the network Rx and Tx queues means it must be MP/MC. The CPU cycles spent in
common_ring_mp_enqueue increase as more ports are transmitting. Does the
transmit operation make the Rx and Tx queues fight for access to the mbuf
mempool because there is only one mempool?

Not really. Mempools in DPDK in general (and, in particular, as shown in your
monitor printout) have a per-lcore object cache, which, if I'm not mistaken,
exists to avoid such contention when accessing the pool. And, since only a
single pool is used in your case, the use of MP/MC seems logical, as does the
use of the per-lcore object cache. But it's not obvious whether this is
optimal in your case.

This is why you suggested creating two mempools, one for each pair of ports.

It could be a low-hanging fruit to do a quick check with two separate mempools,
probably also MP/MC (allocated via the same API [2]), to know whether it
affects performance or not. Again, as Stephen noted, this may even worsen CPU
cache performance, but maybe it still pays to do a quick check after all
(a rough sketch follows below).
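
If it helps, a hypothetical sketch of such a quick check (the pool names, mbuf
count and cache size are arbitrary placeholders; both pools remain MP/MC and
are still created via API [2]):

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* One pool per port pair, on the NUMA node of that pair's ports. */
static struct rte_mempool *
create_pair_pool(const char *name, uint16_t any_port_of_the_pair)
{
        return rte_pktmbuf_pool_create(name,
                        65536,                      /* number of mbufs */
                        256,                        /* per-lcore cache size */
                        0,                          /* private area size */
                        RTE_MBUF_DEFAULT_BUF_SIZE,
                        rte_eth_dev_socket_id(any_port_of_the_pair));
}

/*
 * Usage (hypothetical port numbering):
 *   struct rte_mempool *mp_pair_a = create_pair_pool("mp_pair_a", 0);
 *   struct rte_mempool *mp_pair_b = create_pair_pool("mp_pair_b", 2);
 * and then pass mp_pair_a / mp_pair_b to the Rx queue setup of the
 * respective port pairs.
 */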

If I go this route, what precautions do I need to take?

I will try RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag first.

This is somewhat unrelated to pools and rings, yet it should enable the PMD's
internal Tx handling to accumulate bulks of mbufs to be freed upon transmission
via bulk operations that, akin to Tx and Rx bursts, may also improve CPU cache
utilisation and overall performance. The only prerequisite is that all mbufs
passed to a given Tx queue come from the same mempool. Hopefully this holds
for you, provided the logic does not intermix packets from the two pools in the
same Tx queue (a sketch of requesting the flag follows below).
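
A hedged sketch of requesting the flag at configure time (queue counts are
placeholders; the capability check matters because not every PMD reports the
offload):

#include <rte_ethdev.h>

static int
configure_with_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
{
        struct rte_eth_dev_info dev_info;
        struct rte_eth_conf port_conf = {0};
        int ret;

        ret = rte_eth_dev_info_get(port_id, &dev_info);
        if (ret != 0)
                return ret;

        /* Request fast free only if the device actually supports it. */
        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
                port_conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

        /* Port-level Tx offloads are inherited by Tx queues set up later
         * (unless overridden via rte_eth_txconf.offloads). */
        return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf);
}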

Maybe Stephen's suggestion to use the Tx buffer API is also worth a shot.

Thank you.


Thanks,
Ed

-----Original Message-----
From: Ivan Malov <ivan.ma...@arknetworks.am>
Sent: Tuesday, July 8, 2025 10:49 AM
To: Lombardo, Ed <ed.lomba...@netscout.com>
Cc: Stephen Hemminger <step...@networkplumber.org>; users
<users@dpdk.org>
Subject: RE: dpdk Tx falling short


On Tue, 8 Jul 2025, Lombardo, Ed wrote:

Hi Ivan,
Yes, only the rings created in user space.
Can you elaborate on your thoughts?

I was seeking to address the probable confusion here. If the application
creates an SP / SC ring for its own pipeline logic using API [1] and then
invokes another API [2] to create a common "mbuf mempool" to be used with the
Rx and Tx queues of the network ports, then the observed appearance of
"common_ring_mp_enqueue" is likely attributable to the fact that API [2]
creates a ring-based mempool internally, and in MP / MC mode by default. The
latter ring is not the same as the one created by the application logic; these
are two independent rings (a sketch follows below).
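
For what it's worth, a sketch of selecting the mempool's internal ring ops
explicitly (the pool name and sizes are placeholders). Note that "ring_sp_sc"
would only be safe if exactly one lcore allocates and exactly one lcore frees
mbufs of this pool, which rarely holds in a multi-stage pipeline:

#include <rte_mbuf.h>

static struct rte_mempool *
create_pool_with_explicit_ops(void)
{
        /* rte_pktmbuf_pool_create() uses the default ops, typically
         * "ring_mp_mc", which is why common_ring_mp_enqueue shows up in the
         * perf output. The ops name can be chosen via this variant instead. */
        return rte_pktmbuf_pool_create_by_ops("mbuf_pool_spsc",
                        65536, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
                        SOCKET_ID_ANY, "ring_sp_sc");
}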

BTW, does your application set RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload flag 
when configuring Tx port/queue offloads on the network ports?

Thank you.

[1]
https://doc.dpdk.org/api-25.03/rte__ring_8h.html#a155cb48ef311eddae9b2e34808338b17
[2]
https://doc.dpdk.org/api-25.03/rte__mbuf_8h.html#a8f4abb0d54753d2fde515f35c1ba402a
[3]
https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5


Ed

-----Original Message-----
From: Ivan Malov <ivan.ma...@arknetworks.am>
Sent: Tuesday, July 8, 2025 10:19 AM
To: Lombardo, Ed <ed.lomba...@netscout.com>
Cc: Stephen Hemminger <step...@networkplumber.org>; users
<users@dpdk.org>
Subject: RE: dpdk Tx falling short


Hi Ed,

On Tue, 8 Jul 2025, Lombardo, Ed wrote:

Hi Stephen,
When I replace rte_eth_tx_burst() with an mbuf bulk free I do not see the tx
ring fill up. I think this is valuable information. Also, perf analysis of the
tx thread shows common_ring_mp_enqueue and rte_atomic32_cmpset, which I did
not expect to see since I created all the Tx rings as SP and SC (and the
worker and ack rings as well, essentially all 16 rings).

Perf report snippet:
+   57.25%  DPDK_TX_1  test  [.] common_ring_mp_enqueue
+   25.51%  DPDK_TX_1  test  [.] rte_atomic32_cmpset
+    9.13%  DPDK_TX_1  test  [.] i40e_xmit_pkts
+    6.50%  DPDK_TX_1  test  [.] rte_pause
     0.21%  DPDK_TX_1  test  [.] rte_mempool_ops_enqueue_bulk.isra.0
     0.20%  DPDK_TX_1  test  [.] dpdk_tx_thread

The traffic load is a constant 10 Gbps of 84-byte packets with no idle periods.
The burst size of 512 is the desired burst of mbufs; however, the tx thread
will transmit whatever it can get from the Tx ring.

I think resolving why the perf analysis shows the ring as MP, when it was
created as SP / SC, should resolve this issue.

The 'common_ring_mp_enqueue' is the enqueue method of the mempool variant
'ring', that is, a mempool based on an RTE ring internally. When you say that
the ring has been created as SP / SC, you seemingly refer to the regular RTE
ring created by your application logic, not to the internal ring of the
mempool. Am I missing something? (See the illustration below.)
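
To illustrate the distinction (the names, sizes and flags below are examples I
made up, not your actual code):

#include <rte_mbuf.h>
#include <rte_ring.h>

static void
illustrate_two_rings(void)
{
        /* SP / SC here applies only to this application-level ring. */
        struct rte_ring *tx_ring = rte_ring_create("tx_ring_1", 4096,
                        SOCKET_ID_ANY, RING_F_SP_ENQ | RING_F_SC_DEQ);

        /* The mbuf pool has its own internal ring, MP / MC by default;
         * common_ring_mp_enqueue in the perf report belongs to that one. */
        struct rte_mempool *mp = rte_pktmbuf_pool_create("mbuf_pool", 65536,
                        256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, SOCKET_ID_ANY);

        (void)tx_ring;
        (void)mp;
}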

Thank you.


Thanks,
ed

-----Original Message-----
From: Stephen Hemminger <step...@networkplumber.org>
Sent: Tuesday, July 8, 2025 9:47 AM
To: Lombardo, Ed <ed.lomba...@netscout.com>
Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
Subject: Re: dpdk Tx falling short


On Tue, 8 Jul 2025 04:10:05 +0000
"Lombardo, Ed" <ed.lomba...@netscout.com> wrote:

Hi Stephen,
I ensured that every pipeline stage that enqueues or dequeues mbufs uses the
burst version; perf showed the repercussions of doing single-mbuf dequeue and
enqueue.
For the receive stage rte_eth_rx_burst() is used, and in the Tx stage we use
rte_eth_tx_burst(). The burst size used in the tx thread for the dequeue burst
is 512 mbufs.

You might try buffering like rte_eth_tx_buffer does.
You would need to add an additional mechanism to ensure that the buffer gets
flushed when you detect an idle period (a sketch follows below).
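
A rough sketch of that pattern, assuming one Tx buffer per (port, queue); the
buffer size and the idle-detection policy are placeholders:

#include <stdbool.h>

#include <rte_ethdev.h>
#include <rte_malloc.h>
#include <rte_mbuf.h>

#define TX_BUF_PKTS 512 /* placeholder */

static struct rte_eth_dev_tx_buffer *
tx_buffer_create(int socket_id)
{
        struct rte_eth_dev_tx_buffer *buf;

        buf = rte_zmalloc_socket("tx_buffer",
                        RTE_ETH_TX_BUFFER_SIZE(TX_BUF_PKTS), 0, socket_id);
        if (buf != NULL)
                rte_eth_tx_buffer_init(buf, TX_BUF_PKTS);

        return buf;
}

/* In the Tx loop: buffer each mbuf; the buffer transmits automatically once
 * full, and an explicit flush covers the idle case mentioned above. */
static void
tx_buffered(uint16_t port, uint16_t queue, struct rte_eth_dev_tx_buffer *buf,
            struct rte_mbuf *m, bool idle_detected)
{
        if (m != NULL)
                rte_eth_tx_buffer(port, queue, buf, m);

        if (idle_detected)
                rte_eth_tx_buffer_flush(port, queue, buf);
}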



