Hi Stephen,
I ensured that every pipeline stage that enqueues or dequeues mbufs uses the 
burst version; perf showed the repercussions of doing single-mbuf dequeues and 
enqueues.
The receive stage uses rte_eth_rx_burst() and the Tx stage uses 
rte_eth_tx_burst().  The dequeue burst size used in tx_thread is 512 mbufs.
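
For illustration, here is a minimal sketch of what such a burst-dequeue-then-burst-transmit
loop can look like (tx_ring, port_id, queue_id and the drop handling are placeholders for
this example, not our actual code):

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define TX_BURST_SIZE 512

    /* Pull a burst of mbuf pointers off the ring and hand the whole array to
     * the PMD in one call; retry until accepted or the Tx ring stops taking. */
    static void tx_burst_once(struct rte_ring *tx_ring, uint16_t port_id,
                              uint16_t queue_id)
    {
        struct rte_mbuf *pkts[TX_BURST_SIZE];
        unsigned int n = rte_ring_dequeue_burst(tx_ring, (void **)pkts,
                                                TX_BURST_SIZE, NULL);
        unsigned int sent = 0;

        while (sent < n) {
            uint16_t ret = rte_eth_tx_burst(port_id, queue_id,
                                            pkts + sent, n - sent);
            if (ret == 0)
                break;                  /* hardware Tx ring full */
            sent += ret;
        }
        if (sent < n)                   /* placeholder policy: drop the rest */
            rte_pktmbuf_free_bulk(pkts + sent, n - sent);
    }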

Thanks,
Ed

-----Original Message-----
From: Stephen Hemminger <step...@networkplumber.org> 
Sent: Monday, July 7, 2025 7:04 PM
To: Ivan Malov <ivan.ma...@arknetworks.am>
Cc: Lombardo, Ed <ed.lomba...@netscout.com>; users <users@dpdk.org>
Subject: Re: dpdk Tx falling short

On Tue, 8 Jul 2025 01:49:44 +0400 (+04)
Ivan Malov <ivan.ma...@arknetworks.am> wrote:

> Hi Ed,
> 
> On Mon, 7 Jul 2025, Lombardo, Ed wrote:
> 
> > Hi Stephen,
> > I ran a perf diff on two perf records, and it reveals the real problem with
> > the tx thread in transmitting packets.
> >
> > The comparison is traffic received on ifn3 and transmitted on ifn4 versus
> > traffic received on ifn3 and ifn5 and transmitted on ifn4 and ifn6.
> > When transmitting packets on one port the performance is better; however, when
> > transmitting on two ports the performance across the two drops dramatically.
> >
> > There is a 55.29% increase in CPU time spent in common_ring_mp_enqueue and
> > 54.18% less time in i40e_xmit_pkts (was E810, tried x710).
> > common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf pointers
> > passed in to rte_eth_tx_burst() have to be multi-producer?
> 
> I may be wrong, but rte_eth_tx_burst(), as part of what is known as the "reap"
> process, should check for "done" Tx descriptors resulting from previous
> invocations and free (enqueue) the associated mbufs into their respective
> mempools.
> In your case, you say you only have a single mempool shared between 
> the port pairs, which, as I understand, are served by concurrent 
> threads, so it might be logical to use a multi-producer mempool in this case. 
> Or am I missing something?
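
(As an aside, the point at which a PMD performs that reap can usually be nudged via
tx_free_thresh when setting up the Tx queue. A minimal sketch, assuming port_id,
queue_id, nb_txd and socket_id come from the application's existing init code; the
threshold value is only a placeholder:)

    #include <rte_ethdev.h>

    /* Start from the PMD's default Tx config and adjust the free threshold:
     * once this many descriptors are "done", completed mbufs are returned to
     * their mempool (0 keeps the PMD default). */
    static int setup_tx_queue(uint16_t port_id, uint16_t queue_id,
                              uint16_t nb_txd, unsigned int socket_id)
    {
        struct rte_eth_dev_info dev_info;
        struct rte_eth_txconf txconf;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
            return -1;
        txconf = dev_info.default_txconf;
        txconf.tx_free_thresh = 64;     /* placeholder value */
        return rte_eth_tx_queue_setup(port_id, queue_id, nb_txd,
                                      socket_id, &txconf);
    }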
> 
> The pktmbuf API for mempool allocation is a wrapper around the generic API,
> and it might request multi-producer/multi-consumer behaviour by default
> (see [1], 'flags').
> According to your original mempool monitor printout, the per-lcore 
> cache size is 512. On the premise that separate lcores serve the two 
> port pairs, and taking into account the burst size, it should be OK, 
> yet you may want to play with the per-lcore cache size argument when creating 
> the pool. Does it change anything?
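
For example, a minimal sketch of varying that argument through the pktmbuf wrapper
(pool name and counts are placeholders; the cache cannot exceed
RTE_MEMPOOL_CACHE_MAX_SIZE, which is 512):

    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* Create the shared packet pool; cache_size is the per-lcore object cache
     * that mempool get/put use before touching the shared ring. */
    static struct rte_mempool *create_pkt_pool(unsigned int nb_mbufs,
                                               unsigned int cache_size)
    {
        return rte_pktmbuf_pool_create("pkt_pool",         /* placeholder name */
                                       nb_mbufs, cache_size,
                                       0,                  /* priv size */
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       rte_socket_id());
    }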
> 
> Regarding separate mempools: I saw Stephen's response about those
> making CPU cache behaviour worse and not better. Makes sense and I
> won't argue. And yet, why not just try it and make sure this indeed holds
> in this particular case? Also, since you're seeking single-producer
> behaviour, having separate per-port-pair mempools might allow you to
> create such pools (again, see 'flags' at [1]), provided that API [1] is used
> for mempool creation. Please correct me in case I'm mistaken.
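
If you do try per-port-pair pools, single-producer/single-consumer behaviour has to be
requested through the generic API [1]. A sketch only, with placeholder sizes, following
the element-size and init-callback recipe described for rte_pktmbuf_pool_create():

    #include <rte_lcore.h>
    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    /* One pool per port pair, flagged single-producer (puts from the Tx/free
     * side) and single-consumer (gets from the Rx/alloc side). */
    static struct rte_mempool *create_sp_sc_pool(const char *name,
                                                 unsigned int nb_mbufs)
    {
        uint16_t data_room = RTE_MBUF_DEFAULT_BUF_SIZE;
        unsigned int elt_size = sizeof(struct rte_mbuf) + data_room;

        return rte_mempool_create(name, nb_mbufs, elt_size,
                512,                                     /* per-lcore cache */
                sizeof(struct rte_pktmbuf_pool_private), /* pool private area */
                rte_pktmbuf_pool_init, NULL,  /* data room derived from elt_size */
                rte_pktmbuf_init, NULL,       /* per-mbuf initialisation */
                rte_socket_id(),
                RTE_MEMPOOL_F_SP_PUT | RTE_MEMPOOL_F_SC_GET);
    }

Whether RTE_MEMPOOL_F_SC_GET is safe depends on only one thread allocating from the
pool; if several threads allocate from it, drop that flag and keep only
RTE_MEMPOOL_F_SP_PUT.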
> 
> Also, PMDs can support "fast free" Tx offload. Please see [2] to check 
> whether the application asks for this offload flag or not. It may be worth 
> enabling.
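
Roughly, checking for and requesting it could look like this (a sketch; port_conf is
assumed to be the rte_eth_conf the application already passes to
rte_eth_dev_configure()):

    #include <rte_ethdev.h>

    /* Enable "fast free" on Tx if the PMD advertises it. */
    static void enable_fast_free(uint16_t port_id, struct rte_eth_conf *port_conf)
    {
        struct rte_eth_dev_info dev_info;

        if (rte_eth_dev_info_get(port_id, &dev_info) != 0)
            return;
        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
            port_conf->txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
    }

Note that this offload requires all mbufs on a given Tx queue to come from one mempool
and to have a reference count of 1.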
> 
> [1] https://urldefense.com/v3/__https://doc.dpdk.org/api-25.03/rte__mempool_8h.html*a0b64d611bc140a4d2a0c94911580efd5__;Iw!!Nzg7nt7_!EvEznHI_mP3GsiSVrhbQfDE2va8UxZ5-8okSD-Cq_gTm9nP0Q34d6XPWYoQhUGqoJifjjk4Na1a8j5EZHSqWzqXGztg$
> [2] https://urldefense.com/v3/__https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html*a43f198c6b59d965130d56fd8f40ceac1__;Iw!!Nzg7nt7_!EvEznHI_mP3GsiSVrhbQfDE2va8UxZ5-8okSD-Cq_gTm9nP0Q34d6XPWYoQhUGqoJifjjk4Na1a8j5EZHSqWypWjs8A$
> 
> Thank you.
> 
> >
> > Is there a way to change dpdk to use single-producer?
> >
> > # Event 'cycles'
> > #
> > # Baseline  Delta Abs  Shared Object      Symbol
> > # ........  .........  .................  ..........................
> > #
> >   36.37%    +55.29%    test               [.] common_ring_mp_enqueue
> >   62.36%    -54.18%    test               [.] i40e_xmit_pkts
> >    1.10%     -0.94%    test               [.] dpdk_tx_thread
> >    0.01%     -0.01%    [kernel.kallsyms]  [k] native_sched_clock
> >              +0.00%    [kernel.kallsyms]  [k] fill_pmd
> >              +0.00%    [kernel.kallsyms]  [k] perf_sample_event_took
> >    0.00%     +0.00%    [kernel.kallsyms]  [k] __flush_smp_call_function_queue
> >    0.02%               [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
> >    0.02%               [kernel.kallsyms]  [k] native_irq_return_iret
> >    0.02%               [kernel.kallsyms]  [k] native_tss_update_io_bitmap
> >    0.01%               [kernel.kallsyms]  [k] ktime_get
> >    0.01%               [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> >    0.01%               [kernel.kallsyms]  [k] __update_blocked_fair
> >    0.01%               [kernel.kallsyms]  [k] perf_adjust_freq_unthr_events
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Lombardo, Ed
> > Sent: Sunday, July 6, 2025 1:45 PM
> > To: Stephen Hemminger <step...@networkplumber.org>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: RE: dpdk Tx falling short
> >
> > Hi Stephen,
> > If using DPDK rings comes with this penalty, then what should I use? Is
> > there an alternative to rings?  We do not want to use shared memory and do
> > buffer copies.
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Stephen Hemminger <step...@networkplumber.org>
> > Sent: Sunday, July 6, 2025 12:03 PM
> > To: Lombardo, Ed <ed.lomba...@netscout.com>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: Re: dpdk Tx falling short
> >
> > On Sun, 6 Jul 2025 00:03:16 +0000
> > "Lombardo, Ed" <ed.lomba...@netscout.com> wrote:
> >  
> >> Hi Stephen,
> >> Here are comments on the list of obvious causes of cache misses you
> >> mentioned.
> >>
> >> Obvious cache misses.
> >>  - passing packets to worker with ring - we use lots of rings to pass mbuf
> >> pointers.  If I skip rte_eth_tx_burst() and just free the mbufs in bulk, the
> >> tx ring does not fill up.
> >>  - using spinlocks (cost 16ns) - The driver does not use spinlocks, other
> >> than what DPDK itself uses.
> >>  - fetching TSC - We don't do this; we let the Rx timestamp offload stamp the packets.
> >>  - syscalls? - No syscalls are done in our driver fast path.
> >>
> >> You mention "passing packets to worker with ring"; do you mean that using
> >> rings to pass mbuf pointers causes cache misses and should be avoided?
> >
> > Rings do cause data to be modified by one core and examined by another, so
> > they are a cache miss.
> >
> >  

How many packets is your application seeing per burst? Ideally it should be
getting chunks, not a single packet at a time. And then the driver can use
deferred free to put back bursts.
If you have a multi-stage pipeline it helps if you pass a burst to each stage
rather than looping over the burst in the outer loop. Imagine getting a burst
of 16 packets. If you pass an array down the pipeline, then there is one call
per stage for the whole burst. If you process packets one at a time, it can
mean 16 calls per stage, and if the pipeline exceeds the instruction cache it
can mean 16 cache misses.
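
A toy illustration of the two shapes (the stage functions are made up for the example):

    #include <rte_mbuf.h>

    /* Hypothetical stages that each accept a whole burst. */
    static void stage_parse(struct rte_mbuf **pkts, uint16_t n)    { (void)pkts; (void)n; }
    static void stage_classify(struct rte_mbuf **pkts, uint16_t n) { (void)pkts; (void)n; }
    static void stage_output(struct rte_mbuf **pkts, uint16_t n)   { (void)pkts; (void)n; }

    /* Burst-oriented: each stage's code stays hot in the I-cache while it
     * runs over all n packets, then the next stage does the same. */
    static void pipeline_burst(struct rte_mbuf **pkts, uint16_t n)
    {
        stage_parse(pkts, n);
        stage_classify(pkts, n);
        stage_output(pkts, n);
    }

    /* Packet-at-a-time: every packet walks the whole pipeline, so the
     * combined instruction footprint of all stages is touched n times. */
    static void pipeline_per_packet(struct rte_mbuf **pkts, uint16_t n)
    {
        for (uint16_t i = 0; i < n; i++) {
            stage_parse(&pkts[i], 1);
            stage_classify(&pkts[i], 1);
            stage_output(&pkts[i], 1);
        }
    }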

The point is that bursting is a big win for both the data and instruction caches.
If you really want to tune, investigate prefetching like VPP does.
