On Tue, 8 Jul 2025 01:49:44 +0400 (+04)
Ivan Malov <ivan.ma...@arknetworks.am> wrote:

> Hi Ed,
> 
> On Mon, 7 Jul 2025, Lombardo, Ed wrote:
> 
> > Hi Stephen,
> > I ran a perf diff on two perf records, and it reveals the real problem
> > with the tx thread's packet transmission.
> >
> > The comparison is traffic received on ifn3 and transmitted on ifn4 versus
> > traffic received on ifn3 and ifn5 and transmitted on ifn4 and ifn6.
> > When transmitting on one port the performance is good, but when
> > transmitting on two ports the performance across both drops dramatically.
> >
> > There is a 55.29% increase in CPU time spent in common_ring_mp_enqueue and
> > 54.18% less time in i40e_xmit_pkts (was E810, tried X710).
> > common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf
> > pointers passed in to rte_eth_tx_burst() have to be multi-producer?
> 
> I may be wrong, but rte_eth_tx_burst(), as part of what is known as the
> "reap" process, should check for "done" Tx descriptors resulting from
> previous invocations and free (enqueue) the associated mbufs back into
> their respective mempools. In your case, you say you have only a single
> mempool shared between the port pairs, which, as I understand, are served
> by concurrent threads, so it might be logical to use a multi-producer
> mempool in this case. Or am I missing something?
> 
> The pktmbuf API for mempool allocation is a wrapper around the generic API,
> and it might request multi-producer multi-consumer by default (see [1],
> 'flags'). According to your original mempool monitor printout, the per-lcore
> cache size is 512. On the premise that separate lcores serve the two port
> pairs, and taking into account the burst size, that should be OK, yet you
> may want to play with the per-lcore cache size argument when creating the
> pool. Does it change anything?
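>
> For instance (sizes below are placeholders, not recommendations), the cache
> size is the cache_size argument of the pktmbuf wrapper:
>
> #include <rte_mbuf.h>
> #include <rte_lcore.h>
>
> /* Sketch: a pktmbuf pool with an explicit per-lcore cache size.
>  * 8192 mbufs and a 256-entry cache are placeholder values, not
>  * something tuned for your traffic. */
> static struct rte_mempool *
> make_pool(void)
> {
>         return rte_pktmbuf_pool_create("pkt_pool", 8192,
>                                        256,  /* per-lcore cache */
>                                        0,    /* application private area */
>                                        RTE_MBUF_DEFAULT_BUF_SIZE,
>                                        rte_socket_id());
> }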
> 
> Regarding separate mempools -- I saw Stephen's response about those making
> CPU cache behaviour worse, not better. Makes sense and I won't argue. And
> yet, why not just try and make sure this indeed holds in this particular
> case? Also, since you're seeking single-producer behaviour, having separate
> per-port-pair mempools might allow creating such pools (again, see 'flags'
> at [1]), provided that API [1] is used for mempool creation. Please correct
> me in case I'm mistaken.
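>
> As a rough sketch (element count, cache size and names below are made up;
> the callbacks are just the standard pktmbuf ones), a single-producer /
> single-consumer pool per port pair could be created like this:
>
> #include <rte_mbuf.h>
> #include <rte_mempool.h>
>
> /* Sketch: one pool per port pair, created with single-producer put and
>  * single-consumer get, so the ring ops should become
>  * common_ring_sp_enqueue / common_ring_sc_dequeue instead of the mp/mc
>  * variants. Element count and cache size are placeholders. */
> static struct rte_mempool *
> make_sp_sc_pool(const char *name, int socket_id)
> {
>         unsigned int elt_size = sizeof(struct rte_mbuf) +
>                                 RTE_MBUF_DEFAULT_BUF_SIZE;
>
>         return rte_mempool_create(name, 8192, elt_size,
>                                   256,  /* per-lcore cache */
>                                   sizeof(struct rte_pktmbuf_pool_private),
>                                   rte_pktmbuf_pool_init, NULL,
>                                   rte_pktmbuf_init, NULL,
>                                   socket_id,
>                                   RTE_MEMPOOL_F_SP_PUT |
>                                   RTE_MEMPOOL_F_SC_GET);
> }
>
> If I'm not mistaken, rte_pktmbuf_pool_create_by_ops() with ops name
> "ring_sp_sc" should achieve the same without the explicit callbacks.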
> 
> Also, PMDs can support the "fast free" Tx offload. Please see [2] to check
> whether the application asks for this offload flag or not. It may be worth
> enabling.
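>
> For reference, a minimal sketch of requesting it (assuming the usual
> configure path; not taken from your application):
>
> #include <rte_ethdev.h>
>
> /* Sketch: enable mbuf fast free on Tx if the PMD reports support.
>  * Note the documented constraint: per Tx queue, all mbufs must come
>  * from the same mempool and have refcnt == 1. */
> static int
> configure_port(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
> {
>         struct rte_eth_conf conf = {0};
>         struct rte_eth_dev_info info;
>         int ret;
>
>         ret = rte_eth_dev_info_get(port_id, &info);
>         if (ret != 0)
>                 return ret;
>
>         if (info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
>                 conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;
>
>         return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
> }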
> 
> [1] 
> https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
> [2] 
> https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#a43f198c6b59d965130d56fd8f40ceac1
> 
> Thank you.
> 
> >
> > Is there a way to change DPDK to use single-producer enqueue?
> >
> > # Event 'cycles'
> > #
> > # Baseline  Delta Abs  Shared Object      Symbol
> > # ........  .........  .................  ..................................
> > #
> >     36.37%    +55.29%  test               [.] common_ring_mp_enqueue
> >     62.36%    -54.18%  test               [.] i40e_xmit_pkts
> >      1.10%     -0.94%  test               [.] dpdk_tx_thread
> >      0.01%     -0.01%  [kernel.kallsyms]  [k] native_sched_clock
> >                +0.00%  [kernel.kallsyms]  [k] fill_pmd
> >                +0.00%  [kernel.kallsyms]  [k] perf_sample_event_took
> >      0.00%     +0.00%  [kernel.kallsyms]  [k] __flush_smp_call_function_queue
> >      0.02%             [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
> >      0.02%             [kernel.kallsyms]  [k] native_irq_return_iret
> >      0.02%             [kernel.kallsyms]  [k] native_tss_update_io_bitmap
> >      0.01%             [kernel.kallsyms]  [k] ktime_get
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> >      0.01%             [kernel.kallsyms]  [k] __update_blocked_fair
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_events
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Lombardo, Ed
> > Sent: Sunday, July 6, 2025 1:45 PM
> > To: Stephen Hemminger <step...@networkplumber.org>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: RE: dpdk Tx falling short
> >
> > Hi Stephen,
> > If using DPDK rings comes with this penalty, then what should I use? Is
> > there an alternative to rings? We do not want to use shared memory and do
> > buffer copies.
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Stephen Hemminger <step...@networkplumber.org>
> > Sent: Sunday, July 6, 2025 12:03 PM
> > To: Lombardo, Ed <ed.lomba...@netscout.com>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: Re: dpdk Tx falling short
> >
> > On Sun, 6 Jul 2025 00:03:16 +0000
> > "Lombardo, Ed" <ed.lomba...@netscout.com> wrote:
> >  
> >> Hi Stephen,
> >> Here are my comments on the list of obvious causes of cache misses you
> >> mentioned.
> >>
> >> Obvious cache misses:
> >>  - passing packets to worker with ring - we use lots of rings to pass
> >>    mbuf pointers.  If I skip rte_eth_tx_burst() and just free the mbufs
> >>    in bulk, the tx ring does not fill up.
> >>  - using spinlocks (cost 16ns) - the driver does not use spinlocks, other
> >>    than what DPDK uses.
> >>  - fetching TSC - we don't do this; we let the Rx offload timestamp the
> >>    packets.
> >>  - syscalls? - no syscalls are done in our driver fast path.
> >>
> >> You mention "passing packets to worker with ring"; do you mean that using
> >> rings to pass mbuf pointers causes cache misses and should be avoided?
> >
> > Rings do cause data to be modified by one core and examined by another,
> > so they are a source of cache misses.
> >
> >  

How many packets is your application seeing per burst? Ideally it should be
getting chunks, not a single packet at a time, and then the driver can use
deferred free to put back bursts.

If you have a multi-stage pipeline, it helps to pass a burst to each stage
rather than looping over the burst in the outer loop. Imagine getting a burst
of 16 packets. If you pass an array down the pipeline, there is one call per
burst. If you process packets one at a time, it can mean 16 calls, and if the
pipeline exceeds the instruction cache it can mean 16 cache misses.

The point is that bursting is a big win for both the data and instruction
caches. If you really want to tune, investigate prefetching like VPP does.
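
For example (a rough sketch, not your application's code; the ring, port and
queue parameters are made up), the tx thread would pull a whole burst off the
ring and hand the array to rte_eth_tx_burst() in one call:

#include <rte_ring.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST 32

/* Sketch: burst-oriented tx loop. Each stage sees the whole array once,
 * so the per-packet call overhead and instruction-cache churn described
 * above are paid once per burst instead of once per packet. */
static void
tx_loop(struct rte_ring *tx_ring, uint16_t port, uint16_t queue)
{
        struct rte_mbuf *pkts[BURST];
        unsigned int n;
        uint16_t sent;

        for (;;) {
                n = rte_ring_dequeue_burst(tx_ring, (void **)pkts, BURST, NULL);
                if (n == 0)
                        continue;

                sent = rte_eth_tx_burst(port, queue, pkts, n);

                /* Drop whatever the NIC could not take this round
                 * (real code might retry instead). */
                if (sent < n)
                        rte_pktmbuf_free_bulk(&pkts[sent], n - sent);
        }
}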
