On Tue, 8 Jul 2025 01:49:44 +0400 (+04) Ivan Malov <ivan.ma...@arknetworks.am> wrote:
> Hi Ed,
>
> On Mon, 7 Jul 2025, Lombardo, Ed wrote:
>
> > Hi Stephen,
> > I ran a perf diff on two perf records, and it reveals the real problem with
> > the tx thread in transmitting packets.
> >
> > The comparison is traffic received on ifn3 and transmitted on ifn4 versus
> > traffic received on ifn3 and ifn5 and transmitted on ifn4 and ifn6.
> > When transmitting on one port the performance is better; however, when
> > transmitting on two ports the performance across the two drops dramatically.
> >
> > There is a 55.29% increase in the CPU time spent in common_ring_mp_enqueue
> > and 54.18% less time in i40e_xmit_pkts (was E810, tried x710).
> > common_ring_mp_enqueue is multi-producer; does the enqueue of mbuf pointers
> > passed in to rte_eth_tx_burst() have to be multi-producer?
>
> I may be wrong, but rte_eth_tx_burst(), as part of what is known as the "reap"
> process, should check for "done" Tx descriptors resulting from previous
> invocations and free (enqueue) the associated mbufs into their respective
> mempools. In your case, you say you only have a single mempool shared between
> the port pairs, which, as I understand, are served by concurrent threads, so
> it might be logical to use a multi-producer mempool in this case. Or am I
> missing something?
>
> The pktmbuf API for mempool allocation is a wrapper around the generic API and
> it might request multi-producer multi-consumer by default (see [1], 'flags').
> According to your original mempool monitor printout, the per-lcore cache size
> is 512. On the premise that separate lcores serve the two port pairs, and
> taking into account the burst size, it should be OK, yet you may want to play
> with the per-lcore cache size argument when creating the pool. Does it change
> anything?
>
> Regarding separate mempools -- I saw Stephen's response about those making CPU
> cache behaviour worse, not better. Makes sense and I won't argue. And yet, why
> not just try and make sure this indeed holds in this particular case? Also,
> since you're seeking single-producer behaviour, having separate per-port-pair
> mempools might allow you to create such pools (again, see 'flags' at [1]),
> provided that API [1] is used for mempool creation. Please correct me in case
> I'm mistaken.
>
> Also, PMDs can support the "fast free" Tx offload. Please see [2] to check
> whether the application asks for this offload flag or not. It may be worth
> enabling.
>
> [1] https://doc.dpdk.org/api-25.03/rte__mempool_8h.html#a0b64d611bc140a4d2a0c94911580efd5
> [2] https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#a43f198c6b59d965130d56fd8f40ceac1
>
> Thank you.
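To make the suggestion around [1] concrete, here is a minimal sketch of
creating a per-port-pair mbuf pool through the generic mempool API with
single-producer/single-consumer flags. The pool name, element count, cache
size and socket ID are placeholder assumptions, not the application's real
values:

    #include <rte_mbuf.h>
    #include <rte_mempool.h>

    static struct rte_mempool *
    create_sp_pktmbuf_pool(const char *name, unsigned int n, int socket_id)
    {
        /* Element = mbuf header + default data room (no app private area). */
        unsigned int elt_size = sizeof(struct rte_mbuf) + RTE_MBUF_DEFAULT_BUF_SIZE;

        /*
         * RTE_MEMPOOL_F_SP_PUT makes the free (Tx "reap") path single-producer
         * and RTE_MEMPOOL_F_SC_GET makes allocation single-consumer. This is
         * only safe if exactly one lcore allocates from and frees into this
         * pool, which is the point of having one pool per port pair.
         */
        return rte_mempool_create(name, n, elt_size,
                                  512,  /* per-lcore cache; worth experimenting with */
                                  sizeof(struct rte_pktmbuf_pool_private),
                                  rte_pktmbuf_pool_init, NULL,
                                  rte_pktmbuf_init, NULL,
                                  socket_id,
                                  RTE_MEMPOOL_F_SP_PUT | RTE_MEMPOOL_F_SC_GET);
    }

With these flags the pool's ring handler becomes the single-producer variant
(common_ring_sp_enqueue rather than common_ring_mp_enqueue), which is the hot
spot named in the perf figures above.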
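And, for [2], a sketch of requesting the "fast free" Tx offload at configure
time, guarded by the capability bit the PMD reports. The queue counts and the
otherwise zeroed port config are placeholders; in practice this would start
from the application's existing configuration:

    #include <rte_ethdev.h>

    static int
    configure_port_with_fast_free(uint16_t port_id, uint16_t nb_rxq, uint16_t nb_txq)
    {
        struct rte_eth_dev_info dev_info;
        struct rte_eth_conf conf = { 0 };
        int ret;

        ret = rte_eth_dev_info_get(port_id, &dev_info);
        if (ret != 0)
                return ret;

        /*
         * Fast free lets the PMD return transmitted mbufs to their mempool in
         * bulk, but it assumes every mbuf on the queue comes from one mempool
         * and has a reference count of 1.
         */
        if (dev_info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE)
                conf.txmode.offloads |= RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE;

        return rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &conf);
    }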
> > Is there a way to change dpdk to use single-producer?
> >
> > # Event 'cycles'
> > #
> > # Baseline  Delta Abs  Shared Object      Symbol
> > # ........  .........  .................  ......................................
> > #
> >     36.37%    +55.29%  test               [.] common_ring_mp_enqueue
> >     62.36%    -54.18%  test               [.] i40e_xmit_pkts
> >      1.10%     -0.94%  test               [.] dpdk_tx_thread
> >      0.01%     -0.01%  [kernel.kallsyms]  [k] native_sched_clock
> >                +0.00%  [kernel.kallsyms]  [k] fill_pmd
> >                +0.00%  [kernel.kallsyms]  [k] perf_sample_event_took
> >      0.00%     +0.00%  [kernel.kallsyms]  [k] __flush_smp_call_function_queue
> >      0.02%             [kernel.kallsyms]  [k] __intel_pmu_enable_all.constprop.0
> >      0.02%             [kernel.kallsyms]  [k] native_irq_return_iret
> >      0.02%             [kernel.kallsyms]  [k] native_tss_update_io_bitmap
> >      0.01%             [kernel.kallsyms]  [k] ktime_get
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_context
> >      0.01%             [kernel.kallsyms]  [k] __update_blocked_fair
> >      0.01%             [kernel.kallsyms]  [k] perf_adjust_freq_unthr_events
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Lombardo, Ed
> > Sent: Sunday, July 6, 2025 1:45 PM
> > To: Stephen Hemminger <step...@networkplumber.org>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: RE: dpdk Tx falling short
> >
> > Hi Stephen,
> > If using dpdk rings comes with this penalty, then what should I use? Is
> > there an alternative to rings? We do not want to use shared memory and do
> > buffer copies.
> >
> > Thanks,
> > Ed
> >
> > -----Original Message-----
> > From: Stephen Hemminger <step...@networkplumber.org>
> > Sent: Sunday, July 6, 2025 12:03 PM
> > To: Lombardo, Ed <ed.lomba...@netscout.com>
> > Cc: Ivan Malov <ivan.ma...@arknetworks.am>; users <users@dpdk.org>
> > Subject: Re: dpdk Tx falling short
> >
> > External Email: This message originated outside of NETSCOUT. Do not click
> > links or open attachments unless you recognize the sender and know the
> > content is safe.
> >
> > On Sun, 6 Jul 2025 00:03:16 +0000
> > "Lombardo, Ed" <ed.lomba...@netscout.com> wrote:
> >
> >> Hi Stephen,
> >> Here are comments on the list of obvious causes of cache misses you
> >> mentioned.
> >>
> >> Obvious cache misses:
> >> - passing packets to worker with ring - we use lots of rings to pass mbuf
> >>   pointers. If I skip the rte_eth_tx_burst() and just free the mbufs in
> >>   bulk, the tx ring does not fill up.
> >> - using spinlocks (cost 16ns) - The driver does not use spinlocks, other
> >>   than what dpdk uses.
> >> - fetching TSC - We don't do this; we let Rx offload timestamp packets.
> >> - syscalls? - No syscalls are done in our driver fast path.
> >>
> >> You mention "passing packets to worker with ring"; do you mean using rings
> >> to pass mbuf pointers causes cache misses and should be avoided?
> >
> > Rings do cause data to be modified by one core and examined by another, so
> > they are a cache miss.

How many packets is your application seeing per-burst? Ideally it should be
getting chunks, not a single packet at a time. And then the driver can use
deferred free to put back bursts. If you have a multi-stage pipeline, it helps
to pass a burst to each stage rather than looping over the burst in the outer
loop. Imagine getting a burst of 16 packets. If you pass an array down the
pipeline, then there is one call per burst. If you process packets one at a
time, it can mean 16 calls, and if the pipeline exceeds the instruction cache
it can mean 16 cache misses. The point is that bursting is a big win for both
the data and instruction cache. If you really want to tune, investigate
prefetching like VPP does.
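To illustrate that per-burst shape, a rough sketch follows, with hypothetical
stage functions and a placeholder burst size; it is not the application's
actual pipeline:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32   /* placeholder; match the application's Rx burst */

    /* Hypothetical stages standing in for the real pipeline: each one walks
     * the whole burst, so its code stays hot in the instruction cache. */
    static void stage_parse(struct rte_mbuf **pkts, uint16_t n) { (void)pkts; (void)n; }
    static void stage_classify(struct rte_mbuf **pkts, uint16_t n) { (void)pkts; (void)n; }

    static void
    forward_loop(uint16_t rx_port, uint16_t tx_port, uint16_t queue)
    {
        struct rte_mbuf *pkts[BURST_SIZE];

        for (;;) {
                uint16_t n = rte_eth_rx_burst(rx_port, queue, pkts, BURST_SIZE);
                if (n == 0)
                        continue;

                /* One call per stage per burst instead of one call per packet. */
                stage_parse(pkts, n);
                stage_classify(pkts, n);

                uint16_t sent = rte_eth_tx_burst(tx_port, queue, pkts, n);
                if (sent < n)   /* drop whatever the Tx ring could not take */
                        rte_pktmbuf_free_bulk(&pkts[sent], n - sent);
        }
    }

With 16 packets and, say, four stages that is 4 calls per burst instead of 64,
and each stage's instructions are pulled into the cache once per burst rather
than once per packet.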