On Thu, 19 May 2022 09:07:28 +0000 "Kinsella, Ray" <[email protected]> wrote:
> Better is a _very_ subjective.
>
> pcm-memory does one thing well.
> That whole suite is worth playing with though.
>
> Ray K
>
> From: Antonio Di Bacco <[email protected]>
> Sent: Thursday 19 May 2022 10:04
> To: Kinsella, Ray <[email protected]>
> Cc: Sanford, Robert <[email protected]>; [email protected]
> Subject: Re: DPDK performances surprise
>
> This tool seems awesome!!! Better than VTUNE?
>
> On Thu, May 19, 2022 at 10:29 AM Kinsella, Ray <[email protected]> wrote:
> I’d say that is likely yes.
>
> FYI - pcm-memory is a very handy tool for looking at memory traffic.
> https://github.com/opcm/pcm
>
> Thanks,
>
> Ray K
>
> From: Sanford, Robert <[email protected]>
> Sent: Wednesday 18 May 2022 17:53
> To: Antonio Di Bacco <[email protected]>; [email protected]
> Subject: Re: DPDK performances surprise
>
> My guess is that most of the packet data has a short life in the L3 cache
> (before being overwritten by newer packets), but is never flushed to memory.
>
> From: Antonio Di Bacco <[email protected]>
> Date: Wednesday, May 18, 2022 at 12:40 PM
> To: "[email protected]" <[email protected]>
> Subject: DPDK performances surprise
>
> I recently read a performance test where l2fwd was able to receive packets
> (8000B) from a 100 Gbps card, swap the L2 addresses, and send them back to
> the same port to be received by an Ethernet analyzer. The throughput achieved
> was close to 100 Gbps on a Xeon machine (Intel(R) Xeon(R) Platinum 8176 CPU
> @ 2.10GHz). This is the same processor I have, and I know that, if I try to
> write around 8000B to the attached DDR4 (2666 MT/s) on an allocated 1GB
> hugepage, I get a maximum throughput of around 20 GB/s.
> > Now, a 100 Gbps card can generate a flow of around 12 GB/s. These packets
> > have to be written to the DDR and then read back to swap the L2 addresses,
> > which leads to a cumulative bandwidth on the DDR of around 2x12 GB/s --
> > more than the 20 GB/s of available bandwidth on the DDR4.
> >
> > How can this be possible?

If you are comparing forwarding versus writing the whole packet:
 - in the forwarding case, swapping the MAC addresses is a single
   cache line read/write;
 - software writing the whole packet will end up walking through many
   cache lines and dirtying them.

Also, for that test, are you rewriting the same packet or walking a much
larger memory area?
