> -----Original Message-----
> From: Wu, Jingjing [mailto:jingjing.wu at intel.com]
> Sent: Monday, October 10, 2016 4:26 PM
> To: Yigit, Ferruh; Vladyslav Buslov; Zhang, Helin
> Cc: dev at dpdk.org
> Subject: RE: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> instructions for bulk rx
> 
> 
> 
> > -----Original Message-----
> > From: Yigit, Ferruh
> > Sent: Wednesday, September 14, 2016 9:25 PM
> > To: Vladyslav Buslov <vladyslav.buslov at harmonicinc.com>; Zhang, Helin
> > <helin.zhang at intel.com>; Wu, Jingjing <jingjing.wu at intel.com>
> > Cc: dev at dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH] net/i40e: add additional prefetch
> > instructions for bulk rx
> >
> > On 7/14/2016 6:27 PM, Vladyslav Buslov wrote:
> > > Added prefetch of first packet payload cacheline in i40e_rx_scan_hw_ring
> > > Added prefetch of second mbuf cacheline in i40e_rx_alloc_bufs
> > >
> > > Signed-off-by: Vladyslav Buslov <vladyslav.buslov at harmonicinc.com>
> > > ---
> > >  drivers/net/i40e/i40e_rxtx.c | 7 +++++--
> > >  1 file changed, 5 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > > b/drivers/net/i40e/i40e_rxtx.c index d3cfb98..e493fb4 100644
> > > --- a/drivers/net/i40e/i40e_rxtx.c
> > > +++ b/drivers/net/i40e/i40e_rxtx.c
> > > @@ -1003,6 +1003,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
> > >                 /* Translate descriptor info to mbuf parameters */
> > >                 for (j = 0; j < nb_dd; j++) {
> > >                         mb = rxep[j].mbuf;
> > > +                       rte_prefetch0(RTE_PTR_ADD(mb->buf_addr, RTE_PKTMBUF_HEADROOM));
> 
> Why prefetch here? I think if the application needs to deal with the packet,
> it is more suitable to put the prefetch in the application.
> 
> > >                         qword1 = rte_le_to_cpu_64(\
> > >                                 rxdp[j].wb.qword1.status_error_len);
> > >                         pkt_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
> > > @@ -1086,9 +1087,11 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
> > >
> > >         rxdp = &rxq->rx_ring[alloc_idx];
> > >         for (i = 0; i < rxq->rx_free_thresh; i++) {
> > > -               if (likely(i < (rxq->rx_free_thresh - 1)))
> > > +               if (likely(i < (rxq->rx_free_thresh - 1))) {
> > >                         /* Prefetch next mbuf */
> > > -                       rte_prefetch0(rxep[i + 1].mbuf);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline0);
> > > +                       rte_prefetch0(&rxep[i + 1].mbuf->cacheline1);
> > > +               }
> Agree with this change. However, when I tested it with testpmd in iofwd mode,
> no performance increase was observed, only a minor decrease.
> Can you share with us when it benefits performance in your scenario?
> 
> 
> Thanks
> Jingjing

Hello Jingjing,

Thanks for the code review.

My use case: we have a simple distributor thread that receives packets from a 
port and distributes them among worker threads according to a VLAN and MAC 
address hash (a hypothetical sketch follows below).
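
A minimal sketch of that distributor loop for context (the function, ring 
names and hashing details are illustrative assumptions, not our actual code). 
The first rte_pktmbuf_mtod() access is where the Ethernet header cache miss 
shows up unless the payload cache line was prefetched during rx:

#include <rte_ethdev.h>
#include <rte_ether.h>
#include <rte_hash_crc.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

#define BURST_SIZE 32

static void
distribute_burst(uint8_t port_id, struct rte_ring **worker_rings,
                 unsigned int nb_workers)
{
        struct rte_mbuf *pkts[BURST_SIZE];
        uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, pkts, BURST_SIZE);
        uint16_t i;

        for (i = 0; i < nb_rx; i++) {
                /* First touch of packet data: this is the cache miss seen
                 * in the profiler unless the payload line was prefetched. */
                struct ether_hdr *eth = rte_pktmbuf_mtod(pkts[i],
                                                         struct ether_hdr *);

                /* The real code also folds the VLAN tag into the hash. */
                uint32_t hash = rte_hash_crc(&eth->d_addr,
                                             sizeof(eth->d_addr), 0);
                unsigned int worker = hash % nb_workers;

                if (rte_ring_enqueue(worker_rings[worker], pkts[i]) < 0)
                        rte_pktmbuf_free(pkts[i]);
        }
}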

While working on performance optimization we determined that most of this 
thread's CPU usage is spent inside DPDK.
As an optimization we decided to switch to the rx burst alloc function; 
however, that caused additional performance degradation compared to scatter 
rx mode.
In the profiler the two major culprits were:
  1. Access to the packet's Ethernet header in application code (cache miss).
  2. Setting the mbuf's next field to NULL in the DPDK i40e_rx_alloc_bufs 
code (this field is in the second mbuf cache line, which was not prefetched; 
see the layout check below).
After applying my fixes, performance improved compared to scatter rx mode.
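
As a small illustrative check of that layout (not part of the patch, and 
assuming the two-cache-line rte_mbuf of the DPDK version we run), this prints 
where the next field sits relative to the cacheline1 marker used in the patch:

#include <stddef.h>
#include <stdio.h>
#include <rte_mbuf.h>

int
main(void)
{
        /* 'next' lives at or after the cacheline1 marker, i.e. in the second
         * mbuf cache line, which i40e_rx_alloc_bufs writes when it sets
         * mb->next = NULL. */
        printf("sizeof(struct rte_mbuf):               %zu\n",
               sizeof(struct rte_mbuf));
        printf("offsetof(struct rte_mbuf, cacheline1): %zu\n",
               offsetof(struct rte_mbuf, cacheline1));
        printf("offsetof(struct rte_mbuf, next):       %zu\n",
               offsetof(struct rte_mbuf, next));
        return 0;
}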

I assumed that the prefetch of the first cache line of packet data belongs in 
DPDK because it is already done in scatter rx mode (in i40e_recv_scattered_pkts).
It could be moved to the application side, but IMO it is better to be 
consistent across all rx modes.

Regards,
Vladyslav
