-----Original Message-----
> Date: Mon, 18 Dec 2017 02:51:19 +0000
> From: Herbert Guan <herbert.g...@arm.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> CC: Jianbo Liu <jianbo....@arm.com>, "dev@dpdk.org" <dev@dpdk.org>
> Subject: RE: [PATCH] arch/arm: optimization for memcpy on AArch64
> 
> Hi Jerin,

Hi Herbert,

> > >
> > > Here the value of '64' is not the cache line size.  But because
> > > prefetch itself costs some cycles, it's not worthwhile to prefetch
> > > for a small copy (e.g. < 64 bytes).  Per my test, prefetching for
> > > small copies actually lowers the performance.
> >
> > But I think '64' is a function of the cache line size, i.e. is there any
> > reason why we haven't used the
> > rte_memcpy_ge16_lt128()/rte_memcpy_ge128 pair instead of the
> > rte_memcpy_ge16_lt64/rte_memcpy_ge64 pair?
> > I think if you can add one more conditional compilation to choose between
> > rte_memcpy_ge16_lt128()/rte_memcpy_ge128 and
> > rte_memcpy_ge16_lt64/rte_memcpy_ge64,
> > that will address all the arm64 variants supported in current DPDK.
> >
> 
> The logic for the 128B cache line is implemented as you've suggested, and
> has been added in the V3 patch.
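
For readers following the thread, the conditional compilation discussed
above could look roughly like the sketch below. It is not the code from
the V3 patch. RTE_CACHE_LINE_SIZE is DPDK's build-time cache line macro
(a fallback of 64 is defined here only so the sketch compiles on its own);
every sketch_* and SKETCH_* name is hypothetical, and plain memcpy() stands
in for the real NEON copy routines.

```c
#include <stddef.h>
#include <string.h>

/* Fallback so this sketch builds outside a DPDK tree (assumption). */
#ifndef RTE_CACHE_LINE_SIZE
#define RTE_CACHE_LINE_SIZE 64
#endif

/* Pick the small/large split at compile time: 128 on 128B cache line
 * targets, 64 otherwise. SKETCH_LARGE_COPY_THRESHOLD is a hypothetical name. */
#if RTE_CACHE_LINE_SIZE >= 128
#define SKETCH_LARGE_COPY_THRESHOLD 128
#else
#define SKETCH_LARGE_COPY_THRESHOLD 64
#endif

/* Hypothetical stand-in for the "less than threshold" routine:
 * no prefetch, because the prefetch cost dominates on small copies. */
static inline void sketch_copy_small(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
}

/* Hypothetical stand-in for the "at least threshold" routine:
 * prefetch source and destination once, then copy. */
static inline void sketch_copy_large(void *dst, const void *src, size_t n)
{
	__builtin_prefetch(src, 0, 0); /* read, no temporal locality */
	__builtin_prefetch(dst, 1, 0); /* write, no temporal locality */
	memcpy(dst, src, n);
}

/* Dispatch on the compile-time threshold. */
static inline void *sketch_memcpy(void *dst, const void *src, size_t n)
{
	if (n < SKETCH_LARGE_COPY_THRESHOLD)
		sketch_copy_small(dst, src, n);
	else
		sketch_copy_large(dst, src, n);
	return dst;
}
```

Keeping the threshold in one macro means a 128B cache line target only
changes the dispatch point; whether that value is also the best one for a
given core would still have to be measured.
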
> 
> > >
> > > On the other hand, I can find only one aarch64 machine with a 128B
> > > cache line here, and some optimizations do exist that are specific to
> > > this machine.  I'm not sure whether they'll be beneficial for other
> > > 128B cache line machines.  I prefer not to put this in this patch but
> > > in a later standalone patch specific to 128B cache line machines.
> > >
> > > > > +__builtin_prefetch(src, 0, 0);  // rte_prefetch_non_temporal(src);
> > > > > +__builtin_prefetch(dst, 1, 0);  //  * unchanged *
> >
> > # Why is __builtin_prefetch used only once? Why not invoke it in the
> > rte_memcpy_ge64 loop?
> > # Does it make sense to prefetch src + 64/128 * n?
> 
> Prefetch is only necessary once at the beginning.  The CPU will do automatic
> incremental prefetch once the continuous memory access starts.  It's not
> necessary to prefetch in the loop.  In fact, doing it in the loop will
> break the CPU's HW prefetch and degrade the performance.

Yes. But the aarch64 specification does not mandate that every implementation
have a HW prefetch mechanism (i.e. it is IMPLEMENTATION DEFINED).
I think you have provided a good start for the memcpy implementation, and we
can fine-tune it _later_ based on different micro-architectures.
Your v3 looks good.
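
As an illustration of what such later tuning might look like, here is a
minimal sketch, not taken from the patch. It prefetches once before the
loop, as described above, and adds an in-loop software prefetch only when
a build-time knob is set for cores without a HW prefetcher. The knob
SKETCH_SW_PREFETCH_IN_LOOP, the block size SKETCH_BLOCK and the function
name are all hypothetical, and memcpy() stands in for the real per-block
copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical knob: set to 1 for a micro-architecture that has no HW
 * prefetcher; leave at 0 to rely on HW prefetch after the first access. */
#ifndef SKETCH_SW_PREFETCH_IN_LOOP
#define SKETCH_SW_PREFETCH_IN_LOOP 0
#endif

#define SKETCH_BLOCK 64 /* bytes copied per loop iteration (assumption) */

static void *sketch_copy_ge64(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Prefetch once at the start, as in the quoted patch hunk. */
	__builtin_prefetch(s, 0, 0);
	__builtin_prefetch(d, 1, 0);

	while (n >= SKETCH_BLOCK) {
#if SKETCH_SW_PREFETCH_IN_LOOP
		/* Optional: hint the next source block. The address is at
		 * most one past the end, and a prefetch does not fault. */
		__builtin_prefetch(s + SKETCH_BLOCK, 0, 0);
#endif
		memcpy(d, s, SKETCH_BLOCK);
		d += SKETCH_BLOCK;
		s += SKETCH_BLOCK;
		n -= SKETCH_BLOCK;
	}

	/* Copy the remaining tail, if any. */
	if (n > 0)
		memcpy(d, s, n);

	return dst;
}
```

With the knob left at 0 the loop body carries no prefetch at all, which
matches the behaviour argued for above; whether enabling it helps on a
given core is something only a benchmark on that core can tell.
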


> IMPORTANT NOTICE: The contents of this email and any attachments are 
> confidential and may also be privileged. If you are not the intended 
> recipient, please notify the sender immediately and do not disclose the 
> contents to any other person, use it for any purpose, or store or copy the 
> information in any medium. Thank you.

Please remove such notice from public mailing list.
