-----Original Message-----
> Date: Mon, 18 Dec 2017 02:51:19 +0000
> From: Herbert Guan <herbert.g...@arm.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> CC: Jianbo Liu <jianbo....@arm.com>, "dev@dpdk.org" <dev@dpdk.org>
> Subject: RE: [PATCH] arch/arm: optimization for memcpy on AArch64
>
> Hi Jerin,

Hi Herbert,

> > > Here the value of '64' is not the cache line size. But for the reason
> > > that prefetch itself will cost some cycles, it's not worthwhile to do
> > > prefetch for small size (e.g. < 64 bytes) copy. Per my test,
> > > prefetching for small size copy will actually lower the performance.
> >
> > But I think '64' is a function of cache size, i.e. any reason why we
> > haven't used the rte_memcpy_ge16_lt128()/rte_memcpy_ge128 pair instead
> > of the rte_memcpy_ge16_lt64/rte_memcpy_ge64 pair?
> > I think if you can add one more conditional compilation to choose
> > between rte_memcpy_ge16_lt128()/rte_memcpy_ge128 and
> > rte_memcpy_ge16_lt64/rte_memcpy_ge64, it will address all the arm64
> > variants supported in current DPDK.
>
> The logic for 128B cache is implemented as you've suggested, and has been
> added in V3 patch.
>
> > > On the other hand, I can only find one 128B cache line aarch64 machine
> > > here. And there do exist some specific optimizations for this machine.
> > > Not sure if it'll be beneficial for other 128B cache machines or not.
> > > I prefer not to put it in this patch but in a later standalone
> > > specific patch for 128B cache machines.
> >
> > > > > +__builtin_prefetch(src, 0, 0); // rte_prefetch_non_temporal(src);
> > > > > +__builtin_prefetch(dst, 1, 0); // * unchanged *
> >
> > # Why is __builtin_prefetch used only once? Why not invoke it in the
> > rte_memcpy_ge64 loop?
> > # Does it make sense to prefetch src + 64/128 * n?
>
> Prefetch is only necessary once at the beginning. The CPU will do auto
> incremental prefetch when the continuous memory access starts. It's not
> necessary to do prefetch in the loop. In fact, doing it in the loop will
> actually break the CPU's HW prefetch and degrade the performance.

Yes. But the aarch64 specification does not mandate that every
implementation have a HW prefetch mechanism (i.e. it is IMPLEMENTATION
DEFINED).

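To make the two points above concrete, here is a minimal sketch, not the code from the patch. It assumes a build-time cache line size and picks the 64B or 128B threshold by conditional compilation. Copies below the threshold skip prefetch entirely. Larger copies prefetch once up front and leave the rest to whatever HW prefetcher the implementation has. The names SKETCH_CACHE_LINE_SIZE, SKETCH_COPY_THRESHOLD and sketch_memcpy() are hypothetical, and plain memcpy() stands in for the real block-copy routines. __builtin_prefetch() is the GCC/Clang builtin quoted in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the cache line size is supplied by the build; default to 64. */
#ifndef SKETCH_CACHE_LINE_SIZE
#define SKETCH_CACHE_LINE_SIZE 64
#endif

/* Choose the small/large copy threshold at compile time. */
#if SKETCH_CACHE_LINE_SIZE == 128
#define SKETCH_COPY_THRESHOLD 128
#else
#define SKETCH_COPY_THRESHOLD 64
#endif

/* Hypothetical helper for illustration only; buffers must not overlap. */
static inline void *
sketch_memcpy(void *dst, const void *src, size_t n)
{
	uint8_t *d = (uint8_t *)dst;
	const uint8_t *s = (const uint8_t *)src;

	assert(d != NULL && s != NULL);

	/* Small copy: the prefetch would cost more than it saves. */
	if (n < SKETCH_COPY_THRESHOLD) {
		memcpy(d, s, n);
		return dst;
	}

	/* Large copy: prefetch once, before the sequential access starts. */
	__builtin_prefetch(s, 0, 0); /* read, no temporal locality */
	__builtin_prefetch(d, 1, 0); /* write, no temporal locality */

	/* Copy whole blocks; no prefetch is issued inside the loop. */
	while (n >= SKETCH_COPY_THRESHOLD) {
		memcpy(d, s, SKETCH_COPY_THRESHOLD);
		d += SKETCH_COPY_THRESHOLD;
		s += SKETCH_COPY_THRESHOLD;
		n -= SKETCH_COPY_THRESHOLD;
	}

	/* Copy the remaining tail, if any. */
	if (n > 0)
		memcpy(d, s, n);

	return dst;
}
```

On a core without a HW prefetcher, the loop is where a per-block prefetch of src + SKETCH_COPY_THRESHOLD * k would be added, which is the kind of per-micro-architecture tuning that can follow in a later patch.
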
I think you have provided a good start for the memcpy implementation, and we
can fine-tune it later based on the different micro-architectures. Your v3
looks good.

> IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.

Please remove such notices from the public mailing list.