21/10/2021 21:03, Song, Keesang: > From: Thomas Monjalon <tho...@monjalon.net> > > 21/10/2021 20:12, Song, Keesang: > > > From: Ananyev, Konstantin <konstantin.anan...@intel.com> > > > > 21/10/2021 19:10, Song, Keesang: > > > > > 19/10/2021 17:35, Stephen Hemminger: > > > > > > From: Thomas Monjalon <tho...@monjalon.net> > > > > > > > 19/10/2021 12:47, Aman Kumar: > > > > > > > > This patch provides rte_memcpy* calls optimized for AMD EPYC > > > > > > > > platforms. Use config/x86/x86_amd_epyc_linux_gcc as cross-file > > > > > > > > with meson to build dpdk for AMD EPYC platforms. > > > > > > > > > > > > > > Please split in 2 patches: platform & memcpy. > > > > > > > > > > > > > > What optimization is specific to EPYC? > > > > > > > > > > > > > > I dislike the asm code below. > > > > > > > What is AMD specific inside? > > > > > > > Can it use compiler intrinsics as it is done elsewhere? > > > > > > > > > > > > And why is this not done by Gcc? > > > > > > > > > > I hope this can make some explanation to your question. > > > > > We(AMD Linux library support team) have implemented the custom > > > > > tailored memcpy solution which is a close match with DPDK use case > > > > > requirements like the below. > > > > > 1) Min 64B length data packet with cache aligned > > > > > Source and Destination. > > > > > 2) Non-Temporal load and temporal store for cache aligned > > > > > source for both RX and TX paths. > > > > > Could not implement the non-temporal store for TX_PATH, > > > > > as non-Temporal load/stores works only with 32B aligned addresses > > > > > for AVX2 > > > > > 3) This solution works for all AVX2 supported AMD machines. > > > > > > > > > > Internally we have completed the integrity testing and benchmarking > > > > > of the solution and found gains of 8.4% to 14.5% specifically on > > > > > Milan CPU(3rd Gen of EPYC Processor) > > > > > > > > It still not clear to me why it has to be written in assembler. > > > > Why similar stuff can't be written in C with instincts, as rest of > > > > rte_memcpy.h does? > > > > > > The current memcpy implementation in Glibc is based out of assembly > > > coding. > > > Although memcpy could have been implemented with intrinsic, > > > but since our AMD library developers are working on the Glibc > > > functions, they have provided a tailored implementation based > > > out of inline assembly coding. > > > > Please convert it to C code, thanks. > > I've already asked our AMD tools team, but they're saying > they are not really familiar with C code implementation. > We need your approval for now since we really need to get > this patch submitted to 21.11 LTS.
Not sure it is urgent given that v2 came after the planned -rc1 date, after 6 weeks of silence. About the approval, there are already 3 technical board members (Konstantin, Stephen and me) objecting against this patch. Not being familiar with C code when working on CPU optimization in 2021 is a strange argument. In general, I don't really understand why we should maintain memcpy functions in DPDK instead of relying on libc optimizations. Having big asm code to maintain and debug is not helping. I think this case shows that AMD needs to become more familiar with DPDK schedule and expectations. I would encourage you to contribute more in the project, so such misunderstanding won't happen in future. Hope that's all understandable PS: discussion is more readable with replies below