Hello,

What's the status of this feature?

10/10/2022 08:46, Morten Brørup:
> This patch provides a function for memory copy using non-temporal store,
> load or both, controlled by flags passed to the function.
>
> Applications sometimes copy data to another memory location, which is only
> used much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
>
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
>
> The purpose of the function is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput can be improved by further optimization, I do not
> have time to do it now.
>
> The functional tests and performance tests for memory copy have been
> expanded to include non-temporal copying.
>
> A non-temporal version of the mbuf library's function to create a full
> copy of a given packet mbuf is provided.
>
> The packet capture and packet dump libraries have been updated to use
> non-temporal memory copy of the packets.
>
> Implementation notes:
>
> Implementations for non-x86 architectures can be provided by anyone at a
> later time. I am not going to do it.
>
> x86 non-temporal load instructions must be 16 byte aligned [1], and
> non-temporal store instructions must be 4, 8 or 16 byte aligned [2].
>
> ARM non-temporal load and store instructions seem to require 4 byte
> alignment [3].
>
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_load
> [2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/
> index.html#text=_mm_stream_si
> [3] https://developer.arm.com/documentation/100076/0100/
> A64-Instruction-Set-Reference/A64-Floating-point-Instructions/
> LDNP--SIMD-and-FP-
>
> This patch is a major rewrite from the RFC v3, so no version log comparing
> to the RFC is provided.
>
> v4
> * Also ignore the warning for clang in the workaround for
>   _mm_stream_load_si128() missing const in the parameter.
> * Add missing C linkage specifier in rte_memcpy.h.
>
> v3
> * _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
>   use it on 64-bit x86 architecture.
> * CLANG warns that _mm_stream_load_si128_const() and
>   rte_memcpy_nt_15_or_less_s16a() are not public,
>   so remove __rte_internal from them. It also affects the documentation
>   for the functions, so the fix can't be limited to CLANG.
> * Use __rte_experimental instead of __rte_internal.
> * Replace <n> with nnn in function documentation; it doesn't look like
>   HTML.
> * Slightly modify the workaround for _mm_stream_load_si128() missing const
>   in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
>   #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
>   #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
> * Fixed one coding style issue missed in v2.
>
> v2
> * The last 16 byte block of data, incl. any trailing bytes, were not
>   copied from the source memory area in rte_memcpy_nt_buf().
> * Fix many coding style issues.
> * Add some missing header files.
> * Fix build time warning for non-x86 architectures by using a different
>   method to mark the flags parameter unused.
> * CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
>   so omit it when using CLANG.
>
> Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
> ---
>  app/test/test_memcpy.c               |   65 +-
>  app/test/test_memcpy_perf.c          |  187 ++--
>  lib/eal/include/generic/rte_memcpy.h |  127 +++
>  lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
>  lib/mbuf/rte_mbuf.c                  |   77 ++
>  lib/mbuf/rte_mbuf.h                  |   32 +
>  lib/mbuf/version.map                 |    1 +
>  lib/pcapng/rte_pcapng.c              |    3 +-
>  lib/pdump/rte_pdump.c                |    6 +-
>  9 files changed, 1645 insertions(+), 91 deletions(-)
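
For reference, the store-side alignment constraint from the quoted
implementation notes can be illustrated with a minimal sketch. This is not
the code from the patch: copy_nt_store_sketch() is a hypothetical name, it
only covers the "non-temporal store, regular load" case, it assumes SSE2
(_mm_stream_si128() needs a 16 byte aligned destination), and it falls back
to memcpy() on other architectures or for an unaligned destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, NOT the API added by the patch.
 * Copies len bytes using non-temporal stores where possible, so the
 * destination data does not pollute the data cache.
 */
static void
copy_nt_store_sketch(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

#if defined(__SSE2__)
	/* _mm_stream_si128() requires a 16 byte aligned destination. */
	if (((uintptr_t)d & 15) == 0) {
		while (len >= 16) {
			/* Regular (temporal) unaligned load from the source. */
			__m128i v = _mm_loadu_si128((const __m128i *)s);

			/* Non-temporal store, bypassing the cache. */
			_mm_stream_si128((__m128i *)d, v);
			d += 16;
			s += 16;
			len -= 16;
		}
		/* Order the streaming stores before any later stores. */
		_mm_sfence();
	}
#endif
	/* Trailing bytes, unaligned destinations and non-x86 builds. */
	if (len > 0)
		memcpy(d, s, len);
}
```

A caller would use it like memcpy(), e.g.
copy_nt_store_sketch(capture_buf, pkt_data, pkt_len), where the buffer names
are placeholders. Handling unaligned destinations and non-temporal loads
with dedicated code paths, as the patch describes, is left out to keep the
sketch short.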