On 2023-07-31 14:25, Morten Brørup wrote:
From: Thomas Monjalon [mailto:tho...@monjalon.net]
Sent: Monday, 31 July 2023 14.14

Hello,

What's the status of this feature?

I haven't given up on upstreaming this feature, but there doesn't seem to be 
much demand for it, so working on it has low priority.


This would definitely be a useful addition to the EAL, IMO.

It's also a case where it's difficult to provide a generic and portable solution with both good performance and reasonable semantics. The upside is that you seem to have come pretty far already.



10/10/2022 08:46, Morten Brørup:
This patch provides a function for memory copy using non-temporal store,
load or both, controlled by flags passed to the function.
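
For readers who have not seen the patch itself, the idea can be sketched
roughly as below. This is only a minimal illustration, not the code from
the patch: the flag name SKETCH_NT_DST and the function
sketch_memcpy_flags() are hypothetical, only non-temporal stores are
shown, and the destination is assumed to be 16 byte aligned when the flag
is set. On builds without SSE2 the sketch simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical flag; the patch defines its own flags in rte_memcpy.h. */
#define SKETCH_NT_DST 0x1u /* hint: the destination is not read again soon */

/*
 * Copy len bytes from src to dst.
 * Assumption: dst is 16 byte aligned when SKETCH_NT_DST is set.
 */
static void *
sketch_memcpy_flags(void *dst, const void *src, size_t len, unsigned int flags)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

#if defined(__SSE2__)
	if (flags & SKETCH_NT_DST) {
		assert(((uintptr_t)d & 15) == 0);
		/* Stream whole 16 byte blocks past the data cache. */
		while (len >= 16) {
			__m128i v = _mm_loadu_si128((const __m128i *)s);

			_mm_stream_si128((__m128i *)d, v);
			d += 16;
			s += 16;
			len -= 16;
		}
		/* Order the streamed stores before any later stores. */
		_mm_sfence();
	}
#else
	(void)flags; /* no non-temporal path available in this build */
#endif
	/* Remaining bytes (or everything, without the flag) use memcpy(). */
	memcpy(d, s, len);
	return dst;
}
```

A caller copying into a capture buffer would pass SKETCH_NT_DST, while a
caller that reads the copy right away would pass 0 and get a plain
memcpy().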

Applications sometimes copy data to another memory location, which is only
used much later.
In this case, it is inefficient to pollute the data cache with the copied
data.

An example use case (originating from a real life application):
Copying filtered packets, or the first part of them, into a capture buffer
for offline analysis.

The purpose of the function is to achieve a performance gain by not
polluting the cache when copying data.
Although the throughput can be improved by further optimization, I do not
have time to do it now.

The functional tests and performance tests for memory copy have been
expanded to include non-temporal copying.

A non-temporal version of the mbuf library's function to create a full
copy of a given packet mbuf is provided.

The packet capture and packet dump libraries have been updated to use
non-temporal memory copy of the packets.

Implementation notes:

Implementations for non-x86 architectures can be provided by anyone at a
later time. I am not going to do it.

x86 non-temporal load instructions must be 16 byte aligned [1], and
non-temporal store instructions must be 4, 8 or 16 byte aligned [2].

ARM non-temporal load and store instructions seem to require 4 byte
alignment [3].

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si
[3] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
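
To make the alignment constraint concrete, here is a rough sketch of one
common way to deal with an unaligned destination: copy a short head with
ordinary stores until the destination reaches a 16 byte boundary, stream
the aligned 16 byte blocks, then copy the tail normally. The helper names
sketch_head_len() and sketch_copy_nt_store() are made up for this
example, the loads are ordinary unaligned loads, and this is not claimed
to be how the patch does it. Without SSE2 the sketch degrades to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Bytes needed to advance p to the next 16 byte boundary (0..15). */
static size_t
sketch_head_len(const void *p)
{
	return (size_t)(-(uintptr_t)p & 15);
}

/* Copy len bytes, using non-temporal stores for the aligned middle part. */
static void
sketch_copy_nt_store(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head = sketch_head_len(d);

	/* Head: ordinary copy until dst is 16 byte aligned. */
	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

#if defined(__SSE2__)
	/* Middle: dst is aligned now, src may still be unaligned. */
	while (len >= 16) {
		_mm_stream_si128((__m128i *)d,
				_mm_loadu_si128((const __m128i *)s));
		d += 16;
		s += 16;
		len -= 16;
	}
	_mm_sfence();
#endif
	/* Tail: fewer than 16 bytes remain when the SSE2 path was taken. */
	memcpy(d, s, len);
}
```

The tail copy at the end is the part that is easy to get wrong: the
trailing bytes after the last full block still have to be copied.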

This patch is a major rewrite from the RFC v3, so no version log comparing
to the RFC is provided.

v4
* Also ignore the warning for clang in the workaround for
   _mm_stream_load_si128() missing const in the parameter.
* Add missing C linkage specifier in rte_memcpy.h.

v3
* _mm_stream_si64() is not supported on 32-bit x86 architecture, so only
   use it on 64-bit x86 architecture.
* CLANG warns that _mm_stream_load_si128_const() and
   rte_memcpy_nt_15_or_less_s16a() are not public,
   so remove __rte_internal from them. It also affects the documentation
   for the functions, so the fix can't be limited to CLANG.
* Use __rte_experimental instead of __rte_internal.
* Replace <n> with nnn in function documentation; it doesn't look like
   HTML.
* Slightly modify the workaround for _mm_stream_load_si128() missing const
   in the parameter; the ancient GCC 4.8.5 in RHEL7 doesn't understand
   #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers", so use
   #pragma GCC diagnostic ignored "-Wcast-qual" instead. I hope that works.
* Fixed one coding style issue missed in v2.
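
The const workaround mentioned in the v3 and v4 notes might look roughly
like the sketch below. It is only an illustration of the technique
(casting away const inside a region where -Wcast-qual is ignored), not
the wrapper from the patch: the type sketch_block16 and the function
sketch_stream_load_const() are hypothetical names, the source is assumed
to be 16 byte aligned, and builds without SSE4.1 fall back to memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Plain 16 byte container, so the sketch also builds without SSE4.1. */
typedef struct {
	uint8_t b[16];
} sketch_block16;

/*
 * Some compiler headers declare _mm_stream_load_si128() with a non-const
 * pointer parameter, so a const source has to be cast. Suppress only the
 * cast-qual warning, and only around this function.
 */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wcast-qual"
static inline sketch_block16
sketch_stream_load_const(const void *src)
{
	sketch_block16 out;

#if defined(__SSE4_1__)
	/* Assumption: src is 16 byte aligned, as the instruction requires. */
	__m128i v = _mm_stream_load_si128((__m128i *)src);

	_mm_storeu_si128((__m128i *)out.b, v);
#else
	memcpy(out.b, src, sizeof(out.b));
#endif
	return out;
}
#pragma GCC diagnostic pop
```

Keeping the pragma push/pop tight around the one function means the rest
of the header is still checked with -Wcast-qual.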

v2
* Fix the last 16 byte block of data, incl. any trailing bytes, not being
   copied from the source memory area in rte_memcpy_nt_buf().
* Fix many coding style issues.
* Add some missing header files.
* Fix build time warning for non-x86 architectures by using a different
   method to mark the flags parameter unused.
* CLANG doesn't understand RTE_BUILD_BUG_ON(!__builtin_constant_p(flags)),
   so omit it when using CLANG.

Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
---
  app/test/test_memcpy.c               |   65 +-
  app/test/test_memcpy_perf.c          |  187 ++--
  lib/eal/include/generic/rte_memcpy.h |  127 +++
  lib/eal/x86/include/rte_memcpy.h     | 1238 ++++++++++++++++++++++++++
  lib/mbuf/rte_mbuf.c                  |   77 ++
  lib/mbuf/rte_mbuf.h                  |   32 +
  lib/mbuf/version.map                 |    1 +
  lib/pcapng/rte_pcapng.c              |    3 +-
  lib/pdump/rte_pdump.c                |    6 +-
  9 files changed, 1645 insertions(+), 91 deletions(-)



