Jiaqi-YP7 opened a new pull request, #18855:
URL: https://github.com/apache/nuttx/pull/18855

   Add dedicated NEON implementations for mutually aligned medium and long 
memcpy copies when building with __ARM_NEON__. These paths use NEON 
multi-register loads and stores while preserving the existing VFP 
implementation for non-NEON VFP configurations.
   
   NEON builds also define USE_VFP, so select the NEON implementation 
explicitly before falling back to VFP. Apply the same aligned-copy optimization 
to the armv7-a, armv7-r, and armv8-r implementations.
   
   
   ## Summary
   
   This change adds dedicated NEON implementations for mutually aligned medium 
and long `memcpy` copies when building with `__ARM_NEON__`.
   
   The new NEON paths use 64-byte multi-register loads and stores with the 
existing destination alignment hint. The existing VFP implementation is 
preserved and remains the fallback for VFP-enabled builds without NEON.
   
   NEON builds also define `USE_VFP`, so the implementation now checks for 
NEON first and falls back to VFP only when NEON is not enabled. NEON-capable 
targets get the new path, while VFP-only builds keep the existing VFP path 
unchanged.
   
   The same update is applied to the `armv7-a`, `armv7-r`, and `armv8-r` 
implementations, which share the same `USE_NEON`/`USE_VFP` structure.
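
The selection order described above can be sketched in C preprocessor terms. This is only an illustration, not the code in `arch_memcpy.S`: the conditions used to set `USE_NEON` and `USE_VFP`, the `copy_path` enum, and both function names are assumptions made for the example.

```c
#include <assert.h>

/* Assumed feature setup: a NEON build defines both macros, as the summary
 * states. The VFP-only condition below is a guess for illustration.
 */
#if defined(__ARM_NEON__)
#  define USE_NEON 1
#  define USE_VFP  1
#elif defined(__VFP_FP__) && !defined(__SOFTFP__)
#  define USE_VFP  1
#endif

/* Hypothetical names for the three possible aligned-copy paths. */
enum copy_path { PATH_INTEGER, PATH_VFP, PATH_NEON };

static enum copy_path aligned_copy_path(void)
{
#if defined(USE_NEON)
  return PATH_NEON;     /* Tested first, because USE_VFP is also set here */
#elif defined(USE_VFP)
  return PATH_VFP;      /* VFP-only builds keep the existing path */
#else
  return PATH_INTEGER;  /* Neither NEON nor VFP is available */
#endif
}

/* Checks that the chosen path matches the macros for this build. */
static void selection_selfcheck(void)
{
#if defined(USE_NEON)
  assert(aligned_copy_path() == PATH_NEON);
#elif defined(USE_VFP)
  assert(aligned_copy_path() == PATH_VFP);
#else
  assert(aligned_copy_path() == PATH_INTEGER);
#endif
}
```

If the `USE_VFP` branch were tested first, a NEON build would never reach the NEON branch, which is why the order matters.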
   
   ## Impact
   
   This affects only ARM builds that enable `__ARM_NEON__` and use the 
architecture-specific `memcpy` implementation.
   
   The functional behavior of `memcpy` is unchanged. The intended impact is 
improved throughput for mutually aligned medium and long copies on NEON-capable 
ARM targets while keeping non-NEON VFP builds on the existing VFP path.
   
   M-profile implementations are not changed. `armv7-m` does not use NEON, and 
`armv8-m` has a separate MVE path.
   
   ## Testing
   
   **Platform:** NuttX / ARMv8-R, Cortex-R52, 1250 MHz
   **Compiler:** arm-none-eabi-gcc 12.2, `-O2`, NEON enabled
   
   Test design:
   
   We built and ran the same `memcpy_bench` application before and after this 
change, using the same board, configuration, toolchain, and runtime 
environment. The only intended difference between the two test images is the 
`arch_memcpy.S` change in this PR.
   
   The benchmark uses two DDR buffers, each 3 MB plus offset headroom and 
aligned to a 64-byte boundary. Each case runs 8 warm-up iterations followed by 
32 measured iterations, using the platform performance counter to report best 
time, average time, throughput in MB/s, and cycles per byte.
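
A minimal sketch of the measurement loop described above, assuming a POSIX host. The source of `memcpy_bench` is not part of this PR, so `bench_result`, `now_ns`, and `bench_memcpy` are hypothetical names, and `clock_gettime` stands in for the platform performance counter.

```c
#define _POSIX_C_SOURCE 199309L

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

struct bench_result
{
  uint64_t best_ns;  /* Fastest measured iteration */
  uint64_t avg_ns;   /* Mean over the measured iterations */
};

/* Stand-in timer; the PR reads the platform performance counter instead. */
static uint64_t now_ns(void)
{
  struct timespec ts;

  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static struct bench_result bench_memcpy(void *dst, const void *src,
                                        size_t len, int warmup, int measured)
{
  struct bench_result r = { UINT64_MAX, 0 };
  uint64_t total = 0;
  int i;

  assert(measured > 0);

  /* Warm-up iterations are run but not timed. */
  for (i = 0; i < warmup; i++)
    {
      memcpy(dst, src, len);
    }

  /* Measured iterations: track the best time and the running total. */
  for (i = 0; i < measured; i++)
    {
      uint64_t t0 = now_ns();
      uint64_t dt;

      memcpy(dst, src, len);
      dt = now_ns() - t0;

      total += dt;
      if (dt < r.best_ns)
        {
          r.best_ns = dt;
        }
    }

  r.avg_ns = total / (uint64_t)measured;
  return r;
}
```

With the parameters from the description, a case would be run as `bench_memcpy(dst + dst_off, src + src_off, len, 8, 32)`.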
   
   The test matrix is split by the execution paths in `arch_memcpy.S`:
   
   - Small copies below 64 bytes, where the new medium/long NEON loops are not 
expected to run.
   - Medium mutually aligned copies from 64 to 511 bytes, covering 
`.Lcpy_body_medium`.
   - Long mutually aligned copies from 512 bytes to 3 MB, covering 
`.Lcpy_body_long`.
   - Misaligned copies where `(src & 7) != (dst & 7)`, covering 
`.Lcpy_notaligned` as a control path that is not changed by this PR.
   
   For the mutually aligned groups, we tested both fully aligned buffers 
(`src_off = 0`, `dst_off = 0`) and shifted-but-mutually-aligned buffers 
(`src_off = 5`, `dst_off = 5`). For the misaligned control group, we tested 
offset pairs such as `src_off = 0`, `dst_off = 1` and `src_off = 3`, 
`dst_off = 7`.
   
   In summary, the test matrix has four groups:
   
   ```
   Group A  Small (<64 B)
   Group B  Medium aligned 64–511 B     .Lcpy_body_medium  ← add NEON path
   Group C  Long aligned ≥ 512 B        .Lcpy_body_long    ← add NEON path
   Group D  Misaligned (src&7 ≠ dst&7)  .Lcpy_notaligned
   ```
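
The grouping above can be expressed as a small classifier. This is a sketch of how a test case maps to a group, not a transcription of the branch logic in `arch_memcpy.S`. The `copy_group` enum and both function names are hypothetical, and checking the size before the alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the four test groups. */
enum copy_group
{
  GROUP_SMALL,       /* Group A: below 64 bytes */
  GROUP_MEDIUM,      /* Group B: 64 to 511 bytes, mutually aligned */
  GROUP_LONG,        /* Group C: 512 bytes and above, mutually aligned */
  GROUP_MISALIGNED   /* Group D: low three address bits differ */
};

static enum copy_group classify(uintptr_t dst, uintptr_t src, size_t len)
{
  if (len < 64)
    {
      return GROUP_SMALL;
    }

  /* Parentheses are required: != binds tighter than & in C. */
  if ((src & 7) != (dst & 7))
    {
      return GROUP_MISALIGNED;
    }

  return len < 512 ? GROUP_MEDIUM : GROUP_LONG;
}

/* Offset pairs from the test description, applied to 64-byte aligned bases. */
static void classify_selfcheck(void)
{
  const uintptr_t src = 0x100000;  /* Example base addresses, both 64-byte */
  const uintptr_t dst = 0x500000;  /* aligned as in the benchmark */

  assert(classify(dst + 5, src + 5, 256) == GROUP_MEDIUM);
  assert(classify(dst + 1, src + 0, 256) == GROUP_MISALIGNED);
  assert(classify(dst + 7, src + 3, 4096) == GROUP_MISALIGNED);
}
```

The `src_off = 5`, `dst_off = 5` case lands in the aligned groups because only the difference between the low three bits of the two addresses matters.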
   
   Expected result:
   
   - Medium and long mutually aligned cases should show improved throughput or 
lower cycles per byte after this change.
   - Small-copy cases should not show meaningful change from this PR.
   - Misaligned control cases should remain broadly unchanged, since they 
continue to use the existing `.Lcpy_notaligned` path.
   
   Result:
   
   ### Group A — Small (<64 B): no change (expected)
   
   ### Group B — Medium aligned (64–511 B): **+70–73% throughput**
   
   | Size | Before (MB/s) | After (MB/s) | Gain | Before (cyc/B) | After (cyc/B) |
   |---|---|---|---|---|---|
   | 64 B | 42.7 | 73.5 | **+72 %** | 27.9 | 16.2 |
   | 128 B | 41.9 | 71.4 | **+70 %** | 28.5 | 16.7 |
   | 192 B | 43.2 | 74.6 | **+73 %** | 27.6 | 16.0 |
   | 256 B | 42.7 | 73.4 | **+72 %** | 27.9 | 16.2 |
   | 320 B | 43.2 | 74.8 | **+73 %** | 27.6 | 15.9 |
   | 384 B | 42.9 | 74.0 | **+73 %** | 27.8 | 16.1 |
   | 448 B | 43.2 | 74.9 | **+73 %** | 27.6 | 15.9 |
   
   ### Group C — Long aligned (≥ 512 B): **+73–110% throughput**
   
   | Size | Before (MB/s) | After (MB/s) | Gain | Before (cyc/B) | After (cyc/B) |
   |---|---|---|---|---|---|
   | 512 B | 43.0 | 74.3 | **+73 %** | 27.7 | 16.0 |
   | 1 KB | 39.5 | 74.6 | **+89 %** | 30.2 | 16.0 |
   | 4 KB | 38.0 | 73.6 | **+94 %** | 31.4 | 16.2 |
   | 8 KB | 35.1 | 72.3 | **+106 %** | 33.9 | 16.5 |
   | 16 KB | 31.0 | 64.8 | **+109 %** | 38.4 | 18.4 |
   | 64 KB | 31.0 | 64.7 | **+109 %** | 38.4 | 18.4 |
   | 256 KB | 30.9 | 64.7 | **+110 %** | 38.6 | 18.4 |
   | 1 MB | 30.8 | 64.6 | **+110 %** | 38.7 | 18.5 |
   | 3 MB | 30.9 | 64.6 | **+109 %** | 38.6 | 18.5 |
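
The columns in these tables are related by simple arithmetic, sketched below. The sketch assumes that 1 MB means 2^20 bytes, which is not stated in the description but reproduces the reported cycles-per-byte values at 1250 MHz. The macro and function names are hypothetical.

```c
#include <assert.h>

#define CPU_HZ        1250000000.0  /* 1250 MHz, from the test platform */
#define BYTES_PER_MB  1048576.0     /* Assumption: 1 MB = 2^20 bytes */

/* Cycles spent per byte copied at a given throughput. */
static double cycles_per_byte(double mb_per_s)
{
  return CPU_HZ / (mb_per_s * BYTES_PER_MB);
}

/* Relative throughput gain, in percent. */
static double gain_percent(double before_mb_s, double after_mb_s)
{
  return (after_mb_s / before_mb_s - 1.0) * 100.0;
}

/* Spot-checks against the 64 B row of the Group B table. */
static void metrics_selfcheck(void)
{
  double cpb = cycles_per_byte(42.7);       /* Table reports 27.9 */
  double gain = gain_percent(42.7, 73.5);   /* Table reports +72 % */

  assert(cpb > 27.85 && cpb < 27.95);
  assert(gain > 71.5 && gain < 72.5);
}
```

Because the table values are rounded to one decimal place, a gain recomputed from them can differ from the reported figure by about one percentage point.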
   
   ### Group D — Misaligned (control group): **no change (expected)**

