Re: [PATCH v11] eal/x86: optimize memcpy of small sizes

Stephen Hemminger Thu, 21 May 2026 15:42:31 -0700

On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <[email protected]> wrote:


> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
> 
> Furthermore, __rte_restrict was added to source and destination addresses.
> 
> Also, the missing implementation of rte_mov48() was added.
> 
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
> 
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
> 
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
> 
> Signed-off-by: Morten Brørup <[email protected]>
> Acked-by: Bruce Richardson <[email protected]>
> Acked-by: Konstantin Ananyev <[email protected]>
> ---

Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.

$ ./devtools/ai/compare-patch-reviews.sh 
~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch 
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================

>>> Running review with: anthropic

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude) 
===
Review date: 2026-05-21

# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch

## Correctness Issues

### Errors

**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684, 
704, 707)**

The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n 
< 16`, results in accessing memory before the `dst` pointer. This is used in an 
optimization technique (two overlapping stores) that is correct when `n >= 16`, 
but the code paths in question guarantee `n` is in specific ranges where this 
is safe. However, the readability is poor and the intent is not obvious from 
the code.

**Confidence: 80%** - This pattern is widely used in memcpy implementations and 
appears intentional, but deserves a comment explaining the technique to avoid 
confusion in future reviews.

---

## Warnings

**Missing documentation for new function rte_mov48() (line 130)**

The patch adds `rte_mov48()` but does not mention it in the commit message's 
"Also, the missing implementation of rte_mov48() was added." The implementation 
itself is correct, but there's no context about why it was missing or whether 
it's used elsewhere in DPDK.

**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or 
document why it's being added speculatively.

---

**Inconsistent handling of compile-time constant size (lines 682-683)**

The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate 
copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`

But in the 33-64 byte range (lines 695-707), there's no similar optimization. 
This is not a bug, but the inconsistency in optimization strategy could be 
noted.

**Suggestion:** Add a comment explaining the rationale for when this 
optimization matters vs when the "harmless duplicate" is acceptable.

---

**Potential alignment assumption in AVX path (line 699)**

The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes 
33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```

For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be 
misaligned for the 32-byte `__m256i` operations. However, 
`_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting 
unaligned access, so this is safe.

**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on 
unaligned load/store support.

---

## Style and Process Issues

### Warnings

**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**

The multi-line comment explaining the rationale for the `rte_uint*_alias` 
structures does not follow the standard DPDK comment block format with the `*` 
on each line. However, this may be intentional for a technical explanation.

---

**Return value inconsistency (lines 682, 690-691, 693, 707-708)**

The function returns `dst` in some branches but the original `ret` pointer in 
others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which 
returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which 
returns `ret = dst`

Both `ret` and `dst` point to the same location, so this is harmless, but using 
`dst` consistently would be clearer. The patch description states "Pointer to 
the destination data" is returned, and both satisfy this.

**Confidence: 95%** - Not an error, but inconsistent style within the same 
function.

---

## Documentation Issues

### Warnings

**Missing documentation update for removed warning suppression (lines removed 
at top)**

The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```

The commit message explains this is because the problematic drivers were 
updated, with a `Depends-on: series-38174`. However, there's no comment in the 
code explaining why the suppression was removed or a note in 
`doc/guides/rel_notes/` documenting this change.

**Suggestion:** Add a release note entry documenting the removal of the warning 
suppression and the dependency on the driver updates.

---

## Summary

**Priority findings:**

1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but 
deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow 
warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless 
but could be unified).

**Overall assessment:** The patch appears functionally correct. The main 
concerns are documentation completeness (missing context for `rte_mov48()`, 
missing release notes for warning removal) and code clarity (overlapping copy 
pattern could use explanatory comments). No blocking correctness issues were 
identified.
(anthropic: review reported errors)

========================================

>>> Running review with: openai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via 
ChatGPT) ===
Review date: 2026-05-21

**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

---

### Errors

None found.

- The patch does not introduce resource leaks, use-after-free, race conditions, 
or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy 
routines; all copy size calculations and code paths for small copies (<64 
bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values 
are always valid).

---

### Warnings

1. **Potential inconsistent comment phrasing regarding overlap requirements.**

   - Some Doxygen comments use "locations should not overlap" while others now 
correctly say "locations must not overlap".
   - While this isn't a correctness bug, consistent strong wording clarifies 
requirement for users.

   **Suggested fix:** Standardize all relevant Doxygen comments to use 
"locations must not overlap."

   ```c
   /**
    * Copy bytes from one location to another,
    * locations must not overlap.
    */
   ```

2. **Redundant Doxygen param alignment in function headers.**

   - Some parameter descriptions ("Pointer to the destination of the 
data"/"Pointer to the destination data") remain a bit unclear or inconsistent. 
If updating comments, consider making param descriptions exact and explicit.

   **Suggested fix:** For each `@param` clarify if it's the destination or 
source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer 
to the source."` suffice.

---

### Info

- Use of `__rte_restrict` is correct and improves optimization; matches DPDK 
style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct 
because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and 
restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability 
guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were 
added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable 
names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test 
infrastructure.
- No RST or external documentation is involved in this patch.

---

### Summary

**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but 
these do not affect correctness.

**This patch is in good shape and complies with DPDK coding and architectural 
guidelines.**
(openai: review reported errors)

========================================

>>> Running review with: xai

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok) 
===
Review date: 2026-05-21

**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**

### Errors

**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency / 
correctness risk)**  
The new `rte_mov48()` function is declared with `__rte_restrict`, but the 
implementation casts away the restrict qualifier when calling `rte_mov32()` and 
`rte_mov16()`:

```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```

**Suggested fix**: Keep the restrict-qualified parameters in the implementation 
or cast only after documenting the aliasing contract. This is a minor style 
inconsistency but worth fixing for uniformity.

**2. Potential sign-extension issue on 64-bit offset calculations (left-shift 
of narrow type)**  
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the 
subsequent pointer arithmetic:

```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t 
*)src + n - 8);
```

relies on implicit promotion. While currently safe, the pattern of adding a 
`uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could 
trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an 
explicit cast:

```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```

**3. Use of `__rte_constant` inside hot path without documented behavior**  
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known 
sizes. This is a DPDK internal macro. While acceptable for performance-critical 
code, it should be documented (or at least noted) that the optimization only 
triggers for constant `n`. No correctness bug, but worth a comment.

### Warnings

**1. Duplicate function definitions for AVX/SSE variants**  
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()` 
and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This 
increases maintenance burden and instruction cache pressure. Consider factoring 
the common >64-byte logic into a single helper.

**2. Hard-coded cast chains in rte_mov48()**  
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on 
restrict-qualified pointers. While functionally correct, it reduces 
readability. A cleaner pattern would be:

```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```

**3. Inconsistent use of `return dst` vs `return ret`**  
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the 
old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret` 
are identical, using `ret` consistently would avoid any future divergence.

**4. Missing Doxygen for new rte_mov48()**  
The new function has a Doxygen comment, but the description is minimal. 
Consider adding a brief note that it is an internal helper (or mark it 
`__rte_internal` if it should not be public).

### Info

- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow` 
is correct and aligns with the series that removed `rte_memcpy` usage from 
drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good 
improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review 
(they are not code).

### Summary

No critical correctness bugs (use-after-free, leaks, races, or missing error 
checks) were identified. The main issues are minor style/consistency items 
around restrict qualifiers, pointer arithmetic, and code duplication. The patch 
is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy 
implementation.
(xai: review reported errors)

========================================

>>> Running review with: google

=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini) 
===
Review date: 2026-05-21

The following is a review of the patch against DPDK coding standards and 
guidelines.

### Warnings

**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types, 
rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))

/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```

### Info

**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)

========================================

Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0

Re: [PATCH v11] eal/x86: optimize memcpy of small sizes

Reply via email to