On Thu, 21 May 2026 18:56:31 +0000
Morten Brørup <[email protected]> wrote:
> The implementation for copying up to 64 bytes does not depend on address
> alignment with the size of the CPU's vector registers. Nonetheless, the
> exact same code for copying up to 64 bytes was present in both the aligned
> copy function and all the CPU vector register size specific variants of
> the unaligned copy functions.
> With this patch, the implementation for copying up to 64 bytes was
> consolidated into one instance, located in the common copy function,
> before checking alignment requirements.
> This provides three benefits:
> 1. No copy-paste in the source code.
> 2. A performance gain for copying up to 64 bytes, because the
> address alignment check is avoided in this case.
> 3. Reduced instruction memory footprint, because the compiler only
> generates one instance of the function for copying up to 64 bytes, instead
> of two instances (one in the unaligned copy function, and one in the
> aligned copy function).
>
> Furthermore, __rte_restrict was added to source and destination addresses.
>
> Also, the missing implementation of rte_mov48() was added.
>
> Until recently, some drivers required disabling stringop-overflow warnings
> when using rte_memcpy().
> For some strange reason, these warnings were disabled in the rte_memcpy
> header file, instead of in the problematic drivers.
> With series-38174 ("remove use of rte_memcpy from net/intel"), the
> problematic drivers were updated to use memcpy() instead of rte_memcpy(),
> so disabling these warnings is no longer required, and was removed.
>
> Regarding performance...
> The memcpy performance test (cache-to-cache copy) shows:
> Copying up to 15 bytes takes ca. 4.5 cycles, versus ca. 6.5 cycles before.
> Copying 8 bytes takes 4 cycles, versus 7 cycles before.
> Copying 16 bytes takes 2 cycles, versus 4 cycles before.
> Copying 64 bytes takes 4 cycles, versus 7 cycles before.
>
> Depends-on: series-38174 ("remove use of rte_memcpy from net/intel")
>
> Signed-off-by: Morten Brørup <[email protected]>
> Acked-by: Bruce Richardson <[email protected]>
> Acked-by: Konstantin Ananyev <[email protected]>
> ---
Here is the full wordy all providers reviews.
Mostly this is test to make sure script works.
$ ./devtools/ai/compare-patch-reviews.sh
~/Downloads/v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Reviewing patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch
Providers: anthropic,openai,xai,google
Format: text
========================================
>>> Running review with: anthropic
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Claude)
===
Review date: 2026-05-21
# Review of v11-eal-x86-optimize-memcpy-of-small-sizes.patch
## Correctness Issues
### Errors
**Use-after-free pattern in overlapping copy optimization (lines 672-673, 684,
704, 707)**
The pattern `(uint8_t *)dst - 16 + n` performs pointer arithmetic that, when `n
< 16`, results in accessing memory before the `dst` pointer. This is used in an
optimization technique (two overlapping stores) that is correct when `n >= 16`,
but the code paths in question guarantee `n` is in specific ranges where this
is safe. However, the readability is poor and the intent is not obvious from
the code.
**Confidence: 80%** - This pattern is widely used in memcpy implementations and
appears intentional, but deserves a comment explaining the technique to avoid
confusion in future reviews.
---
## Warnings
**Missing documentation for new function rte_mov48() (line 130)**
The patch adds `rte_mov48()` but does not mention it in the commit message's
"Also, the missing implementation of rte_mov48() was added." The implementation
itself is correct, but there's no context about why it was missing or whether
it's used elsewhere in DPDK.
**Suggestion:** Verify that `rte_mov48()` is actually used somewhere, or
document why it's being added speculatively.
---
**Inconsistent handling of compile-time constant size (lines 682-683)**
The code checks `__rte_constant(n) && n == 16` to avoid a "harmless duplicate
copy" but does not apply this optimization consistently. For example:
- Line 682: avoids duplicate for `n == 16`
- Line 690: avoids duplicate for `n == 32`
- Line 693: avoids duplicate for `n == 64`
But in the 33-64 byte range (lines 695-707), there's no similar optimization.
This is not a bug, but the inconsistency in optimization strategy could be
noted.
**Suggestion:** Add a comment explaining the rationale for when this
optimization matters vs when the "harmless duplicate" is acceptable.
---
**Potential alignment assumption in AVX path (line 699)**
The AVX path at line 699 uses two overlapping `rte_mov32()` calls for sizes
33-64 bytes:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
```
For `n = 33`, the second call accesses `dst + 1` and `src + 1`, which may be
misaligned for the 32-byte `__m256i` operations. However,
`_mm256_loadu_si256()` and `_mm256_storeu_si256()` are documented as supporting
unaligned access, so this is safe.
**Confidence: 90%** - Not a bug, but worth noting that the AVX path relies on
unaligned load/store support.
---
## Style and Process Issues
### Warnings
**Unconventional comment style in rte_mov15_or_less() (lines 67-72)**
The multi-line comment explaining the rationale for the `rte_uint*_alias`
structures does not follow the standard DPDK comment block format with the `*`
on each line. However, this may be intentional for a technical explanation.
---
**Return value inconsistency (lines 682, 690-691, 693, 707-708)**
The function returns `dst` in some branches but the original `ret` pointer in
others:
- Lines 682, 690, 707: return `dst`
- Line 714: return from `rte_memcpy_aligned_more_than_64(dst, src, n)` which
returns `ret = dst`
- Line 716: return from `rte_memcpy_generic_more_than_64(dst, src, n)` which
returns `ret = dst`
Both `ret` and `dst` point to the same location, so this is harmless, but using
`dst` consistently would be clearer. The patch description states "Pointer to
the destination data" is returned, and both satisfy this.
**Confidence: 95%** - Not an error, but inconsistent style within the same
function.
---
## Documentation Issues
### Warnings
**Missing documentation update for removed warning suppression (lines removed
at top)**
The patch removes:
```c
#if defined(RTE_TOOLCHAIN_GCC) && (GCC_VERSION >= 100000)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wstringop-overflow"
#endif
```
The commit message explains this is because the problematic drivers were
updated, with a `Depends-on: series-38174`. However, there's no comment in the
code explaining why the suppression was removed or a note in
`doc/guides/rel_notes/` documenting this change.
**Suggestion:** Add a release note entry documenting the removal of the warning
suppression and the dependency on the driver updates.
---
## Summary
**Priority findings:**
1. **(Info)** The overlapping copy optimization `dst - 16 + n` is correct but
deserves a comment explaining the technique.
2. **(Warning)** Missing documentation for why `rte_mov48()` was added.
3. **(Warning)** Missing release notes entry for removal of stringop-overflow
warning suppression.
4. **(Info)** Return value uses both `dst` and `ret` inconsistently (harmless
but could be unified).
**Overall assessment:** The patch appears functionally correct. The main
concerns are documentation completeness (missing context for `rte_mov48()`,
missing release notes for warning removal) and code clarity (overlapping copy
pattern could use explanatory comments). No blocking correctness issues were
identified.
(anthropic: review reported errors)
========================================
>>> Running review with: openai
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via
ChatGPT) ===
Review date: 2026-05-21
**DPDK Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**
---
### Errors
None found.
- The patch does not introduce resource leaks, use-after-free, race conditions,
or missing synchronization.
- There are no buffer overflows or out-of-bounds accesses in the memcpy
routines; all copy size calculations and code paths for small copies (<64
bytes) are explicit and conservative.
- No uninitialized or stale variable use detected.
- All error paths are clean (these are computational functions; return values
are always valid).
---
### Warnings
1. **Potential inconsistent comment phrasing regarding overlap requirements.**
- Some Doxygen comments use "locations should not overlap" while others now
correctly say "locations must not overlap".
- While this isn't a correctness bug, consistent strong wording clarifies
requirement for users.
**Suggested fix:** Standardize all relevant Doxygen comments to use
"locations must not overlap."
```c
/**
* Copy bytes from one location to another,
* locations must not overlap.
*/
```
2. **Redundant Doxygen param alignment in function headers.**
- Some parameter descriptions ("Pointer to the destination of the
data"/"Pointer to the destination data") remain a bit unclear or inconsistent.
If updating comments, consider making param descriptions exact and explicit.
**Suggested fix:** For each `@param` clarify if it's the destination or
source pointer. For `rte_memcpy`, `"Pointer to the destination."` and `"Pointer
to the source."` suffice.
---
### Info
- Use of `__rte_restrict` is correct and improves optimization; matches DPDK
style for pointer restrict qualifiers.
- Removal of `#pragma GCC diagnostic ignored "-Wstringop-overflow"` is correct
because the problematic drivers now use standard `memcpy()`.
- All new `rte_mov*` functions properly use `static __rte_always_inline` and
restrict-qualified pointers.
- Macro duplication is avoided by consolidation, in line with maintainability
guidelines.
- No forbidden tokens, banned constructs, or use of deprecated APIs/macros were
added.
- No symbol namespace pollution or missing `rte_`/`RTE_` prefix issues present.
- Coding style is maintained in indentation, tabs use, and function/variable
names.
- No API, public header, or C11 atomic/volatile/race issues are present.
- No documentation or API tag requirements are triggered (no public API change).
- Patch is self-contained and does not require changes to release notes or test
infrastructure.
- No RST or external documentation is involved in this patch.
---
### Summary
**No correctness/blocking issues found.**
Minor documentation consistency updates suggested for comment clarity, but
these do not affect correctness.
**This patch is in good shape and complies with DPDK coding and architectural
guidelines.**
(openai: review reported errors)
========================================
>>> Running review with: xai
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Grok)
===
Review date: 2026-05-21
**Review of patch: v11-eal-x86-optimize-memcpy-of-small-sizes.patch**
### Errors
**1. Missing `__rte_restrict` in rte_mov48() implementation (consistency /
correctness risk)**
The new `rte_mov48()` function is declared with `__rte_restrict`, but the
implementation casts away the restrict qualifier when calling `rte_mov32()` and
`rte_mov16()`:
```c
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
```
**Suggested fix**: Keep the restrict-qualified parameters in the implementation
or cast only after documenting the aliasing contract. This is a minor style
inconsistency but worth fixing for uniformity.
**2. Potential sign-extension issue on 64-bit offset calculations (left-shift
of narrow type)**
In `rte_mov15_or_less()`, the expression `(n & 8)` is `size_t` (ok), but the
subsequent pointer arithmetic:
```c
*(uint64_t *)((uint8_t *)dst + n - 8) = *(const uint64_t *)((const uint8_t
*)src + n - 8);
```
relies on implicit promotion. While currently safe, the pattern of adding a
`uint8_t`/`size_t` value to a pointer after a narrow mask (`n & 7`) could
trigger Coverity SIGN_EXTENSION warnings on some toolchains. Consider an
explicit cast:
```c
uint8_t *d = (uint8_t *)dst + (size_t)(n - 8);
```
**3. Use of `__rte_constant` inside hot path without documented behavior**
The code repeatedly uses `__rte_constant(n)` to optimize compile-time known
sizes. This is a DPDK internal macro. While acceptable for performance-critical
code, it should be documented (or at least noted) that the optimization only
triggers for constant `n`. No correctness bug, but worth a comment.
### Warnings
**1. Duplicate function definitions for AVX/SSE variants**
The patch introduces three nearly identical `rte_memcpy_generic_more_than_64()`
and `rte_memcpy_aligned_more_than_64()` functions (one per vector size). This
increases maintenance burden and instruction cache pressure. Consider factoring
the common >64-byte logic into a single helper.
**2. Hard-coded cast chains in rte_mov48()**
The SSE implementation of `rte_mov48()` uses explicit `(uint8_t *)` casts on
restrict-qualified pointers. While functionally correct, it reduces
readability. A cleaner pattern would be:
```c
rte_mov16(dst + 0, src + 0);
rte_mov16(dst + 16, src + 16);
rte_mov16(dst + 32, src + 32);
```
**3. Inconsistent use of `return dst` vs `return ret`**
In the new `rte_memcpy()` implementation the fast path returns `dst`, while the
old `rte_memcpy_generic()` returned the saved `ret`. Although `dst` and `ret`
are identical, using `ret` consistently would avoid any future divergence.
**4. Missing Doxygen for new rte_mov48()**
The new function has a Doxygen comment, but the description is minimal.
Consider adding a brief note that it is an internal helper (or mark it
`__rte_internal` if it should not be public).
### Info
- The removal of the `#pragma GCC diagnostic` block for `-Wstringop-overflow`
is correct and aligns with the series that removed `rte_memcpy` usage from
drivers.
- Adding `__rte_restrict` to `rte_memcpy()` and helper functions is a good
improvement and matches modern DPDK style.
- Performance claims in the commit message are outside the scope of this review
(they are not code).
### Summary
No critical correctness bugs (use-after-free, leaks, races, or missing error
checks) were identified. The main issues are minor style/consistency items
around restrict qualifiers, pointer arithmetic, and code duplication. The patch
is otherwise clean and follows DPDK coding guidelines for the x86 EAL memcpy
implementation.
(xai: review reported errors)
========================================
>>> Running review with: google
=== Patch Review: v11-eal-x86-optimize-memcpy-of-small-sizes.patch (via Gemini)
===
Review date: 2026-05-21
The following is a review of the patch against DPDK coding standards and
guidelines.
### Warnings
**Implicit integer comparison in rte_memcpy.**
DPDK coding style requires explicit comparison against 0 for integer types,
rather than using truthiness/logical negation.
```c
/* Current implementation */
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
/* Suggested fix */
if ((((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK) == 0)
```
### Info
**Use of bitwise logic in rte_mov15_or_less.**
While this patch only adds `__rte_restrict` to the signature, the existing
(google: review reported warnings)
========================================
Review comparison complete.
Summary across 4 provider(s): clean=0 warnings=1 errors=3 failed=0