On Fri,  9 Jan 2026 20:56:48 -0500
[email protected] wrote:

> From: Scott Mitchell <[email protected]>
> 
> This series optimizes __rte_raw_cksum() by replacing memcpy-based access
> with unaligned_uint16_t pointer access, enabling vectorization in both
> GCC and Clang. The series is split into three patches to clearly separate
> the core optimization from compiler-specific workarounds.
> 
> Performance improvement from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) with Clang 18.1 (TSC cycles/byte):
> 
>   Block size    Before    After    Improvement
>          100      0.40     0.24        ~40%
>         1500      0.50     0.06        ~8x
>         9000      0.49     0.06        ~8x
> 
> Changes in v12:
> - Split into 3-patch series per reviewer feedback
> - Patch 1/3: Core optimization and test additions
> - Patch 2/3: UBSAN alignment workaround (separate from GCC bug)
> - Patch 3/3: GCC optimization bug workaround
> - Reverted len & 1 to len % 2 and restored unlikely() per feedback
> - Renamed RTE_SUPPRESS_UNINITIALIZED_WARNING to RTE_FORCE_INIT_BARRIER
> - Applied minimal changes (no refactoring) to existing code
> - Deferred hinic driver refactoring to future series
> 
> Note: Patch 1/3 will trigger compiler warnings/failures on GCC versions
> with the optimization bug (GCC 11.5.0 and others seen on DPDK CI). These
> are resolved by patches 2/3 and 3/3.
> 
> Scott Mitchell (3):
>   net: optimize __rte_raw_cksum and add tests
>   eal: add workaround for UBSAN alignment false positive
>   eal/net: add workaround for GCC optimization bug
> 
>  app/test/meson.build             |   1 +
>  app/test/test_cksum_fuzz.c       | 240 +++++++++++++++++++++++++++++++
>  app/test/test_cksum_perf.c       |   2 +-
>  drivers/net/hinic/hinic_pmd_tx.c |   2 +
>  drivers/net/mlx5/mlx5_flow_dv.c  |   2 +
>  lib/eal/include/rte_common.h     |  23 +++
>  lib/net/rte_cksum.h              |  15 +-
>  lib/net/rte_ip4.h                |   1 +
>  lib/net/rte_ip6.h                |   1 +
>  9 files changed, 277 insertions(+), 10 deletions(-)
>  create mode 100644 app/test/test_cksum_fuzz.c
> 
> --
> 2.39.5 (Apple Git-154)
> 

I don't like to see so much compiler hacking, and I'm not sure it is worth
having so much special-case code for a few cycles.

Feeling lazy, I asked an AI to compare the FreeBSD and Linux checksum
implementations, and its answer was enlightening.

The optimum solution, in my opinion, would be:
   - specific IPv4 and IPv6 header routines written in assembly on a
     per-platform basis. Add-with-carry will be faster than all these
     compiler flags.
   - a non-inline function for the general case with data.

This should eliminate the need for compiler flags.
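
Roughly what I have in mind for the header-only case (an untested sketch in
plain C; a real version would be per-platform assembly using an ADC/ADCS
chain and would handle IPv4 options):

```c
#include <stdint.h>
#include <string.h>

/* Untested sketch: checksum of a 20-byte IPv4 header (no options).
 * Assumes the checksum field in the header is already zero.
 * A per-platform asm version would replace the adds with an ADC chain. */
static inline uint16_t ipv4_hdr_cksum_sketch(const void *hdr)
{
        uint32_t w[5];
        uint64_t sum;

        memcpy(w, hdr, sizeof(w));  /* alignment-safe load of 5 x 32-bit words */
        sum = (uint64_t)w[0] + w[1] + w[2] + w[3] + w[4];

        sum = (sum & 0xffffffff) + (sum >> 32);  /* fold 64 -> 32 */
        sum = (sum & 0xffff) + (sum >> 16);      /* fold 32 -> 16 */
        sum = (sum & 0xffff) + (sum >> 16);      /* absorb the final carry */
        return (uint16_t)~sum;                   /* same byte order as the data */
}
```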


Compare the algorithms used to compute the standard IP checksum in FreeBSD and Linux

Great question! Both implement the same RFC 1071 one's complement checksum, but 
their optimization strategies differ in interesting ways.

## Core Algorithm

Both compute a 16-bit one's complement sum: accumulate 16-bit words, fold 
carries back in, and complement the result. The differences are in the 
implementation details.
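
As a reference point, a minimal, illustrative C version of that core loop
(not taken from either kernel) looks like this:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: RFC 1071 checksum over a flat buffer, summing 16-bit
 * big-endian words; a 32-bit accumulator is plenty for packet-sized input. */
static uint16_t rfc1071_cksum(const uint8_t *data, size_t len)
{
        uint32_t sum = 0;

        while (len >= 2) {
                sum += (uint32_t)data[0] << 8 | data[1];
                data += 2;
                len -= 2;
        }
        if (len == 1)
                sum += (uint32_t)data[0] << 8;  /* pad the odd trailing byte */

        while (sum >> 16)                       /* fold carries back in */
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;                  /* store in network byte order */
}
```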

## Linux Approach

Linux uses a layered architecture with generic C fallbacks and arch-specific 
optimizations:

**Generic path** (`lib/checksum.c`, `include/net/checksum.h`):
- `do_csum()` uses a 64-bit accumulator, processing 32 bytes per loop iteration 
(heavily unrolled)
- Defers folding until the end to minimize carry handling
- `csum_fold()` does the final 32→16 bit fold with a clever trick:
  ```c
  /* asm-generic version, with the kernel types (__wsum/__sum16) simplified */
  static inline uint16_t csum_fold(uint32_t sum)
  {
          sum = (sum & 0xffff) + (sum >> 16);  /* add high half into low half */
          sum = (sum & 0xffff) + (sum >> 16);  /* absorb the carry from the first add */
          return (uint16_t)~sum;
  }
  ```
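
The accumulation that feeds `csum_fold()` works roughly like this (a
simplified sketch, not the kernel's actual `do_csum()`, which is more heavily
unrolled and handles alignment and tail bytes):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: sum 64-bit chunks into a wide accumulator, capturing the
 * end-around carry on each addition, and fold only once at the end.
 * Tail handling (len % 8 != 0) is omitted for brevity. */
static uint16_t csum_defer_fold(const uint8_t *p, size_t len)
{
        uint64_t sum = 0;

        while (len >= 8) {
                uint64_t v;

                memcpy(&v, p, 8);   /* alignment-safe 64-bit load */
                sum += v;
                sum += (sum < v);   /* carry out of bit 63 wraps around (2^64 == 1 mod 0xffff) */
                p += 8;
                len -= 8;
        }

        sum = (sum & 0xffffffff) + (sum >> 32);  /* fold 64 -> 32 */
        sum = (sum & 0xffffffff) + (sum >> 32);
        sum = (sum & 0xffff) + (sum >> 16);      /* then 32 -> 16, as csum_fold() above */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;                   /* result in the same byte order as the data */
}
```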

**IP header fast path** (`ip_fast_csum()`):
- Specialized for the common 20-byte header case
- On x86-64, uses ADC (add-with-carry) chains to exploit hardware carry 
propagation
- ARM64 uses similar ADCS instruction chains

## FreeBSD Approach

FreeBSD's `in_cksum()` (`sys/netinet/in_cksum.c`) takes a different approach:

**mbuf-oriented design**:
- Walks mbuf chains directly, so it can checksum data that is split across
multiple mbufs
- Tracks byte-swapping state when data crosses 16-bit boundaries between mbufs

**Accumulation strategy**:
- Uses an `int` (32-bit) accumulator in the generic version
- Processes data in 16-byte chunks with explicit unrolling
- Folds more frequently (after each mbuf) rather than deferring everything
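
A hypothetical sketch of those two ideas together: walking a chain of
segments, carrying an "odd byte" state across them, and folding after each
segment. `struct seg` is a stand-in, not the real `struct mbuf`, and the
sketch composes big-endian words explicitly rather than byte-swapping the
running sum the way FreeBSD does:

```c
#include <stdint.h>
#include <stddef.h>

struct seg {                    /* stand-in for struct mbuf */
        const uint8_t *data;
        size_t len;
        const struct seg *next;
};

static uint16_t cksum_seg_chain(const struct seg *s)
{
        uint32_t sum = 0;
        int odd = 0;            /* did the previous segment end mid-word? */

        for (; s != NULL; s = s->next) {
                const uint8_t *p = s->data;
                size_t len = s->len;

                if (odd && len > 0) {
                        sum += *p++;    /* low half of the word begun in the previous segment */
                        len--;
                        odd = 0;
                }
                while (len >= 2) {
                        sum += (uint32_t)p[0] << 8 | p[1];
                        p += 2;
                        len -= 2;
                }
                if (len == 1) {
                        sum += (uint32_t)p[0] << 8;  /* high half; finished in the next segment */
                        odd = 1;
                }
                sum = (sum & 0xffff) + (sum >> 16);  /* fold after each segment */
        }
        sum = (sum & 0xffff) + (sum >> 16);          /* final fold and complement */
        return (uint16_t)~sum;
}
```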

**Architecture-specific versions**:
- x86 versions historically used inline assembly with ADC chains
- More recent versions lean on compiler intrinsics

## Key Differences

| Aspect | Linux | FreeBSD |
|--------|-------|---------|
| Accumulator | 64-bit, fold once at end | 32-bit, fold per mbuf |
| Loop unroll | 32 bytes/iteration | 16 bytes/iteration |
| Data structure | Works on flat buffers via `skb_checksum()` | Direct mbuf chain walking |
| Alignment | Handles via initial byte adjustment | Similar, but tracks "odd byte" state across mbufs |
| IP header | Dedicated `ip_fast_csum()` | Often uses general `in_cksum()` |

## Performance Implications

Linux's approach tends to be faster for large contiguous buffers due to larger 
unrolling and deferred folding. FreeBSD's mbuf-aware design is arguably cleaner 
for scattered data but may do more fold operations.

On modern CPUs with good branch prediction and out-of-order execution, the 
differences are often marginal—the ADC chain optimizations matter more than the 
high-level strategy. Both have moved toward letting the compiler handle more 
optimization rather than hand-tuned assembly, though Linux still maintains more 
arch-specific variants.

Given your DPDK work, you've probably noticed DPDK's `rte_raw_cksum()` takes 
yet another approach—often vectorized with SIMD when available, which can 
significantly outperform both kernel implementations for large payloads.
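
For reference, a minimal use of that entry point might look like the sketch
below; `rte_raw_cksum()` returns the non-complemented 16-bit sum, and the
exact header layout varies between DPDK releases:

```c
#include <stddef.h>
#include <stdint.h>
#include <rte_ip.h>   /* pulls in the checksum helpers in most releases */

/* Complement the raw sum to get the final Internet checksum of a flat buffer. */
static uint16_t buf_cksum(const void *buf, size_t len)
{
        return (uint16_t)~rte_raw_cksum(buf, len);
}
```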
