> From: [email protected] [mailto:[email protected]] > Sent: Wednesday, 7 January 2026 18.04 > > From: Scott Mitchell <[email protected]> > > Optimize __rte_raw_cksum() by processing data in larger unrolled loops > instead of iterating word-by-word. The new implementation processes > 64-byte blocks (32 x uint16_t) in the hot path, followed by smaller > 32/16/8/4/2-byte chunks.
Playing around with Godbolt: https://godbolt.org/z/oYdP9xxfG

With the original code (built with -msse4.2), the compiler vectorizes
the loop to process 16-byte chunks (instead of the 2-byte chunks the
source code indicates). When built with -mavx512f, it processes 32-byte
chunks.

IMHO, the compiled output of the new code is too big; using more than
12 kB of instructions consumes too much L1 instruction cache. I suppose
the compiler both vectorizes and unrolls the loops.

> 
> Uses a uint32_t accumulator with explicit casts to prevent signed
> integer overflow, and leverages unaligned_uint16_t for safe unaligned
> access on all platforms. Adds the __rte_no_ubsan_alignment attribute
> to suppress false-positive alignment warnings from
> UndefinedBehaviorSanitizer.
> 
> Performance results from cksum_perf_autotest (TSC cycles/byte):
> 
> Block size    Before        After         Improvement
> 100           0.40-0.64     0.13-0.14     ~3-4x
> 1500          0.49-0.51     0.10-0.11     ~4-5x
> 9000          0.48-0.51     0.11-0.12     ~4x

On which machine did you achieve these perf numbers?

Can a measurable performance increase be achieved using significantly
smaller compiled code than this patch?
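For example, the classic RFC 1071 approach of summing the bulk of the
buffer as 64-bit words with end-around carry keeps both the source and
the compiled code small, because the Internet checksum can be computed
at any word size and folded down afterwards. An untested sketch (the
function name is mine; memcpy() is used for unaligned loads instead of
the patch's unaligned_uint16_t; the returned value is congruent modulo
0xffff with the plain 16-bit sum, which is all the final reduction
cares about; whether it actually beats the auto-vectorized original
would need measuring with cksum_perf_autotest):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static inline uint32_t
raw_cksum_u64(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	uint64_t s = sum;

	/* Bulk: 64-bit words, turning wrap-around into an end-around
	 * carry (RFC 1071 word-size independence). */
	while (len >= sizeof(uint64_t)) {
		uint64_t v;

		memcpy(&v, p, sizeof(v));
		s += v;
		s += (s < v); /* carry */
		p += sizeof(v);
		len -= sizeof(v);
	}

	/* Tail: remaining 16-bit words, same carry handling. */
	while (len >= sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		s += v;
		s += (s < v);
		p += sizeof(v);
		len -= sizeof(v);
	}

	/* Odd trailing byte, zero-extended for byte-order independence. */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, p, 1);
		s += left;
		s += (s < left);
	}

	/* Fold the 64-bit sum to 32 bits; two rounds suffice, and the
	 * value stays congruent modulo 0xffff. */
	s = (s & UINT32_MAX) + (s >> 32);
	s = (s & UINT32_MAX) + (s >> 32);
	return (uint32_t)s;
}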
> 
> Signed-off-by: Scott Mitchell <[email protected]>
> ---
> Changes in v3:
> - Added __rte_no_ubsan_alignment macro to suppress false-positive
>   UBSAN alignment warnings when using unaligned_uint16_t
> - Fixed false-positive GCC maybe-uninitialized warning in rte_ip6.h
>   exposed by the optimization (can be split into a separate patch once
>   verified on CI)
> 
> Changes in v2:
> - Fixed UndefinedBehaviorSanitizer errors by adding uint32_t casts to
>   prevent signed integer overflow in addition chains
> - Restored uint32_t sum accumulator instead of uint64_t
> - Added 64k length to test_cksum_perf.c
> 
> diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> index a8e8927952..d6e313dea5 100644
> --- a/lib/net/rte_cksum.h
> +++ b/lib/net/rte_cksum.h
> @@ -39,24 +39,64 @@ extern "C" {
>   * @return
>   *   sum += Sum of all words in the buffer.
>   */
> +__rte_no_ubsan_alignment
>  static inline uint32_t
>  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
> -	const void *end;
> +	/* Process in 64 byte blocks (32 x uint16_t). */
> +	/* Always process as uint16_t chunks to preserve overflow/carry. */
> +	const void *end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, 64));
> +	while (buf != end) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7] +
> +			p16[8] + p16[9] + p16[10] + p16[11] +
> +			p16[12] + p16[13] + p16[14] + p16[15] +
> +			p16[16] + p16[17] + p16[18] + p16[19] +
> +			p16[20] + p16[21] + p16[22] + p16[23] +
> +			p16[24] + p16[25] + p16[26] + p16[27] +
> +			p16[28] + p16[29] + p16[30] + p16[31];
> +		buf = RTE_PTR_ADD(buf, 64);
> +	}
> 
> -	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
> -	    buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> -		uint16_t v;
> +	if (len & 32) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7] +
> +			p16[8] + p16[9] + p16[10] + p16[11] +
> +			p16[12] + p16[13] + p16[14] + p16[15];
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
> 
> -		memcpy(&v, buf, sizeof(uint16_t));
> -		sum += v;
> +	if (len & 16) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7];
> +		buf = RTE_PTR_ADD(buf, 16);
>  	}
> 
> -	/* if length is odd, keeping it byte order independent */
> -	if (unlikely(len % 2)) {
> -		uint16_t left = 0;
> +	if (len & 8) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3];
> +		buf = RTE_PTR_ADD(buf, 8);
> +	}
> 
> -		memcpy(&left, end, 1);
> +	if (len & 4) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1];
> +		buf = RTE_PTR_ADD(buf, 4);
> +	}
> +
> +	if (len & 2) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += *p16;
> +		buf = RTE_PTR_ADD(buf, 2);
> +	}
> +
> +	/* If length is odd, use memcpy for byte-order independence. */
> +	if (len & 1) {
> +		uint16_t left = 0;
> +		memcpy(&left, buf, 1);
>  		sum += left;
>  	}
> 
> diff --git a/lib/net/rte_ip6.h b/lib/net/rte_ip6.h
> index d1abf1f5d5..af65a39815 100644
> --- a/lib/net/rte_ip6.h
> +++ b/lib/net/rte_ip6.h
> @@ -564,7 +564,7 @@ rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
>  	struct {
>  		rte_be32_t len;   /* L4 length. */
>  		rte_be32_t proto; /* L4 protocol - top 3 bytes must be zero */
> -	} psd_hdr;
> +	} psd_hdr = {0}; /* Empty initializer avoids false-positive maybe-uninitialized warning */
> 
>  	psd_hdr.proto = (uint32_t)(ipv6_hdr->proto << 24);
>  	if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))

Maybe the ipv6 warning can be fixed like this instead:

-	if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
-		psd_hdr.len = 0;
-	else
-		psd_hdr.len = ipv6_hdr->payload_len;
+	psd_hdr.len = (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) ?
+			0 : ipv6_hdr->payload_len;

> --
> 2.39.5 (Apple Git-154)
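A side note: the patch uses __rte_no_ubsan_alignment, but the macro's
definition is not part of this diff. I would expect something along
these lines (a hypothetical sketch only, not the actual v3 code; Clang
and GCC >= 8 accept the no_sanitize function attribute with a sanitizer
name string, and other compilers get an empty fallback):

#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 8)
#define __rte_no_ubsan_alignment __attribute__((no_sanitize("alignment")))
#else
#define __rte_no_ubsan_alignment
#endif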

