On Wed, Jan 7, 2026 at 12:56 PM Morten Brørup <[email protected]> 
wrote:
>
> > From: [email protected] [mailto:[email protected]]
> > Sent: Wednesday, 7 January 2026 18.04
> >
> > From: Scott Mitchell <[email protected]>
> >
> > Optimize __rte_raw_cksum() by processing data in larger unrolled loops
> > instead of iterating word-by-word. The new implementation processes
> > 64-byte blocks (32 x uint16_t) in the hot path, followed by smaller
> > 32/16/8/4/2-byte chunks.
>
> Playing around with Godbolt:
> https://godbolt.org/z/oYdP9xxfG
>
> With the original code (built with -msse4.2), the compiler vectorizes the 
> loop to process 16-byte chunks (instead of the 2-byte chunks the source code 
> indicates).
> When built with -mavx512f, it processes 32-byte chunks.
>
> IMHO, the compiled output of the new code is too big; using more than 12 kB 
> of instructions consumes too much L1 Instruction Cache.
> I suppose the compiler both vectorizes and unrolls the loop.

Good observation, and Godbolt is very handy! Agreed, this patch isn't
desirable on x86-64 with gcc 15.2. I am using clang 18.1.8 (Red Hat),
where the original version doesn't vectorize but my patch does, and the
icache footprint isn't as bloated as with gcc, which explains the perf
difference I measured.

I'm exploring an approach that vectorizes on both gcc and clang, and
will submit an update soon; a rough sketch of the direction is below.
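
Something along these lines (not the actual v4; the function name and
exact loop shape are just for illustration, it would replace the body of
__rte_raw_cksum() in rte_cksum.h):

__rte_no_ubsan_alignment
static inline uint32_t
__rte_raw_cksum_sketch(const void *buf, size_t len, uint32_t sum)
{
	const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
	size_t i;

	/* Plain indexed loop: let the compiler pick the unroll factor and
	 * vector width instead of hand-unrolling 64-byte blocks. */
	for (i = 0; i < len / 2; i++)
		sum += (uint32_t)p16[i];

	/* If length is odd, keep the last byte byte-order independent. */
	if (len & 1) {
		uint16_t left = 0;
		memcpy(&left, RTE_PTR_ADD(buf, len - 1), 1);
		sum += left;
	}

	return sum;
}

The hope is that indexing through unaligned_uint16_t gives clang 18 the
same freedom the per-word memcpy() already gives gcc, while keeping the
compiled code close to the original in size. I'll check it on Godbolt and
with cksum_perf_autotest before sending v4.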

>
> >
> > Uses uint32_t accumulator with explicit casts to prevent signed integer
> > overflow and leverages unaligned_uint16_t for safe unaligned access on
> > all platforms. Adds __rte_no_ubsan_alignment attribute to suppress
> > false-positive alignment warnings from UndefinedBehaviorSanitizer.
> >
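
(For anyone reading along: the __rte_no_ubsan_alignment definition is in
a hunk not quoted here. It is the usual per-function sanitizer opt-out,
along the lines of the sketch below; the exact guards are in the full
patch.)

#if defined(__clang__) || (defined(__GNUC__) && __GNUC__ >= 8)
#define __rte_no_ubsan_alignment __attribute__((no_sanitize("alignment")))
#else
#define __rte_no_ubsan_alignment
#endif
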
> > Performance results from cksum_perf_autotest (TSC cycles/byte):
> >   Block size    Before    After    Improvement
> >          100  0.40-0.64  0.13-0.14    ~3-4x
> >         1500  0.49-0.51  0.10-0.11    ~4-5x
> >         9000  0.48-0.51  0.11-0.12    ~4x
>
> On which machine do you achieve these perf numbers?
>
> Can a measurable performance increase be achieved using significantly smaller 
> compiled code than this patch?
>
> >
> > Signed-off-by: Scott Mitchell <[email protected]>
> > ---
> > Changes in v3:
> > - Added __rte_no_ubsan_alignment macro to suppress false-positive UBSAN
> >   alignment warnings when using unaligned_uint16_t
> > - Fixed false-positive GCC maybe-uninitialized warning in rte_ip6.h exposed
> >   by optimization (can be split to separate patch once verified on CI)
> >
> > Changes in v2:
> > - Fixed UndefinedBehaviorSanitizer errors by adding uint32_t casts to prevent
> >   signed integer overflow in addition chains
> > - Restored uint32_t sum accumulator instead of uint64_t
> > - Added 64k length to test_cksum_perf.c
> >
>
>
> > diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> > index a8e8927952..d6e313dea5 100644
> > --- a/lib/net/rte_cksum.h
> > +++ b/lib/net/rte_cksum.h
> > @@ -39,24 +39,64 @@ extern "C" {
> >   * @return
> >   *   sum += Sum of all words in the buffer.
> >   */
> > +__rte_no_ubsan_alignment
> >  static inline uint32_t
> >  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >  {
> > -     const void *end;
> > +     /* Process in 64 byte blocks (32 x uint16_t). */
> > +     /* Always process as uint16_t chunks to preserve overflow/carry. */
> > +     const void *end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, 64));
> > +     while (buf != end) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> > +                      p16[4] + p16[5] + p16[6] + p16[7] +
> > +                      p16[8] + p16[9] + p16[10] + p16[11] +
> > +                      p16[12] + p16[13] + p16[14] + p16[15] +
> > +                      p16[16] + p16[17] + p16[18] + p16[19] +
> > +                      p16[20] + p16[21] + p16[22] + p16[23] +
> > +                      p16[24] + p16[25] + p16[26] + p16[27] +
> > +                      p16[28] + p16[29] + p16[30] + p16[31];
> > +             buf = RTE_PTR_ADD(buf, 64);
> > +     }
> >
> > -     for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
> > -          buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> > -             uint16_t v;
> > +     if (len & 32) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> > +                      p16[4] + p16[5] + p16[6] + p16[7] +
> > +                      p16[8] + p16[9] + p16[10] + p16[11] +
> > +                      p16[12] + p16[13] + p16[14] + p16[15];
> > +             buf = RTE_PTR_ADD(buf, 32);
> > +     }
> >
> > -             memcpy(&v, buf, sizeof(uint16_t));
> > -             sum += v;
> > +     if (len & 16) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> > +                      p16[4] + p16[5] + p16[6] + p16[7];
> > +             buf = RTE_PTR_ADD(buf, 16);
> >       }
> >
> > -     /* if length is odd, keeping it byte order independent */
> > -     if (unlikely(len % 2)) {
> > -             uint16_t left = 0;
> > +     if (len & 8) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3];
> > +             buf = RTE_PTR_ADD(buf, 8);
> > +     }
> >
> > -             memcpy(&left, end, 1);
> > +     if (len & 4) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += (uint32_t)p16[0] + p16[1];
> > +             buf = RTE_PTR_ADD(buf, 4);
> > +     }
> > +
> > +     if (len & 2) {
> > +             const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> > +             sum += *p16;
> > +             buf = RTE_PTR_ADD(buf, 2);
> > +     }
> > +
> > +     /* If length is odd use memcpy for byte order independence */
> > +     if (len & 1) {
> > +             uint16_t left = 0;
> > +             memcpy(&left, buf, 1);
> >               sum += left;
> >       }
> >
> > diff --git a/lib/net/rte_ip6.h b/lib/net/rte_ip6.h
> > index d1abf1f5d5..af65a39815 100644
> > --- a/lib/net/rte_ip6.h
> > +++ b/lib/net/rte_ip6.h
> > @@ -564,7 +564,7 @@ rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
> >       struct {
> >               rte_be32_t len;   /* L4 length. */
> >               rte_be32_t proto; /* L4 protocol - top 3 bytes must be zero */
> > -     } psd_hdr;
> > +     } psd_hdr = {0}; /* Empty initializer avoids false-positive maybe-uninitialized warning */
> >
> >       psd_hdr.proto = (uint32_t)(ipv6_hdr->proto << 24);
> >       if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
>
> Maybe ipv6 can be fixed like this instead:
> -       if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
> -               psd_hdr.len = 0;
> -       else
> -               psd_hdr.len = ipv6_hdr->payload_len;
> +       psd_hdr.len = (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) ?
> +                       0 : ipv6_hdr->payload_len;
>
> > --
> > 2.39.5 (Apple Git-154)
>
