> -----Original Message-----
> From: Eelco Chaudron <echau...@redhat.com>
> Sent: Thursday, July 14, 2022 2:25 PM
> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> Cc: d...@openvswitch.org; i.maxim...@ovn.org; Amber, Kumar
> <kumar.am...@intel.com>; Pai G, Sunil <sunil.pa...@intel.com>; Finn, Emma
> <emma.f...@intel.com>; Stokes, Ian <ian.sto...@intel.com>
> Subject: Re: [PATCH v10 10/10] odp-execute: Add ISA implementation of set_masked IPv4 action
> 
> > From: Emma Finn <emma.f...@intel.com>
> >
> > This commit adds support for the AVX512 implementation of the
> > ipv4_set_addrs action as well as an AVX512 implementation of
> > updating the checksums.

<snip>

> > +        /* Update the IP checksum based on updated IP values. */
> > +        uint16_t delta = avx512_ipv4_update_csum(v_res, v_packet);
> > +        uint32_t new_csum = old_csum + delta;
> > +        delta = csum_finish(new_csum);
> > +
> > +        /* Insert new checksum. */
> > +        v_res = _mm256_insert_epi16(v_res, delta, 5);
> > +
> > +        /* If ip_src or ip_dst has been modified, L4 checksum needs to
> > +         * be updated too. */
> > +        if (mask->ipv4_src || mask->ipv4_dst) {
> > +
> > +            uint16_t delta_checksum = avx512_l4_update_csum(v_packet, v_res);
> > +
> 
> Wondering if all this AVX code being executed really is faster than
> recalc_csum32(uh->udp_csum, old_addr, new_addr)?

Ultimately, measuring is worth more than talking about it. In our measurements
here, yes, absolutely it is; the results are available in the cover letter of
the patchset.
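
For readers who want a concrete picture of the scalar approach being compared
against, here is a minimal sketch of an incremental checksum update in the
style of RFC 1624 (HC' = ~(~HC + ~m + m')). This is not the OVS code: the names
fold_and_complement(), full_csum(), update_csum16(), update_csum32() and
csum_demo() are made up for illustration, and all values are treated as
host-order 16-bit words rather than network-order fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits and take the one's complement. */
static uint16_t fold_and_complement(uint32_t sum)
{
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    return (uint16_t) ~sum;
}

/* Full recalculation over 'n' 16-bit words (checksum field set to 0). */
static uint16_t full_csum(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += words[i];
    }
    return fold_and_complement(sum);
}

/* Incremental update for one 16-bit word: HC' = ~(~HC + ~m + m'). */
static uint16_t update_csum16(uint16_t old_csum, uint16_t old_word,
                              uint16_t new_word)
{
    uint32_t sum = (uint16_t) ~old_csum;

    sum += (uint16_t) ~old_word;
    sum += new_word;
    return fold_and_complement(sum);
}

/* Incremental update for a 32-bit value such as an IPv4 address. */
static uint16_t update_csum32(uint16_t old_csum, uint32_t old_val,
                              uint32_t new_val)
{
    uint16_t csum = update_csum16(old_csum, (uint16_t) (old_val >> 16),
                                  (uint16_t) (new_val >> 16));

    return update_csum16(csum, (uint16_t) (old_val & 0xffff),
                         (uint16_t) (new_val & 0xffff));
}

/* Illustrative self-check: returns 1 if the incremental update agrees
 * with a full recalculation after rewriting the source address. */
static int csum_demo(void)
{
    /* Sample IPv4 header as 16-bit words, checksum word (index 5) zeroed. */
    uint16_t hdr[10] = { 0x4500, 0x0073, 0x0000, 0x4000, 0x4011,
                         0x0000, 0xc0a8, 0x0001, 0xc0a8, 0x00c7 };
    uint16_t old_csum = full_csum(hdr, 10);

    assert(old_csum == 0xb861);

    /* Rewrite the source address 192.168.0.1 -> 10.0.0.1. */
    hdr[6] = 0x0a00;
    hdr[7] = 0x0001;

    return update_csum32(old_csum, 0xc0a80001, 0x0a000001)
           == full_csum(hdr, 10);
}
```

The scalar update is a short chain of dependent adds and folds per modified
field. Whether that or a vectorized version is faster on a given CPU is
exactly the kind of question that should be settled by measurement.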

Note that the code here is compute-bound: it's juggling values between
registers, and with XMM/YMM registers a SIMD IPC of 3 can be achieved. That
means that in theory the SIMD code executes ~3 intrinsics *per cycle*, but in
practice the IPC is often *more* due to interleaved scalar code and the
out-of-order execution capabilities of the CPU.

Although the code is verbose (lots of typing), the resulting instruction stream
is generally optimized very well by the compiler and reduced to very small,
dense, hot loops.

I recommend using "perf top" to investigate the hotspots. For those unfamiliar
with the tools and methods, a DPDK Userspace presentation covers exactly this,
using OVS DPCLS as the example code: https://youtu.be/ZmwOKR5JyPk

Regards, -Harry
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
