Hi Harry,
On Thu, Jun 24, 2021 at 11:07:59AM +0000, Van Haaren, Harry wrote: > > -----Original Message----- > > From: dev <ovs-dev-boun...@openvswitch.org> On Behalf Of Flavio Leitner > > Sent: Thursday, June 24, 2021 4:57 AM > > To: Ferriter, Cian <cian.ferri...@intel.com> > > Cc: ovs-dev@openvswitch.org; i.maxim...@ovn.org > > Subject: Re: [ovs-dev] [v13 12/12] dpcls-avx512: Enable avx512 vector > > popcount > > instruction. > > > > On Thu, Jun 17, 2021 at 05:18:25PM +0100, Cian Ferriter wrote: > > > From: Harry van Haaren <harry.van.haa...@intel.com> > > > > > > This commit enables the AVX512-VPOPCNTDQ Vector Popcount > > > instruction. This instruction is not available on every CPU > > > that supports the AVX512-F Foundation ISA, hence it is enabled > > > only when the additional VPOPCNTDQ ISA check is passed. > > > > > > The vector popcount instruction is used instead of the AVX512 > > > popcount emulation code present in the avx512 optimized DPCLS today. > > > It provides higher performance in the SIMD miniflow processing > > > as that requires the popcount to calculate the miniflow block indexes. > > > > > > Signed-off-by: Harry van Haaren <harry.van.haa...@intel.com> > > > > Acked-by: Flavio Leitner <f...@sysclose.org> > > Thanks for reviewing! > > > This patch series implements low level optimizations by manually > > coding instructions. I wonder if gcc couldn't get some relevant > > level of vectorized optimizations refactoring and enabling > > compiling flags. I assume the answer is no, but I would appreciate > > some enlightenment on the matter. > > Unfortunately no... there is no magic solution here to have the toolchain > provide fallbacks if the latest ISA is not available. You're 100% right, these > are manually implemented versions of new ISA, implemented in "older" > ISA, to allow usage of the functionality. In this case, Skylake grade > "AVX512-F" > is used to implement the Icelake grade "AVX512-VPOPCNTDQ" instruction: > (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_popcnt_epi64%2520&expand=4368,4368) > > I do like the idea of toolchain supporting ISA options a bit more, there is > so much compute performance available that is not widely used today. > Such an effort industry wide would be very beneficial to all for improving > performance, but would be a pretty large undertaking too... outside the > scope of this patchset! :) Yeah, it is. I mean, if the toolchain is not ready yet and we think worth the benefits considering that most probably fewer people will be able to contribute or maintain, then I see no other way to solve the issue. Do you think improving the toolchain is a larger commitment than manually improving applications? A quick look on gcc gave me the impression that it does support at least some basic vector optimization capabilities. > I'll admit to being a bit of an ISA fan, but there's some magical instructions > that can do stuff in 1x instruction that otherwise take large amounts of > shifts & loops. Did I hear somebody ask for examples..?? Out of curiosity, which tool are you using (if you are) to measure the improvements at cycles level? vtune? > Miniflow Bits processing with "BMI" (Bit Manipulation Instructions) > Introduced in Haswell era, > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#othertechs=BMI1,BMI2 > - Favorite instructions are pdep and pext (parallel bit deposit, and parallel > bit extract) > - Very useful for dense bitfield unpacking, instead of "load - shift - AND" > per field, can > unpack up to 8 bitfields in a u64 and align them to byte-boundaries > - Its "opposite" "pext" also exists, extracting sparse bits from an integer > into a packed layout > (pext is used in DPCLS, to pull sparse bits from the packet's miniflow into > linear packed layout, > allowing it to be processed in a single packed AVX512 register) > > Note that we're all benefitting from novel usage of the scalar "popcount" > instruction too, since merging > commit: a0b36b392 (introduced in SSE4.2, with CPUID flag POPCNT) It uses a > bitmask & popcount approach > to index into the miniflow, improving on the previous "count and shifts bits" > to iterate miniflows approach. > > There are likely multiple other places in OVS where we spend significant > cycles > on processing data in ways that can be accelerated significantly by using all > available ISA. > There is ongoing work in miniflow extract (MFEX) with AVX512 SIMD ISA, > allowing parsing > of multiple packet protocols at the same time (see here > https://patchwork.ozlabs.org/project/openvswitch/list/?series=249470) > > I'll stop promoting ISA here, but am happy to continue detailed discussions, > or break out > conversations about specific areas of compute in OVS if there's appetite for > that! Feel free > to email to OVS Mailing list (with me on CC please :) or email directly OK > too. I am definitely learning more about it and I appreciated your longer reply. Thanks, -- fbl _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev