> -----Original Message-----
> From: Eelco Chaudron <echau...@redhat.com>
> Sent: Wednesday, June 30, 2021 10:52 AM
> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> Cc: Amber, Kumar <kumar.am...@intel.com>; d...@openvswitch.org; i.maxim...@ovn.org
> Subject: Re: [ovs-dev] [v4 03/12] dpif-netdev: Add study function to select the best mfex function
>
> On 30 Jun 2021, at 11:32, Van Haaren, Harry wrote:
>
> >> -----Original Message-----
> >> From: Eelco Chaudron <echau...@redhat.com>
> >> Sent: Wednesday, June 30, 2021 10:18 AM
> >> To: Van Haaren, Harry <harry.van.haa...@intel.com>
> >> Cc: Amber, Kumar <kumar.am...@intel.com>; d...@openvswitch.org; i.maxim...@ovn.org
> >> Subject: Re: [ovs-dev] [v4 03/12] dpif-netdev: Add study function to select the best mfex function
> >>
> >> On 29 Jun 2021, at 18:32, Van Haaren, Harry wrote:
> >>
> >>>> -----Original Message-----
> >>>> From: dev <ovs-dev-boun...@openvswitch.org> On Behalf Of Eelco Chaudron
<snip away outdated context>

> >>>> Maybe we should report the numbers/hits for the other methods, as they
> >>>> might be equal, and some might be faster in execution time?
> >>>
> >>> As above, the implementations are sorted in performance order. Performance
> >>> here can be known by micro-benchmarks, and developers of such SIMD-optimized
> >>> code can be expected to know which impl is fastest.
> >>
> >> Don't think we can, as it's not documented in the code, and someone can just
> >> add their own with no clue about the existing ones.
>
> Yes, in theory somebody could add their own and get this wrong. There are
> many, many things that could go wrong when making code changes. We cannot
> document everything.

I meant that the code currently does not document that the implementation
table, mfex_impls[], is in order of preference. So I think this should be
added.

Sure, we can document that the impl list is iterated and searched in order, so
a code comment would help there. Will add this to the code.

> >>> In our current code, the avx512_vbmi_* impls are always before the
> >>> avx512_* impls, as the VBMI instruction set allows a faster runtime.
> >>
> >> Guess we need some documentation in the developer's section on how to add
> >> processor-optimized functions, and how to benchmark them (and maybe some
> >> benchmark data for the current implementations).
> >> Also, someone can write a sloppy avx512_vbmi* function that might be
> >> slower than an avx512_* one, right?
>
> What are we trying to achieve here? What is the root problem being addressed?
>
> Yes, somebody "could" write sloppy (complex, advanced, ISA-specific, SIMD)
> avx512 code and have it be slower. Who is realistically going to do that?
>
> I'm fine with documenting a few things if they make sense to document, but
> trying to "hand hold" at every level just doesn't work.
> Adding sections on how to benchmark code, and how function pointers work and
> how to add them? These things are documented in various places across the
> internet.
>
> If there's really an interest to learn AVX512 SIMD optimization, reach out
> to the OVS community, put me on CC, and I'll be willing to help. Adding
> documentation ad nauseam is not the solution, as each optimization is likely
> to have subtle differences.

I think the problem is that, except for you and some other small group at
Intel who might know AVX512, for most of the OVS community this is moving back
to hand-written assembler.

Nitpick, but worth mentioning: optimizing with intrinsics is much easier, and
much less mental overhead, than actual assembler (e.g. register allocation is
handled by the compiler). I agree lots of developers don't see this on a daily
basis, but it's really not that "crazy". Once over the first hurdle of reading
intrinsics, scalar becomes looped scalar becomes vector:

    uint64_t x = y & z;                              /* scalar        */
    for (int i = 0; i < 8; i++) x[i] = y[i] & z[i];  /* looped scalar */
    __m512i x = _mm512_and_si512(y, z);              /* vector        */

Anyway, this is getting off topic, so I'll stop adding detail here.

> So at least some guidelines on what you should do when adding a custom
> function would help. Like ordering them in priority, and maybe a simple
> example of how to benchmark the runtime of the mfex function. Don't think
> this has to be part of this patch, but a follow-up would be nice.

Honestly, I'm still not convinced. Just running the normal OVS benchmarks is
enough: if the cycle counts/packet rates reported by OVS are better, you're
going faster. These things are already documented:
https://docs.openvswitch.org/en/latest/topics/dpdk/pmd/

If you're a developer writing SIMD code, I think it's fair to assume some
level of knowledge of profiling. If not, the OVS documentation is IMO still
_not_ the place to document how to profile optimized code.
There's nothing special about benchmarking these AVX512 MFEX implementations
compared to any other datapath (or otherwise) function.

> >>> <snip code changes till end of patch>
>
> <snip away irrelevant old discussions>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev