Sorry for double post - if you're happy with the changes please add a Tested-by: tag?
> -----Original Message----- > From: Van Haaren, Harry > Sent: Thursday, June 27, 2019 1:07 PM > To: 'Malvika Gupta' <malvika.gu...@arm.com> > Cc: nd <n...@arm.com>; Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; > d...@openvswitch.org > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic > scalar functions > > > -----Original Message----- > > From: Malvika Gupta [mailto:malvika.gu...@arm.com] > > Sent: Tuesday, June 25, 2019 8:19 PM > > To: Van Haaren, Harry <harry.van.haa...@intel.com> > > Cc: nd <n...@arm.com>; Yanqin Wei (Arm Technology China) > <yanqin....@arm.com>; > > d...@openvswitch.org > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized generic > > scalar functions > > > > Hi Harry, > > Hi Malvika, > > > > I tested your patch on two ARM machines, namely ThunderX2 and Octeon-Tx. > > > > On Octeon-Tx (running gcc version 7.4.0), a performance improvement of 9- > 10% > > was seen. On the ThunderX2 (running gcc version 8.3.0), a performance > > improvement of 14-15% was observed. For both machines, I disabled the EMC > > and tested with two traffic patterns of 64B packet size, specifically > > UDP/IPv4/Ethernet and IPv4/Ethernet (Raw IP). > > Great - thanks for testing and reporting back performance, appreciated! > > > > Please let me know if you require any more details. > > Thanks, > > Malvika > > It would be good to know which miniflow fingerprints are commonly deployed > with OVS. > > If we (the OVS community) had more input in terms of benchmarks that OVS > deployments > use, then we can optimize against those more (as was mentioned at OVS Conf > '18 too :) > https://youtu.be/5-MDlpUIOBE?list=PLaJlRa-xItwCzuAL3mP6n02vmXab4Bwu-&t=1471 > > I haven't had much success with getting benchmarks yet - but if you have > something > like a use-case or benchmark to test against, sharing it with OVS Community > would be great! > > -Harry > > > > > -----Original Message----- > > > From: Malvika Gupta > > > Sent: Wednesday, June 19, 2019 1:34 PM > > > To: Van Haaren, Harry <harry.van.haa...@intel.com> > > > Cc: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; nd > > > <n...@arm.com>; d...@openvswitch.org > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized > generic > > > scalar functions > > > > > > Hi Harry > > > > > > Thanks for your reply. I have some last few things I wanted to clarify. > > Please > > > see my inline comments below. > > > > > > > -----Original Message----- > > > > From: Van Haaren, Harry <harry.van.haa...@intel.com> > > > > Sent: Wednesday, June 19, 2019 11:22 AM > > > > To: Malvika Gupta <malvika.gu...@arm.com> > > > > Cc: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; nd > > > > <n...@arm.com>; d...@openvswitch.org > > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized > > > > generic scalar functions > > > > > > > > > -----Original Message----- > > > > > From: Malvika Gupta [mailto:malvika.gu...@arm.com] > > > > > Sent: Wednesday, June 19, 2019 5:17 PM > > > > > To: Van Haaren, Harry <harry.van.haa...@intel.com> > > > > > Cc: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; nd > > > > > <n...@arm.com>; d...@openvswitch.org > > > > > Subject: RE: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized > > > > > generic scalar functions > > > > > > > > > > Hi Harry, > > > > > > > > Hi Malvika, > > > > > > > > > > > > > I would like to test your patch on ARM platforms and report any > > > > > performance improvement. Can you please tell me the test vectors you > > > > > used to match the subtables that call the specialized lookup > > > > > functions, i.e., dpcls_subtable_lookup_mf_u0w5_u1w1, > > > > > dpcls_subtable_lookup_mf_u0w4_u1w1, and > > > > > dpcls_subtable_lookup_mf_u0w4_u1w0? And how did you make that > > > > > distinction with traffic patterns, i.e., which traffic pattern would > > > > > match with > > > > which subtable lookup? > > > > > > > > I was just running Eth/IPv4/UDP data, and using OvS to do Phy to Phy > > > > forwarding. > > > > > > > > ./utilities/ovs-ofctl add-flow br0 in_port=1,action=output:2 > > > > ./utilities/ovs- ofctl add-flow br0 in_port=2,action=output:1 > > > > > > > > This is an extremely simple setup, however it is enough to showcase > > > > value of the patches here. > > > > > > > > By changing traffic to Eth/IPv4 and Eth alone, you can change the # of > > > > miniflow items, and hit the different u0_w5_u1_w0 cases :) > > > > > > > > > > 1. Are you also sending combinations of traffic flows at the same time? > > Like > > > UDP/IPv4/Eth is one traffic flow, raw IP could be another and Eth > packets > > > only is a third flow. Did you also test by say, sending these 3 flows > > together > > > which would all hit different subtables or just one kind of traffic flow > > at a > > > time? > > > 2. Just to clarify, packet size is 64B, right? > > > > > > I am sorry if these seem trivial but I just wanted to have a clear > > > understanding of the testing environment. > > > Thanks! > > > > > > > If you have suggestions for other subtables to special case, please > > > > post the u0_wX_u1_wY values that you'd like to see, and I'll add them > to > > > the patchset. > > > > > > > > > > > > > Thank you, > > > > > Malvika Gupta > > > > > > > > Testing welcomed, thanks! -Harry > > > > > > > > > > -----Original Message----- > > > > > > From: ovs-dev-boun...@openvswitch.org <ovs-dev- > > > > > > boun...@openvswitch.org> On Behalf Of Harry van Haaren > > > > > > Sent: Wednesday, May 8, 2019 11:13 PM > > > > > > To: ovs-dev@openvswitch.org > > > > > > Cc: i.maxim...@samsung.com > > > > > > Subject: [ovs-dev] [PATCH v9 5/5] dpif-netdev: add specialized > > > > > > generic > > > > > scalar > > > > > > functions > > > > > > > > > > > > This commit adds a number of specialized functions, that handle > > > > > > common miniflow fingerprints. This enables compiler optimization, > > > > > > resulting in > > > > > higher > > > > > > performance. Below a quick description of how this optimization > > > > > > actually works; > > > > > > > > > > > > "Specialized functions" are "instances" of the generic > > > > > > implementation, but the compiler is given extra context when > > > > > > compiling. In the case of > > > > > iterating > > > > > > miniflow datastructures, the most interesting value to enable > > > > > > compile time optimizations is the loop trip count per unit. > > > > > > > > > > > > In order to create a specialized function, there is a generic > > > > > implementation, > > > > > > which uses a for() loop without the compiler knowing the loop trip > > > > > > count > > > > > at > > > > > > compile time. The loop trip count is passed in as an argument to > > > > > > the > > > > > function: > > > > > > > > > > > > uint32_t miniflow_impl_generic(struct miniflow *mf, uint32_t > > > > > > loop_count) > > > > { > > > > > > for(uint32_t i = 0; i < loop_count; i++) > > > > > > // do work > > > > > > } > > > > > > > > > > > > In order to "specialize" the function, we call the generic > > > > > > implementation > > > > > with > > > > > > hard-coded numbers - these are compile time constants! > > > > > > > > > > > > uint32_t miniflow_impl_loop5(struct miniflow *mf, uint32_t > > > > > > loop_count) > > > > { > > > > > > // use hard coded constant for compile-time constant- > propogation > > > > > > return miniflow_impl_generic(mf, 5); } > > > > > > > > > > > > Given the compiler is aware of the loop trip count at compile > > > > > > time, it can perform an optimization known as "constant > > propogation". > > > > > > Combined with inlining of the miniflow_impl_generic() function, > > > > > > the compiler is now > > > > > enabled > > > > > > to *compile time* unroll the loop 5x, and produce "flat" code. > > > > > > > > > > > > The last step to using the specialized functions is to utilize a > > > > > > function- > > > > > pointer > > > > > > to choose the specialized (or generic) implementation. > > > > > > The selection of the function pointer is performed at subtable > > > > > > creation > > > > > time, > > > > > > when miniflow fingerprint of the subtable is known. This technique > > > > > > is > > > > > known > > > > > > as "multiple dispatch" in some literature, as it uses multiple > > > > > > items of information (miniflow bit counts) to select the dispatch > > > function. > > > > > > > > > > > > By pointing the function pointer at the optimized implementation, > > > > > > OvS benefits from the compile time optimizations at runtime. > > > > > > > > > > > > Signed-off-by: Harry van Haaren <harry.van.haa...@intel.com> > > > > > > > > > > > > --- > > > > > > > > > > > > v8: > > > > > > - Rework to use blocks_cache from the dpcls instance, to avoid > > variable > > > > > > lenght arrays in the data-path. > > > > > > --- > > > > > > lib/dpif-netdev-lookup-generic.c | 83 > > > > > > +++++++++++++++++++++++++++++- > > > > -- > > > > > > lib/dpif-netdev.c | 9 +++- > > > > > > lib/dpif-netdev.h | 8 +++ > > > > > > 3 files changed, 90 insertions(+), 10 deletions(-) > > > > > > > > > > > > diff --git a/lib/dpif-netdev-lookup-generic.c > > > > > > b/lib/dpif-netdev-lookup- generic.c index 901d28ff0..ba2d024cc > > > > > > 100644 > > > > > > --- a/lib/dpif-netdev-lookup-generic.c > > > > > > +++ b/lib/dpif-netdev-lookup-generic.c > > > > > > @@ -100,11 +100,11 @@ netdev_flow_key_flatten(const struct > > > > > > netdev_flow_key * restrict key, > > > > > > > > > > > > /* Unit 0 flattening */ > > > > > > netdev_flow_key_flatten_unit(&pkt_blocks[0], > > > > > > - &tbl_blocks[0], > > > > > > - &mf_masks[0], > > > > > > - &blocks_scratch[0], > > > > > > - pkt_bits_u0, > > > > > > - u0_count); > > > > > > + &tbl_blocks[0], > > > > > > + &mf_masks[0], > > > > > > + &blocks_scratch[0], > > > > > > + pkt_bits_u0, > > > > > > + u0_count); > > > > > > > > > > > > /* Unit 1 flattening: > > > > > > * Move the pointers forward in the arrays based on u0 > offsets, > > > NOTE: > > > > > > @@ -225,7 +225,74 @@ dpcls_subtable_lookup_generic(struct > > > > > > dpcls_subtable *subtable, > > > > > > const struct netdev_flow_key > *keys[], > > > > > > struct dpcls_rule **rules) { > > > > > > - return lookup_generic_impl(subtable, blocks_scratch, > > keys_map, > > > > > keys, > > > > > > - rules, subtable- > > >mf_bits_set_unit0, > > > > > > - subtable->mf_bits_set_unit1); > > > > > > + /* Here the runtime subtable->mf_bits counts are used, which > > > > > > + forces > > > > > the > > > > > > + * compiler to iterate normal for() loops. Due to this > > > > > > + limitation in > > > > > the > > > > > > + * compilers available optimizations, this function has lower > > > > > performance > > > > > > + * than the below specialized functions. > > > > > > + */ > > > > > > + return lookup_generic_impl(subtable, blocks_scratch, > > > > > > + keys_map, keys, > > > > > > rules, > > > > > > + subtable->mf_bits_set_unit0, > > > > > > + subtable->mf_bits_set_unit1); } > > > > > > + > > > > > > +static uint32_t > > > > > > +dpcls_subtable_lookup_mf_u0w5_u1w1(struct dpcls_subtable > > > > *subtable, > > > > > > + uint64_t *blocks_scratch, > > > > > > + uint32_t keys_map, > > > > > > + const struct netdev_flow_key > > *keys[], > > > > > > + struct dpcls_rule **rules) { > > > > > > + /* hard coded bit counts - enables compile time loop > unrolling, > > and > > > > > > + * generating of optimized code-sequences due to loop > unrolled > > > code. > > > > > > + */ > > > > > > + return lookup_generic_impl(subtable, blocks_scratch, > > > > > > +keys_map, keys, > > > > > > rules, > > > > > > + 5, 1); } > > > > > > + > > > > > > +static uint32_t > > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w1(struct dpcls_subtable > > > > *subtable, > > > > > > + uint64_t *blocks_scratch, > > > > > > + uint32_t keys_map, > > > > > > + const struct netdev_flow_key > > *keys[], > > > > > > + struct dpcls_rule **rules) { > > > > > > + return lookup_generic_impl(subtable, blocks_scratch, > > > > > > +keys_map, keys, > > > > > > rules, > > > > > > + 4, 1); } > > > > > > + > > > > > > +static uint32_t > > > > > > +dpcls_subtable_lookup_mf_u0w4_u1w0(struct dpcls_subtable > > > > *subtable, > > > > > > + uint64_t *blocks_scratch, > > > > > > + uint32_t keys_map, > > > > > > + const struct netdev_flow_key > > *keys[], > > > > > > + struct dpcls_rule **rules) { > > > > > > + return lookup_generic_impl(subtable, blocks_scratch, > > > > > > +keys_map, keys, > > > > > > rules, > > > > > > + 4, 0); } > > > > > > + > > > > > > +/* Probe function to lookup an available specialized function. > > > > > > + * If capable to run the requested miniflow fingerprint, this > > > > > > +function returns > > > > > > + * the most optimal implementation for that miniflow fingerprint. > > > > > > + * @retval FunctionAddress A valid function to handle the > > > > > > +miniflow bit pattern > > > > > > + * @retval 0 The requested miniflow is not supported here, NULL > > > > > > +is returned */ dpcls_subtable_lookup_func > > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bits, uint32_t u1_bits) > { > > > > > > + dpcls_subtable_lookup_func f = NULL; > > > > > > + > > > > > > + if (u0_bits == 5 && u1_bits == 1) { > > > > > > + f = dpcls_subtable_lookup_mf_u0w5_u1w1; > > > > > > + } else if (u0_bits == 4 && u1_bits == 1) { > > > > > > + f = dpcls_subtable_lookup_mf_u0w4_u1w1; > > > > > > + } else if (u0_bits == 4 && u1_bits == 0) { > > > > > > + f = dpcls_subtable_lookup_mf_u0w4_u1w0; > > > > > > + } > > > > > > + > > > > > > + if (f) { > > > > > > + VLOG_INFO("Subtable using Generic Optimized for u0 %d, > > > u1 %d\n", > > > > > > + u0_bits, u1_bits); > > > > > > + } > > > > > > + return f; > > > > > > } > > > > > > diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index > > > > > 33b93cfdf..23fc5b7a6 > > > > > > 100644 > > > > > > --- a/lib/dpif-netdev.c > > > > > > +++ b/lib/dpif-netdev.c > > > > > > @@ -7623,8 +7623,13 @@ dpcls_create_subtable(struct dpcls *cls, > > > > > > const struct netdev_flow_key *mask) > > > > > > cls->blocks_scratch_size = blocks_required_per_pkt; > > > > > > } > > > > > > > > > > > > - /* Assign the generic lookup - this works with any miniflow > > > > > fingerprint */ > > > > > > - subtable->lookup_func = dpcls_subtable_lookup_generic; > > > > > > + /* Probe for an optimmized variant */ > > > > > > + subtable->lookup_func = dpcls_subtable_generic_probe(unit0, > > > > > > + unit1); > > > > > > + > > > > > > + /* If not set, assign generic lookup - this works with any > > > > > > + miniflow > > > > > */ > > > > > > + if (!subtable->lookup_func) { > > > > > > + subtable->lookup_func = dpcls_subtable_lookup_generic; > > > > > > + } > > > > > > > > > > > > cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask- > > > > >hash); > > > > > > /* Add the new subtable at the end of the pvector (with no > > > > > > hits > > > > > > yet) > > > > > */ diff > > > > > > --git a/lib/dpif-netdev.h b/lib/dpif-netdev.h index > > > > > > 9263256a9..123eabad7 > > > > > > 100644 > > > > > > --- a/lib/dpif-netdev.h > > > > > > +++ b/lib/dpif-netdev.h > > > > > > @@ -70,6 +70,14 @@ typedef uint32_t > > > > > > (*dpcls_subtable_lookup_func)(struct dpcls_subtable *subtable, > > > > > > const struct netdev_flow_key *keys[], > > > > > > struct dpcls_rule **rules); > > > > > > > > > > > > +/* Probe function to select a specialized version of the generic > > > > > > +lookup > > > > > > + * implementation. This provides performance benefit due to > > > > > > +compile-time > > > > > > + * optimizations such as loop-unrolling. These are enabled by the > > > > > > +compile-time > > > > > > + * constants in the specific function implementations. > > > > > > + */ > > > > > > +dpcls_subtable_lookup_func > > > > > > +dpcls_subtable_generic_probe(uint32_t u0_bit_count, uint32_t > > > > > > +u1_bit_count); > > > > > > + > > > > > > /* Prototype for generic lookup func, using same code path as > > > > > > before */ uint32_t dpcls_subtable_lookup_generic(struct > > > > > > dpcls_subtable *subtable, > > > > > > -- > > > > > > 2.17.1 > > > > > > > > > > > > _______________________________________________ > > > > > > dev mailing list > > > > > > d...@openvswitch.org > > > > > > https://mail.openvswitch.org/mailman/listinfo/ovs-dev > > > > > IMPORTANT NOTICE: The contents of this email and any attachments are > > > > > confidential and may also be privileged. If you are not the intended > > > > > recipient, please notify the sender immediately and do not disclose > > > > > the contents to any other person, use it for any purpose, or store > > > > > or copy the information in any medium. Thank you. > > IMPORTANT NOTICE: The contents of this email and any attachments are > > confidential and may also be privileged. If you are not the intended > > recipient, please notify the sender immediately and do not disclose the > > contents to any other person, use it for any purpose, or store or copy the > > information in any medium. Thank you. _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev