On 12.07.2019 17:19, Ian Stokes wrote: > On 7/9/2019 1:34 PM, Harry van Haaren wrote: >> This commit refactors the generic implementation. The >> goal of this refactor is to simply the code to enable > 'simply' -> 'simplify'? > >> "specialization" of the functions at compile time. >> >> Given compile-time optimizations, the compiler is able >> to unroll loops, and create optimized code sequences due >> to compile time knowledge of loop-trip counts. >> >> In order to enable these compiler optimizations, we must >> refactor the code to pass the loop-trip counts to functions >> as compile time constants. >> >> This patch allows the number of miniflow-bits set per "unit" >> in the miniflow to be passed around as a function argument. >> >> Note that this patch does NOT yet take advantage of doing so, >> this is only a refactor to enable it in the next patches. >> >> Signed-off-by: Harry van Haaren <harry.van.haa...@intel.com> >> Tested-by: Malvika Gupta <malvika.gu...@arm.com> >> > > Thanks for the v10 Harry, some comments below. > >> --- >> >> v10: >> - Rebase updates from previous patches >> - Fix whitespace indentation of func params >> - Removed restrict keyword, Windows CI failing when it is used (Ian) >> - Fix integer 0 used to set NULL pointer (Ilya) >> - Postpone free() call on cls->blocks_scratch (Ilya) >> - Fix indentation of a function >> >> v9: >> - Use count_1bits in favour of __builtin_popcount (Ilya) >> - Use ALWAYS_INLINE instead of __attribute__ synatx (Ilya) >> >> v8: >> - Rework block_cache and mf_masks to avoid variable-lenght array >> due to compiler issues. Provisioning for worst case is not a >> good solution due to magnitue of over-provisioning required. >> - Rework netdev_flatten function removing unused parameter >> --- >> lib/dpif-netdev-lookup-generic.c | 239 ++++++++++++++++++++++++------- >> lib/dpif-netdev.c | 77 +++++++++- >> lib/dpif-netdev.h | 18 +++ >> 3 files changed, 281 insertions(+), 53 deletions(-) >> >> diff --git a/lib/dpif-netdev-lookup-generic.c >> b/lib/dpif-netdev-lookup-generic.c >> index d49d4b570..432d8782e 100644 >> --- a/lib/dpif-netdev-lookup-generic.c >> +++ b/lib/dpif-netdev-lookup-generic.c >> @@ -29,67 +29,204 @@ >> #include "packets.h" >> #include "pvector.h" >> -/* Returns a hash value for the bits of 'key' where there are 1-bits in >> - * 'mask'. */ >> -static inline uint32_t >> -netdev_flow_key_hash_in_mask(const struct netdev_flow_key *key, >> - const struct netdev_flow_key *mask) >> +VLOG_DEFINE_THIS_MODULE(dpif_lookup_generic); >> + >> +/* netdev_flow_key_flatten_unit: > > No need to mention function name in comments. Applies to other function > comments in the patch also. > > For function comments in OVS you can follow: > > http://docs.openvswitch.org/en/latest/internals/contributing/coding-style/#functions > >> + * Given a packet, table and mf_masks, this function iterates over each bit >> + * set in the subtable, and calculates the appropriate metadata to store in >> the >> + * blocks_scratch[]. >> + * >> + * The results of the blocks_scratch[] can be used for hashing, and later >> for >> + * verification of if a rule matches the given packet. >> + */ >> +static inline void >> +netdev_flow_key_flatten_unit(const uint64_t *pkt_blocks, >> + const uint64_t *tbl_blocks, >> + const uint64_t *mf_masks, >> + uint64_t *blocks_scratch, >> + const uint64_t pkt_mf_bits, >> + const uint32_t count) >> { >> - const uint64_t *p = miniflow_get_values(&mask->mf); >> - uint32_t hash = 0; >> - uint64_t value; >> + uint32_t i; > New line just to separate variable declaration form function body. > >> + for (i = 0; i < count; i++) { >> + uint64_t mf_mask = mf_masks[i]; >> + /* Calculate the block index for the packet metadata */ >> + uint64_t idx_bits = mf_mask & pkt_mf_bits; >> + const uint32_t pkt_idx = count_1bits(idx_bits); >> - NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP (value, key, mask->mf.map) { >> - hash = hash_add64(hash, value & *p); >> - p++; >> + /* check if the packet has the subtable miniflow bit set. If yes, >> the > > For above and for other comments in the patch, capitalization at beginning, > period to finish. > >> + * block at the above pkt_idx will be stored, otherwise it is masked >> + * out to be zero. >> + */ >> + uint64_t pkt_has_mf_bit = (mf_mask + 1) & pkt_mf_bits; >> + uint64_t no_bit = ((!pkt_has_mf_bit) > 0) - 1; >> + >> + /* mask packet block by table block, and mask to zero if packet >> + * doesn't actually contain this block of metadata >> + */ >> + blocks_scratch[i] = pkt_blocks[pkt_idx] & tbl_blocks[i] & no_bit; >> } >> +} >> + >> +/* netdev_flow_key_flatten: >> + * This function takes a packet, and subtable and writes an array of >> uint64_t >> + * blocks. The blocks contain the metadata that the subtable matches on, in >> + * the same order as the subtable, allowing linear iteration over the >> blocks. >> + * >> + * To calculate the blocks contents, the netdev_flow_key_flatten_unit >> function >> + * is called twice, once for each "unit" of the miniflow. This call can be >> + * inlined by the compiler for performance. >> + * >> + * Note that the u0_count and u1_count variables can be compile-time >> constants, >> + * allowing the loop in the inlined flatten_unit() function to be >> compile-time >> + * unrolled, or possibly removed totally by unrolling by the loop >> iterations. >> + * The compile time optimizations enabled by this design improves >> performance. >> + */ >> +static inline void >> +netdev_flow_key_flatten(const struct netdev_flow_key *key, >> + const struct netdev_flow_key *mask, >> + const uint64_t *mf_masks, >> + uint64_t *blocks_scratch, >> + const uint32_t u0_count, >> + const uint32_t u1_count) >> +{ >> + /* load mask from subtable, mask with packet mf, popcount to get idx */ >> + const uint64_t *pkt_blocks = miniflow_get_values(&key->mf); >> + const uint64_t *tbl_blocks = miniflow_get_values(&mask->mf); >> + >> + /* packet miniflow bits to be masked by pre-calculated mf_masks */ >> + const uint64_t pkt_bits_u0 = key->mf.map.bits[0]; >> + const uint32_t pkt_bits_u0_pop = count_1bits(pkt_bits_u0); >> + const uint64_t pkt_bits_u1 = key->mf.map.bits[1]; >> - return hash_finish(hash, (p - miniflow_get_values(&mask->mf)) * 8); >> + /* Unit 0 flattening */ >> + netdev_flow_key_flatten_unit(&pkt_blocks[0], >> + &tbl_blocks[0], >> + &mf_masks[0], >> + &blocks_scratch[0], >> + pkt_bits_u0, >> + u0_count); >> + >> + /* Unit 1 flattening: >> + * Move the pointers forward in the arrays based on u0 offsets, NOTE: >> + * 1) pkt blocks indexed by actual popcount of u0, which is NOT always >> + * the same as the amount of bits set in the subtable. >> + * 2) mf_masks, tbl_block and blocks_scratch are all "flat" arrays, so >> + * the index is always u0_count. >> + */ >> + netdev_flow_key_flatten_unit(&pkt_blocks[pkt_bits_u0_pop], >> + &tbl_blocks[u0_count], >> + &mf_masks[u0_count], >> + &blocks_scratch[u0_count], >> + pkt_bits_u1, >> + u1_count); >> +} >> + >> +static inline uint64_t >> +netdev_rule_matches_key(const struct dpcls_rule *rule, >> + const uint32_t mf_bits_total, >> + const uint64_t *blocks_scratch) >> +{ >> + const uint64_t *keyp = miniflow_get_values(&rule->flow.mf); >> + const uint64_t *maskp = miniflow_get_values(&rule->mask->mf); >> + >> + uint64_t not_match = 0; >> + for (int i = 0; i < mf_bits_total; i++) { >> + not_match |= (blocks_scratch[i] & maskp[i]) != keyp[i]; >> + } >> + >> + /* invert result to show match as 1 */ >> + return !not_match; >> } >> +/* const prop version of the function: note that mf bits total and u0 are >> + * explicitly passed in here, while they're also available at runtime from >> the >> + * subtable pointer. By making them compile time, we enable the compiler to >> + * unroll loops and flatten out code-sequences based on the knowledge of the >> + * mf_bits_* compile time values. This results in improved performance. >> + */ >> +static inline uint32_t ALWAYS_INLINE >> +lookup_generic_impl(struct dpcls_subtable *subtable, >> + uint64_t *blocks_scratch, >> + uint32_t keys_map, >> + const struct netdev_flow_key *keys[], >> + struct dpcls_rule **rules, >> + const uint32_t bit_count_u0, >> + const uint32_t bit_count_u1) >> +{ >> + const uint32_t n_pkts = count_1bits(keys_map); >> + ovs_assert(NETDEV_MAX_BURST >= n_pkts); >> + uint32_t hashes[NETDEV_MAX_BURST]; >> + >> + const uint32_t bit_count_total = bit_count_u0 + bit_count_u1; >> + uint64_t *mf_masks = subtable->mf_masks; >> + int i; >> + >> + /* Flatten the packet metadata into the blocks_scratch[] using subtable >> */ >> + ULLONG_FOR_EACH_1(i, keys_map) { >> + netdev_flow_key_flatten(keys[i], >> + &subtable->mask, >> + mf_masks, >> + &blocks_scratch[i * bit_count_total], >> + bit_count_u0, >> + bit_count_u1); >> + } >> + >> + /* Hash the now linearized blocks of packet metadata */ >> + ULLONG_FOR_EACH_1(i, keys_map) { >> + uint32_t hash = 0; >> + uint32_t i_off = i * bit_count_total; >> + for (int h = 0; h < bit_count_total; h++) { >> + hash = hash_add64(hash, blocks_scratch[i_off + h]); >> + } >> + hashes[i] = hash_finish(hash, bit_count_total * 8); > > Can we replace magic 8 above.
Another question. Can we use: ULLONG_FOR_EACH_1 (i, keys_map) { hashes[i] = hash_words64_inline(&blocks_scratch[i * bit_count_total], bit_count_total, 0); } ? There should be same effect. > >> + } >> + >> + /* Lookup: this returns a bitmask of packets where the hash table had >> + * an entry for the given hash key. Presence of a hash key does not >> + * guarantee matching the key, as there can be hash collisions. >> + */ >> + uint32_t found_map; >> + const struct cmap_node *nodes[NETDEV_MAX_BURST]; >> + found_map = cmap_find_batch(&subtable->rules, keys_map, hashes, nodes); >> + >> + /* Verify that packet actually matched rule. If not found, a hash >> + * collision has taken place, so continue searching with the next node. >> + */ >> + ULLONG_FOR_EACH_1(i, found_map) { >> + struct dpcls_rule *rule; >> + >> + CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) { >> + const uint32_t cidx = i * bit_count_total; >> + uint32_t match = netdev_rule_matches_key(rule, bit_count_total, >> + &blocks_scratch[cidx]); >> + >> + if (OVS_LIKELY(match)) { >> + rules[i] = rule; >> + subtable->hit_cnt++; >> + goto next; >> + } >> + } >> + >> + /* None of the found rules was a match. Clear the i-th bit to >> + * search for this key in the next subtable. */ >> + ULLONG_SET0(found_map, i); >> + next: >> + ; /* Keep Sparse happy. */ >> + } >> + >> + return found_map; >> +} >> + >> +/* Generic - use runtime provided mf bits */ >> uint32_t >> dpcls_subtable_lookup_generic(struct dpcls_subtable *subtable, >> + uint64_t *blocks_scratch, >> uint32_t keys_map, >> const struct netdev_flow_key *keys[], >> struct dpcls_rule **rules) >> { >> - int i; >> - /* Compute hashes for the remaining keys. Each search-key is >> - * masked with the subtable's mask to avoid hashing the wildcarded >> - * bits. */ >> - uint32_t hashes[NETDEV_MAX_BURST]; >> - ULLONG_FOR_EACH_1(i, keys_map) { >> - hashes[i] = netdev_flow_key_hash_in_mask(keys[i], >> - &subtable->mask); >> - } >> - >> - /* Lookup. */ >> - const struct cmap_node *nodes[NETDEV_MAX_BURST]; >> - uint32_t found_map = >> - cmap_find_batch(&subtable->rules, keys_map, hashes, nodes); >> - /* Check results. When the i-th bit of found_map is set, it means >> - * that a set of nodes with a matching hash value was found for the >> - * i-th search-key. Due to possible hash collisions we need to >> check >> - * which of the found rules, if any, really matches our masked >> - * search-key. */ >> - ULLONG_FOR_EACH_1(i, found_map) { >> - struct dpcls_rule *rule; >> - >> - CMAP_NODE_FOR_EACH (rule, cmap_node, nodes[i]) { >> - if (OVS_LIKELY(dpcls_rule_matches_key(rule, keys[i]))) { >> - rules[i] = rule; >> - /* Even at 20 Mpps the 32-bit hit_cnt cannot wrap >> - * within one second optimization interval. */ >> - subtable->hit_cnt++; >> - goto next; >> - } >> - } >> - /* None of the found rules was a match. Reset the i-th bit to >> - * keep searching this key in the next subtable. */ >> - ULLONG_SET0(found_map, i); /* Did not match. */ >> - next: >> - ; /* Keep Sparse happy. */ >> - } >> - >> - return found_map; >> + return lookup_generic_impl(subtable, blocks_scratch, keys_map, keys, >> rules, >> + subtable->mf_bits_set_unit0, >> + subtable->mf_bits_set_unit1); >> } >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c >> index 190cc8918..03ab5267a 100644 >> --- a/lib/dpif-netdev.c >> +++ b/lib/dpif-netdev.c >> @@ -233,6 +233,15 @@ struct dpcls { >> odp_port_t in_port; >> struct cmap subtables_map; >> struct pvector subtables; >> + >> + /* Region of memory for this DPCLS instance to use as scratch. >> + * Size is garaunteed to be large enough to hold all blocks required for > 'garaunteed' -> 'guaranteed' > >> + * the subtable's to match on. This allows each dpcls lookup to flatten >> + * the packet miniflows into this blocks_scratch area, without using >> + * variable lenght arrays. This region is allocated on subtable create, >> and >> + * will be resized as required if a larger subtable is added. */ >> + uint64_t *blocks_scratch; >> + uint32_t blocks_scratch_size; >> }; >> /* Data structure to keep packet order till fastpath processing. */ >> @@ -7567,6 +7576,7 @@ static void >> dpcls_subtable_destroy_cb(struct dpcls_subtable *subtable) >> { >> cmap_destroy(&subtable->rules); >> + ovsrcu_postpone(free, subtable->mf_masks); >> ovsrcu_postpone(free, subtable); >> } >> @@ -7577,6 +7587,8 @@ dpcls_init(struct dpcls *cls) >> { >> cmap_init(&cls->subtables_map); >> pvector_init(&cls->subtables); >> + cls->blocks_scratch = NULL; >> + cls->blocks_scratch_size = 0; >> } >> static void >> @@ -7604,6 +7616,7 @@ dpcls_destroy(struct dpcls *cls) >> } >> cmap_destroy(&cls->subtables_map); >> pvector_destroy(&cls->subtables); >> + ovsrcu_postpone(free, cls->blocks_scratch); >> } >> } >> @@ -7619,7 +7632,28 @@ dpcls_create_subtable(struct dpcls *cls, const >> struct netdev_flow_key *mask) >> subtable->hit_cnt = 0; >> netdev_flow_key_clone(&subtable->mask, mask); >> - /* Decide which hash/lookup/verify function to use. */ >> + /* The count of bits in the mask defines the space required for masks. >> + * Then call gen_masks() to create the appropriate masks, avoiding the >> cost >> + * of doing runtime calculations. */ >> + uint32_t unit0 = count_1bits(mask->mf.map.bits[0]); >> + uint32_t unit1 = count_1bits(mask->mf.map.bits[1]); >> + subtable->mf_bits_set_unit0 = unit0; >> + subtable->mf_bits_set_unit1 = unit1; >> + >> + subtable->mf_masks = xmalloc(sizeof(uint64_t) * (unit0 + unit1)); >> + netdev_flow_key_gen_masks(mask, subtable->mf_masks, unit0, unit1); >> + >> + /* Allocate blocks scratch space only if subtable requires more size >> than >> + * is currently allocated. */ >> + const uint32_t blocks_required_per_pkt = unit0 + unit1; >> + if (cls->blocks_scratch_size < blocks_required_per_pkt) { >> + free(cls->blocks_scratch); >> + cls->blocks_scratch = xmalloc(sizeof(uint64_t) * NETDEV_MAX_BURST * >> + blocks_required_per_pkt); > > For sizeof above I don't think you need the () as it's part of an expression. > See OVS code standard below. This may apply to the xmalloc for mf_masks above > as well. > > The sizeof operator is unique among C operators in that it accepts two very > different kinds of operands: an expression or a type. In general, prefer to > specify an expression, e.g. int *x = xmalloc(sizeof *x);. When the operand of > sizeof is an expression, there is no need to parenthesize that operand, and > please don't. > >> + cls->blocks_scratch_size = blocks_required_per_pkt; >> + } >> + >> + /* Assign the generic lookup - this works with any miniflow >> fingerprint. */ >> subtable->lookup_func = dpcls_subtable_lookup_generic; >> cmap_insert(&cls->subtables_map, &subtable->cmap_node, mask->hash); >> @@ -7764,6 +7798,43 @@ dpcls_remove(struct dpcls *cls, struct dpcls_rule >> *rule) >> } >> } >> > > <snip> > >> @@ -95,8 +97,18 @@ struct dpcls_subtable { >> * subtable matches on. The miniflow "bits" are used to select the >> actual >> * dpcls lookup implementation at subtable creation time. >> */ > Although you can't see it here the comment above seems a little misleading, > maybe a copy and paste typo? > > "The lookup function to use for this subtable. From a technical point of > view, this is just an implementation of mutliple dispatch. The attribute > used to identify the ideal function is the miniflow fingerprint that the > subtable matches on. The miniflow "bits" are used to select the actual > dpcls lookup implementation at subtable creation time." > > The mf_bits_set_units are not the lookup function themselves. You could > probably re-word it as something like > > "Miniflow fingerprint that the subtable matches on. The miniflow "bits" are > used to select the actual dpcls lookup implementation at subtable creation > time" > >> + uint8_t mf_bits_set_unit0; >> + uint8_t mf_bits_set_unit1; >> + >> + /* the lookup function to use for this subtable. If there is a known >> + * property of the subtable (eg: only 3 bits of miniflow metadata is >> + * used for the lookup) then this can point at an optimized version of >> + * the lookup function for this particular subtable. */ >> dpcls_subtable_lookup_func lookup_func; >> + /* caches the masks to match a packet to, reducing runtime >> calculations */ >> + uint64_t *mf_masks; >> + >> struct netdev_flow_key mask; /* Wildcards for fields (const). */ >> /* 'mask' must be the last field, additional space is allocated here. >> */ >> }; >> @@ -105,6 +117,12 @@ struct dpcls_subtable { >> #define NETDEV_FLOW_KEY_FOR_EACH_IN_FLOWMAP(VALUE, KEY, FLOWMAP) \ >> MINIFLOW_FOR_EACH_IN_FLOWMAP (VALUE, &(KEY)->mf, FLOWMAP) >> > > 1 liner comment explaining the prototype comment would be nice here. > > Thanks > Ian >> +void >> +netdev_flow_key_gen_masks(const struct netdev_flow_key *tbl, >> + uint64_t *mf_masks, >> + const uint32_t mf_bits_u0, >> + const uint32_t mf_bits_u1); >> + >> bool dpcls_rule_matches_key(const struct dpcls_rule *rule, >> const struct netdev_flow_key *target); >> > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev