On 01.04.2019 9:52, Yanqin Wei (Arm Technology China) wrote: > Hi Ilya, > > Really appreciate your time in doing benchmarking for this patch. And many > thanks for your valuable comments. > It is quite possible that the second unneeded cache line prefetching causes > performance drop in case of big number of flows (64K - 256K). > This impact the real world use case, I will try to improve it . For 1-8 flows > cases, it should be not very important because of the lack of actual > deployment.
I'm not actually sure which of these cases is more real. We're suggesting to disable EMC for cases where it's not effective. >From the other side, if VM encapsulates all the incoming traffic and sends it for further processing, there will be only few outcoming flows. > > Best Regards, > Wei Yanqin > > -----Original Message----- > From: Ilya Maximets <i.maxim...@samsung.com> > Sent: Monday, April 1, 2019 2:16 PM > To: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; > d...@openvswitch.org; ian.sto...@intel.com > Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) <gavin...@arm.com> > Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by > prefetching EMC entry. > > On 29.03.2019 20:33, Ilya Maximets wrote: >> Hi. >> I made few tests on PVP with bonded PHY setup and found no significant >> difference in maximum performance with low and medium number of flows >> (8 - 8192). > > Correction: I see the a slight performance drop (~1%) for the cases with very > low number of flows (1-8), and a performance increase (~ 2-3%) for the medium > low number of flows (64 - 4096). > In this scenario packets from VM has already calculated hash on > recirculation, so the prefetching optimization doesn't work for the first > pass through the datapath, but works for the second. > Degradation on single or very low number of flows probably caused by > additional checks and instructions for prefetching the memory that is in > cache already. > >> In case of big number of flows (64K - 256K) I see performance drop in >> about 2-3%. I think that because of prefetching of a second cacheline, >> which is unneeded because current_entry->key.hash likely != key->hash >> and we don't need to compare miniflows while emc_lookup. Changing the >> code to prefetch the first cacheline only decreases the drop to ~1% >> (However, this increases the consumed processing cycles for medium >> numbers of flows described below) > > In general, this patch decreases the performance for cases where EMC is not > efficient and improves it for cases where EMC is efficient, except the cases > with very low numbers of flows, which could be the main concern. > >> >> OTOH, I see a slight decrease (~1%) of consumed cycles per packet for >> the thread that polls HW NICs and send packets to VM, which is good. >> This improvement observed for a medium small number of flows: 512 - 8192. >> For a low (1 - 256) and high (64K - 256K) numbers of flows value of >> consumed processing cycles per packet fro this thread was not affected by >> the patch. >> >> Tests made with average 512B packets, EMC enabled, SMC disabled, TX >> flushing interval 50us. Note that the bottleneck in this case is the >> VM --> bonded PHY part which is the case of 5tuple hash calculation. >> >> See review comments inline. >> >> Best regards, Ilya Maximets. >> >> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote: >>> Hi , OVS Maintainers, >>> >>> Could you help to have a look at this patch? Thanks a lot. >>> >>> Best Regards, >>> Wei Yanqin >>> >>> -----Original Message----- >>> From: Yanqin Wei <yanqin....@arm.com> >>> Sent: Wednesday, March 13, 2019 1:28 PM >>> To: d...@openvswitch.org >>> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) >>> <gavin...@arm.com>; Yanqin Wei (Arm Technology China) >>> <yanqin....@arm.com> >>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by >>> prefetching EMC entry. >>> >>> It is observed that the throughput of multi-flow is worse than single-flow >>> in the EMC NIC to NIC cases. It is because CPU cache-miss increasing in EMC >>> lookup. Each flow need load at least one EMC entry to CPU cache(several >>> cache lines) and compare it with packet miniflow. >>> This patch improve it by prefetching EMC entry in advance. Hash value >>> can be obtained from dpdk rss hash, so this step can be advanced >>> ahead of >>> miniflow_extract() and prefetch EMC entry there. The prefetching size is >>> defined as ROUND_UP(128,CACHE_LINE_SIZE), which can cover majority traffic >>> including TCP/UDP protocol and need 2 cache lines in most modern CPU. >>> Performance test was run in some arm platform. 1000/10000 flows NIC2NIC >>> test achieved around 10% throughput improvement in thunderX2(aarch64 >>> platform). >>> >>> Signed-off-by: Yanqin Wei <yanqin....@arm.com> >>> Reviewed-by: Gavin Hu <gavin...@arm.com> >>> --- >>> lib/dpif-netdev.c | 80 >>> ++++++++++++++++++++++++++++++++++++------------------- >>> 1 file changed, 52 insertions(+), 28 deletions(-) >>> >>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index >>> 4d6d0c3..982082c 100644 >>> --- a/lib/dpif-netdev.c >>> +++ b/lib/dpif-netdev.c >>> @@ -189,6 +189,10 @@ struct netdev_flow_key { >>> #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX / \ >>> DEFAULT_EM_FLOW_INSERT_INV_PROB) >>> >>> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including >>> +TCP/UDP >>> + * protocol. */ >>> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE) >> >> Space after the ',' needed. >> >>> + >>> struct emc_entry { >>> struct dp_netdev_flow *flow; >>> struct netdev_flow_key key; /* key.hash used for emc hash value. */ >>> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread >>> *pmd, struct dp_packet *packet_, } >>> >>> static inline uint32_t >>> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet, >>> - const struct miniflow *mf) >>> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet, >>> + bool md_is_valid) >> >> 1. 'packet' word appears twice in the function name. >> How about 'dpif_netdev_packet_get_rss_hash' and >> 'dpif_netdev_packet_get_5tuple_hash' for the second function? >> >> 2. Please align the second line to the level of next char after '('. >> >> 3. 'md_is_valid' is meaningless name inside the function. >> What about renaming to 'bool account_recirc_id'? >> >>> { >>> - uint32_t hash; >>> + uint32_t hash,recirc_depth; >> >> missing space. >> >>> >>> - if (OVS_LIKELY(dp_packet_rss_valid(packet))) { >>> - hash = dp_packet_get_rss_hash(packet); >>> - } else { >>> - hash = miniflow_hash_5tuple(mf, 0); >>> + hash = dp_packet_get_rss_hash(packet); >>> + >>> + if (md_is_valid) { >>> + /* The RSS hash must account for the recirculation depth to avoid >>> + * collisions in the exact match cache */ >>> + recirc_depth = *recirc_depth_get_unsafe(); >>> + if (OVS_UNLIKELY(recirc_depth)) { >> >> We're in recirculation if md is valid. So, 'OVS_UNLIKELY' is wrong here. >> I think that we don't need to check the value at all. >> 'OVS_UNLIKELY' should be moved to the upper 'if'. >> >>> + hash = hash_finish(hash, recirc_depth); >>> + } >>> dp_packet_set_rss_hash(packet, hash); >> >> This could be moved inside the 'if', because we need to update rss >> hash only if it was changed. >> >>> } >>> >>> @@ -6182,24 +6191,23 @@ >>> dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet, } >>> >>> static inline uint32_t >>> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet, >>> - const struct miniflow *mf) >>> +dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet, >>> + const struct miniflow *mf, >>> + bool md_is_valid) >> >> Alignments. >> >>> { >>> - uint32_t hash, recirc_depth; >>> + uint32_t hash,recirc_depth; >> >> Space. >> >>> >>> - if (OVS_LIKELY(dp_packet_rss_valid(packet))) { >>> - hash = dp_packet_get_rss_hash(packet); >>> - } else { >>> - hash = miniflow_hash_5tuple(mf, 0); >>> - dp_packet_set_rss_hash(packet, hash); >>> - } >>> + hash = miniflow_hash_5tuple(mf, 0); >>> + dp_packet_set_rss_hash(packet, hash); >>> >>> - /* The RSS hash must account for the recirculation depth to avoid >>> - * collisions in the exact match cache */ >>> - recirc_depth = *recirc_depth_get_unsafe(); >>> - if (OVS_UNLIKELY(recirc_depth)) { >>> - hash = hash_finish(hash, recirc_depth); >>> - dp_packet_set_rss_hash(packet, hash); >>> + if (md_is_valid) { >>> + /* The RSS hash must account for the recirculation depth to avoid >>> + * collisions in the exact match cache */ >>> + recirc_depth = *recirc_depth_get_unsafe(); >>> + if (OVS_UNLIKELY(recirc_depth)) { >> >> Same as previous. >> >>> + hash = hash_finish(hash, recirc_depth); >>> + dp_packet_set_rss_hash(packet, hash); >> >> This function sets the rss to dp_packet twice. You could avoid that by >> moving the 'dp_packet_set_rss_hash' call to the end of this function. >> >>> + } >>> } >>> return hash; >>> } >>> @@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, >>> bool smc_enable_db; >>> size_t map_cnt = 0; >>> bool batch_enable = true; >>> + bool is_5tuple_hash_needed; >>> >>> atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db); >>> pmd_perf_update_counter(&pmd->perf_stats, >>> @@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd, >>> } >>> } >>> >>> - miniflow_extract(packet, &key->mf); >>> - key->len = 0; /* Not computed yet. */ >>> /* If EMC and SMC disabled skip hash computation */ >>> if (smc_enable_db == true || cur_min != 0) { >>> - if (!md_is_valid) { >>> - key->hash = >>> dpif_netdev_packet_get_rss_hash_orig_pkt(packet, >>> - &key->mf); >>> + if (OVS_LIKELY(dp_packet_rss_valid(packet))) { >> >> I'd not put the 'LIKELY' macro here, because all the packets from >> virtual ports has no RSS and this is the common case. >> >>> + is_5tuple_hash_needed = false; >>> + key->hash = >>> + >>> + dpif_netdev_packet_get_packet_rss_hash(packet,md_is_valid); >> >> Space. >> >>> + if (cur_min) { >>> + ovs_prefetch_range( >>> + &cache->emc_cache.entries[key->hash & >>> EM_FLOW_HASH_MASK], >>> + DEFAULT_EMC_PREFETCH_SIZE); >>> + } >>> } else { >>> - key->hash = dpif_netdev_packet_get_rss_hash(packet, >>> &key->mf); >>> + is_5tuple_hash_needed = true; >>> } >>> + } else { >>> + is_5tuple_hash_needed = false; >>> + } >>> + >>> + miniflow_extract(packet, &key->mf); >>> + key->len = 0; /* Not computed yet. */ >>> + >>> + /* If 5tuple hash is needed */ >> >> This comment is not that useful. IMHO, the variable is self-documenting. >> >>> + if (is_5tuple_hash_needed) { >>> + key->hash = dpif_netdev_packet_get_hash_5tuple(packet, >>> + &key->mf, >>> + >>> + md_is_valid); >>> } >>> if (cur_min) { >>> flow = emc_lookup(&cache->emc_cache, key); >>> -- >>> 2.7.4 >>> >>> >>> >> >> _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev