On 01.04.2019 10:35, Yanqin Wei (Arm Technology China) wrote:
> So the main concern is the performance drop for 1~8 flows, because the EMC
> is effective here and would be enabled.
> On the other hand, EMC prefetching does not take effect when the EMC is
> disabled, and in the case of a large number of flows the EMC is likely to
> be disabled. So the performance drop here can be avoided by configuration.
> Is my understanding correct?
I think so. Ian, what is your opinion here?

>
> Best Regards,
> Wei Yanqin
>
> -----Original Message-----
> From: Ilya Maximets <i.maxim...@samsung.com>
> Sent: Monday, April 1, 2019 3:15 PM
> To: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>;
> d...@openvswitch.org; ian.sto...@intel.com
> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) <gavin...@arm.com>
> Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by
> prefetching EMC entry.
>
> On 01.04.2019 9:52, Yanqin Wei (Arm Technology China) wrote:
>> Hi Ilya,
>>
>> Really appreciate your time in benchmarking this patch, and many thanks
>> for your valuable comments.
>> It is quite possible that prefetching the second, unneeded cache line
>> causes the performance drop in the case of a big number of flows
>> (64K - 256K). This impacts the real-world use case, so I will try to
>> improve it. The 1-8 flow cases should be less important because of the
>> lack of actual deployment.
>
> I'm not actually sure which of these cases is more real.
> We're suggesting to disable the EMC for cases where it's not effective.
> On the other hand, if a VM encapsulates all the incoming traffic and sends
> it on for further processing, there will be only a few outgoing flows.
>
>>
>> Best Regards,
>> Wei Yanqin
>>
>> -----Original Message-----
>> From: Ilya Maximets <i.maxim...@samsung.com>
>> Sent: Monday, April 1, 2019 2:16 PM
>> To: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>;
>> d...@openvswitch.org; ian.sto...@intel.com
>> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) <gavin...@arm.com>
>> Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by
>> prefetching EMC entry.
>>
>> On 29.03.2019 20:33, Ilya Maximets wrote:
>>> Hi.
>>> I made a few tests on a PVP setup with bonded PHY and found no
>>> significant difference in maximum performance with low and medium
>>> numbers of flows (8 - 8192).
>>
>> Correction: I see a slight performance drop (~1%) for the cases with a
>> very low number of flows (1 - 8), and a performance increase (~2-3%) for a
>> medium-low number of flows (64 - 4096).
>> In this scenario packets from the VM already have a hash calculated on
>> recirculation, so the prefetching optimization doesn't work for the first
>> pass through the datapath, but works for the second.
>> The degradation with a single or very low number of flows is probably
>> caused by the additional checks and instructions for prefetching memory
>> that is already in cache.
>>
>>> In the case of a big number of flows (64K - 256K) I see a performance
>>> drop of about 2-3%. I think that's because of the prefetch of the second
>>> cacheline, which is unneeded because current_entry->key.hash is likely
>>> != key->hash and we don't need to compare miniflows in emc_lookup().
>>> Changing the code to prefetch only the first cacheline decreases the
>>> drop to ~1%. (However, this increases the consumed processing cycles
>>> for the medium numbers of flows described below.)
>>
>> In general, this patch decreases performance for cases where the EMC is
>> not efficient and improves it for cases where the EMC is efficient,
>> except for the cases with very low numbers of flows, which could be the
>> main concern.
>>
>>>
>>> OTOH, I see a slight decrease (~1%) in consumed cycles per packet for
>>> the thread that polls the HW NICs and sends packets to the VM, which is
>>> good. This improvement was observed for a medium-small number of flows:
>>> 512 - 8192. For low (1 - 256) and high (64K - 256K) numbers of flows,
>>> the consumed processing cycles per packet for this thread were not
>>> affected by the patch.
>>>
>>> Tests were made with average 512B packets, EMC enabled, SMC disabled,
>>> and a TX flushing interval of 50us. Note that the bottleneck in this
>>> case is the VM --> bonded PHY part, which is the case where the 5-tuple
>>> hash is calculated.
>>>
>>> See review comments inline.
>>>
>>> Best regards, Ilya Maximets.
>>>
>>> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
>>>> Hi, OVS Maintainers,
>>>>
>>>> Could you help to have a look at this patch? Thanks a lot.
>>>>
>>>> Best Regards,
>>>> Wei Yanqin
>>>>
>>>> -----Original Message-----
>>>> From: Yanqin Wei <yanqin....@arm.com>
>>>> Sent: Wednesday, March 13, 2019 1:28 PM
>>>> To: d...@openvswitch.org
>>>> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China)
>>>> <gavin...@arm.com>; Yanqin Wei (Arm Technology China)
>>>> <yanqin....@arm.com>
>>>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by
>>>> prefetching EMC entry.
>>>>
>>>> It is observed that multi-flow throughput is worse than single-flow
>>>> throughput in the EMC NIC-to-NIC cases. This is because CPU cache
>>>> misses increase in the EMC lookup: each flow needs to load at least one
>>>> EMC entry (several cache lines) into the CPU cache and compare it with
>>>> the packet miniflow.
>>>> This patch improves that by prefetching the EMC entry in advance. The
>>>> hash value can be obtained from the DPDK RSS hash, so this step can be
>>>> moved ahead of miniflow_extract() and the EMC entry prefetched there.
>>>> The prefetch size is defined as ROUND_UP(128, CACHE_LINE_SIZE), which
>>>> covers the majority of traffic, including the TCP/UDP protocols, and
>>>> needs 2 cache lines on most modern CPUs.
>>>> Performance tests were run on some Arm platforms. 1000/10000-flow
>>>> NIC-to-NIC tests achieved around 10% throughput improvement on
>>>> ThunderX2 (an aarch64 platform).
>>>>
>>>> Signed-off-by: Yanqin Wei <yanqin....@arm.com>
>>>> Reviewed-by: Gavin Hu <gavin...@arm.com>
>>>> ---
>>>>  lib/dpif-netdev.c | 80 ++++++++++++++++++++++++++++++++++++-------------------
>>>>  1 file changed, 52 insertions(+), 28 deletions(-)
>>>>
>>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>>> index 4d6d0c3..982082c 100644
>>>> --- a/lib/dpif-netdev.c
>>>> +++ b/lib/dpif-netdev.c
>>>> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>>>>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
>>>>                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
>>>>
>>>> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including TCP/UDP
>>>> + * protocol. */
>>>> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
>>>
>>> Space after the ',' needed.
>>>
>>>> +
>>>>  struct emc_entry {
>>>>      struct dp_netdev_flow *flow;
>>>>      struct netdev_flow_key key;   /* key.hash used for emc hash value. */
>>>> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet_,
>>>>  }
>>>>
>>>>  static inline uint32_t
>>>> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
>>>> -                                         const struct miniflow *mf)
>>>> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
>>>> +                                bool md_is_valid)
>>>
>>> 1. The word 'packet' appears twice in the function name.
>>>    How about 'dpif_netdev_packet_get_rss_hash' and
>>>    'dpif_netdev_packet_get_5tuple_hash' for the second function?
>>>
>>> 2. Please align the second line to the level of the next char after '('.
>>>
>>> 3. 'md_is_valid' is a meaningless name inside the function.
>>>    What about renaming it to 'bool account_recirc_id'?
>>>
>>>>  {
>>>> -    uint32_t hash;
>>>> +    uint32_t hash,recirc_depth;
>>>
>>> Missing space.
>>>
>>>>
>>>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>>> -        hash = dp_packet_get_rss_hash(packet);
>>>> -    } else {
>>>> -        hash = miniflow_hash_5tuple(mf, 0);
>>>> +    hash = dp_packet_get_rss_hash(packet);
>>>> +
>>>> +    if (md_is_valid) {
>>>> +        /* The RSS hash must account for the recirculation depth to avoid
>>>> +         * collisions in the exact match cache */
>>>> +        recirc_depth = *recirc_depth_get_unsafe();
>>>> +        if (OVS_UNLIKELY(recirc_depth)) {
>>>
>>> We're in recirculation if md is valid. So, 'OVS_UNLIKELY' is wrong here.
>>> I think that we don't need to check the value at all.
>>> 'OVS_UNLIKELY' should be moved to the upper 'if'.
>>>
>>>> +            hash = hash_finish(hash, recirc_depth);
>>>> +        }
>>>>          dp_packet_set_rss_hash(packet, hash);
>>>
>>> This could be moved inside the 'if', because we need to update the rss
>>> hash only if it was changed.
>>>
>>>>      }
>>>>
>>>> @@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
>>>>  }
>>>>
>>>>  static inline uint32_t
>>>> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>>>> -                                const struct miniflow *mf)
>>>> +dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
>>>> +                                const struct miniflow *mf,
>>>> +                                bool md_is_valid)
>>>
>>> Alignments.
>>>
>>>>  {
>>>> -    uint32_t hash, recirc_depth;
>>>> +    uint32_t hash,recirc_depth;
>>>
>>> Space.
>>>
>>>>
>>>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>>> -        hash = dp_packet_get_rss_hash(packet);
>>>> -    } else {
>>>> -        hash = miniflow_hash_5tuple(mf, 0);
>>>> -        dp_packet_set_rss_hash(packet, hash);
>>>> -    }
>>>> +    hash = miniflow_hash_5tuple(mf, 0);
>>>> +    dp_packet_set_rss_hash(packet, hash);
>>>>
>>>> -    /* The RSS hash must account for the recirculation depth to avoid
>>>> -     * collisions in the exact match cache */
>>>> -    recirc_depth = *recirc_depth_get_unsafe();
>>>> -    if (OVS_UNLIKELY(recirc_depth)) {
>>>> -        hash = hash_finish(hash, recirc_depth);
>>>> -        dp_packet_set_rss_hash(packet, hash);
>>>> +    if (md_is_valid) {
>>>> +        /* The RSS hash must account for the recirculation depth to avoid
>>>> +         * collisions in the exact match cache */
>>>> +        recirc_depth = *recirc_depth_get_unsafe();
>>>> +        if (OVS_UNLIKELY(recirc_depth)) {
>>>
>>> Same as previous.
>>>
>>>> +            hash = hash_finish(hash, recirc_depth);
>>>> +            dp_packet_set_rss_hash(packet, hash);
>>>
>>> This function sets the rss hash on the dp_packet twice. You could avoid
>>> that by moving the 'dp_packet_set_rss_hash' call to the end of this
>>> function.
>>>
>>>> +        }
>>>>      }
>>>>      return hash;
>>>>  }
>>>> @@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>>      bool smc_enable_db;
>>>>      size_t map_cnt = 0;
>>>>      bool batch_enable = true;
>>>> +    bool is_5tuple_hash_needed;
>>>>
>>>>      atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>>>>      pmd_perf_update_counter(&pmd->perf_stats,
>>>> @@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>>          }
>>>>      }
>>>>
>>>> -    miniflow_extract(packet, &key->mf);
>>>> -    key->len = 0; /* Not computed yet. */
>>>>      /* If EMC and SMC disabled skip hash computation */
>>>>      if (smc_enable_db == true || cur_min != 0) {
>>>> -        if (!md_is_valid) {
>>>> -            key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>>>> -                                                                 &key->mf);
>>>> +        if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>>
>>> I'd not put the 'LIKELY' macro here, because all the packets from
>>> virtual ports have no RSS hash and this is a common case.
>>>
>>>> +            is_5tuple_hash_needed = false;
>>>> +            key->hash =
>>>> +                dpif_netdev_packet_get_packet_rss_hash(packet,md_is_valid);
>>>
>>> Space.
>>>
>>>> +            if (cur_min) {
>>>> +                ovs_prefetch_range(
>>>> +                    &cache->emc_cache.entries[key->hash & EM_FLOW_HASH_MASK],
>>>> +                    DEFAULT_EMC_PREFETCH_SIZE);
>>>> +            }
>>>>          } else {
>>>> -            key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>>>> +            is_5tuple_hash_needed = true;
>>>>          }
>>>> +    } else {
>>>> +        is_5tuple_hash_needed = false;
>>>> +    }
>>>> +
>>>> +    miniflow_extract(packet, &key->mf);
>>>> +    key->len = 0; /* Not computed yet. */
>>>> +
>>>> +    /* If 5tuple hash is needed */
>>>
>>> This comment is not that useful. IMHO, the variable is self-documenting.
>>>
>>>> +    if (is_5tuple_hash_needed) {
>>>> +        key->hash = dpif_netdev_packet_get_hash_5tuple(packet,
>>>> +                                                       &key->mf,
>>>> +                                                       md_is_valid);
>>>>      }
>>>>      if (cur_min) {
>>>>          flow = emc_lookup(&cache->emc_cache, key);
>>>> --
>>>> 2.7.4
>>>
>>> _______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev