Hi Ilya,

Really appreciate the time you spent benchmarking this patch, and many thanks
for your valuable comments.
It is quite possible that prefetching the unneeded second cache line causes the
performance drop with large numbers of flows (64K - 256K). Since that affects
real-world use cases, I will try to improve it; a first idea is sketched below.
For the 1-8 flow cases the drop should matter less, because such low flow
counts are rare in actual deployments.
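
To illustrate what I have in mind (an untested sketch against the
dfc_processing() hunk of this patch, reusing its names; not a final change):
prefetch only the first cache line of the candidate EMC entry, which already
contains key.hash, instead of the full DEFAULT_EMC_PREFETCH_SIZE range:

    /* Sketch only: pull in just the cache line holding key.hash, so the
     * miniflow part of the entry is not prefetched when the hash check in
     * emc_lookup() is expected to fail anyway. */
    if (cur_min) {
        ovs_prefetch_range(
            &cache->emc_cache.entries[key->hash & EM_FLOW_HASH_MASK],
            CACHE_LINE_SIZE);
    }

I still need to measure whether the single-line prefetch keeps the gain for
the hash-hit cases.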

Best Regards,
Wei Yanqin

-----Original Message-----
From: Ilya Maximets <i.maxim...@samsung.com> 
Sent: Monday, April 1, 2019 2:16 PM
To: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; 
d...@openvswitch.org; ian.sto...@intel.com
Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) <gavin...@arm.com>
Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
prefetching EMC entry.

On 29.03.2019 20:33, Ilya Maximets wrote:
> Hi.
> I made a few tests on PVP with a bonded PHY setup and found no significant
> difference in maximum performance with low and medium numbers of flows
> (8 - 8192).

Correction: I see a slight performance drop (~1%) for the cases with a very
low number of flows (1-8), and a performance increase (~2-3%) for medium-low
numbers of flows (64 - 4096).
In this scenario packets from the VM already have a calculated hash on
recirculation, so the prefetching optimization doesn't work for the first pass
through the datapath, but does work for the second.
The degradation with a single or very low number of flows is probably caused
by the additional checks and instructions for prefetching memory that is
already in the cache.

> In case of a big number of flows (64K - 256K) I see a performance drop of
> about 2-3%. I think that is because of prefetching the second cacheline,
> which is unneeded because current_entry->key.hash likely != key->hash
> and we don't need to compare miniflows during emc_lookup. Changing the
> code to prefetch only the first cacheline decreases the drop to ~1%
> (however, this increases the consumed processing cycles for the medium
> numbers of flows described below).

In general, this patch decreases the performance for cases where EMC is not 
efficient and improves it for cases where EMC is efficient, except the cases 
with very low numbers of flows, which could be the main concern.

> 
> OTOH, I see a slight decrease (~1%) in consumed cycles per packet for
> the thread that polls HW NICs and sends packets to the VM, which is good.
> This improvement is observed for a medium-small number of flows: 512 - 8192.
> For low (1 - 256) and high (64K - 256K) numbers of flows, the consumed
> processing cycles per packet for this thread were not affected by the
> patch.
> 
> Tests were made with average 512B packets, EMC enabled, SMC disabled, and a
> TX flushing interval of 50us. Note that the bottleneck in this case is the
> VM --> bonded PHY part, which is the case where the 5-tuple hash is calculated.
> 
> See review comments inline.
> 
> Best regards, Ilya Maximets.
> 
> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
>> Hi , OVS Maintainers,
>>
>> Could you help to have a look at this patch? Thanks a lot.
>>
>> Best Regards,
>> Wei Yanqin
>>
>> -----Original Message-----
>> From: Yanqin Wei <yanqin....@arm.com>
>> Sent: Wednesday, March 13, 2019 1:28 PM
>> To: d...@openvswitch.org
>> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) 
>> <gavin...@arm.com>; Yanqin Wei (Arm Technology China) 
>> <yanqin....@arm.com>
>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
>> prefetching EMC entry.
>>
>> It is observed that multi-flow throughput is worse than single-flow
>> throughput in the EMC NIC-to-NIC cases. This is because of increasing CPU
>> cache misses in the EMC lookup: each flow needs to load at least one EMC
>> entry into the CPU cache (several cache lines) and compare it with the
>> packet miniflow.
>> This patch improves that by prefetching the EMC entry in advance. The hash
>> value can be obtained from the DPDK RSS hash, so this step can be moved
>> ahead of miniflow_extract() and the EMC entry prefetched there. The
>> prefetching size is defined as ROUND_UP(128,CACHE_LINE_SIZE), which can
>> cover the majority of traffic, including the TCP/UDP protocols, and needs
>> 2 cache lines on most modern CPUs.
>> Performance tests were run on an Arm platform. A 1000/10000 flow NIC2NIC
>> test achieved around 10% throughput improvement on ThunderX2 (an aarch64
>> platform).
>>
>> Signed-off-by: Yanqin Wei <yanqin....@arm.com>
>> Reviewed-by: Gavin Hu <gavin...@arm.com>
>> ---
>>  lib/dpif-netdev.c | 80 ++++++++++++++++++++++++++++++++++++-------------------
>>  1 file changed, 52 insertions(+), 28 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 4d6d0c3..982082c 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
>>                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
>>  
>> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including 
>> +TCP/UDP
>> + * protocol. */
>> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
> 
> Space after the ',' needed.
> 
>> +
>>  struct emc_entry {
>>      struct dp_netdev_flow *flow;
>>      struct netdev_flow_key key;   /* key.hash used for emc hash value. */
>> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread *pmd, struct dp_packet *packet_,
>>  }
>>  
>>  static inline uint32_t
>> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
>> -                                const struct miniflow *mf)
>> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
>> +                                bool md_is_valid)
> 
> 1. The word 'packet' appears twice in the function name.
> How about 'dpif_netdev_packet_get_rss_hash' and 
> 'dpif_netdev_packet_get_5tuple_hash' for the second function?
> 
> 2.  Please align the second line to the level of the next character after '('.
> 
> 3. 'md_is_valid' is a meaningless name inside the function.
>     What about renaming it to 'bool account_recirc_id'?
> 
>>  {
>> -    uint32_t hash;
>> +    uint32_t hash,recirc_depth;
> 
> missing space.
> 
>>  
>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>> -        hash = dp_packet_get_rss_hash(packet);
>> -    } else {
>> -        hash = miniflow_hash_5tuple(mf, 0);
>> +    hash = dp_packet_get_rss_hash(packet);
>> +
>> +    if (md_is_valid) {
>> +        /* The RSS hash must account for the recirculation depth to avoid
>> +         * collisions in the exact match cache */
>> +        recirc_depth = *recirc_depth_get_unsafe();
>> +        if (OVS_UNLIKELY(recirc_depth)) {
> 
> We're in recirculation if md is valid. So, 'OVS_UNLIKELY' is wrong here.
> I think that we don't need to check the value at all.
> 'OVS_UNLIKELY' should be moved to the upper 'if'.
> 
>> +            hash = hash_finish(hash, recirc_depth);
>> +        }
>>          dp_packet_set_rss_hash(packet, hash);
> 
> This could be moved inside the 'if', because we need to update the rss
> hash only if it was changed.
> 
>>      }
>>  
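
Applying the comments above (the rename, the 'account_recirc_id' parameter,
and moving both OVS_UNLIKELY and the dp_packet_set_rss_hash() call), this
helper would end up roughly as follows; this is only an untested sketch of
the suggestions, not a required form:

    static inline uint32_t
    dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
                                    bool account_recirc_id)
    {
        uint32_t hash = dp_packet_get_rss_hash(packet);

        if (OVS_UNLIKELY(account_recirc_id)) {
            /* The RSS hash must account for the recirculation depth to
             * avoid collisions in the exact match cache. */
            hash = hash_finish(hash, *recirc_depth_get_unsafe());
            dp_packet_set_rss_hash(packet, hash);
        }
        return hash;
    }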
>> @@ -6182,24 +6191,23 @@ dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
>>  }
>>  
>>  static inline uint32_t
>> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>> -                                const struct miniflow *mf)
>> +dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
>> +                                const struct miniflow *mf,
>> +                                bool md_is_valid)
> 
> Alignments.
> 
>>  {
>> -    uint32_t hash, recirc_depth;
>> +    uint32_t hash,recirc_depth;
> 
> Space.
> 
>>  
>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>> -        hash = dp_packet_get_rss_hash(packet);
>> -    } else {
>> -        hash = miniflow_hash_5tuple(mf, 0);
>> -        dp_packet_set_rss_hash(packet, hash);
>> -    }
>> +    hash = miniflow_hash_5tuple(mf, 0);
>> +    dp_packet_set_rss_hash(packet, hash);
>>  
>> -    /* The RSS hash must account for the recirculation depth to avoid
>> -     * collisions in the exact match cache */
>> -    recirc_depth = *recirc_depth_get_unsafe();
>> -    if (OVS_UNLIKELY(recirc_depth)) {
>> -        hash = hash_finish(hash, recirc_depth);
>> -        dp_packet_set_rss_hash(packet, hash);
>> +    if (md_is_valid) {
>> +        /* The RSS hash must account for the recirculation depth to avoid
>> +         * collisions in the exact match cache */
>> +        recirc_depth = *recirc_depth_get_unsafe();
>> +        if (OVS_UNLIKELY(recirc_depth)) {
> 
> Same as previous.
> 
>> +            hash = hash_finish(hash, recirc_depth);
>> +            dp_packet_set_rss_hash(packet, hash);
> 
> This function sets the RSS hash on the dp_packet twice. You could avoid that
> by moving the 'dp_packet_set_rss_hash' call to the end of this function.
> 
>> +        }
>>      }
>>      return hash;
>>  }
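
Same idea for this helper: with OVS_UNLIKELY moved to the outer check and a
single dp_packet_set_rss_hash() call at the end, it could look roughly like
this (again just an untested sketch of the suggestions above):

    static inline uint32_t
    dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
                                       const struct miniflow *mf,
                                       bool account_recirc_id)
    {
        uint32_t hash = miniflow_hash_5tuple(mf, 0);

        if (OVS_UNLIKELY(account_recirc_id)) {
            /* The RSS hash must account for the recirculation depth to
             * avoid collisions in the exact match cache. */
            hash = hash_finish(hash, *recirc_depth_get_unsafe());
        }
        /* Write the hash into the packet once, after all adjustments. */
        dp_packet_set_rss_hash(packet, hash);
        return hash;
    }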
>> @@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>      bool smc_enable_db;
>>      size_t map_cnt = 0;
>>      bool batch_enable = true;
>> +    bool is_5tuple_hash_needed;
>>  
>>      atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>>      pmd_perf_update_counter(&pmd->perf_stats,
>> @@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>              }
>>          }
>>  
>> -        miniflow_extract(packet, &key->mf);
>> -        key->len = 0; /* Not computed yet. */
>>          /* If EMC and SMC disabled skip hash computation */
>>          if (smc_enable_db == true || cur_min != 0) {
>> -            if (!md_is_valid) {
>> -                key->hash = dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>> -                        &key->mf);
>> +            if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
> 
> I'd not put the 'LIKELY' macro here, because all the packets from
> virtual ports have no RSS and this is the common case.
> 
>> +                is_5tuple_hash_needed = false;
>> +                key->hash =
>> +                    dpif_netdev_packet_get_packet_rss_hash(packet,md_is_valid);
> 
> Space.
> 
>> +                if (cur_min) {
>> +                    ovs_prefetch_range(
>> +                      &cache->emc_cache.entries[key->hash & EM_FLOW_HASH_MASK],
>> +                      DEFAULT_EMC_PREFETCH_SIZE);
>> +                }
>>              } else {
>> -                key->hash = dpif_netdev_packet_get_rss_hash(packet, &key->mf);
>> +                is_5tuple_hash_needed = true;
>>              }
>> +        } else {
>> +            is_5tuple_hash_needed = false;
>> +        }
>> +
>> +        miniflow_extract(packet, &key->mf);
>> +        key->len = 0; /* Not computed yet. */
>> +
>> +        /* If 5tuple hash is needed */
> 
> This comment is not that useful. IMHO, the variable is self-documenting.
> 
>> +        if (is_5tuple_hash_needed) {
>> +            key->hash = dpif_netdev_packet_get_hash_5tuple(packet, &key->mf,
>> +                                                           md_is_valid);
>>          }
>>          if (cur_min) {
>>              flow = emc_lookup(&cache->emc_cache, key);
>> --
>> 2.7.4
>>
>>
>>
> 
> 
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
