Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

Ilya Maximets Mon, 01 Apr 2019 00:15:11 -0700

On 01.04.2019 9:52, Yanqin Wei (Arm Technology China) wrote:
> Hi Ilya,
> 
> Really appreciate your time in doing benchmarking for this patch. And many 
> thanks for your valuable comments.
> It is quite possible that the second unneeded cache line prefetching causes 
> performance drop in case of big number of flows (64K - 256K).
> This impact the real world use case, I will try to improve it . For 1-8 flows 
> cases, it should be not very important because of the lack of actual 
> deployment.


I'm not actually sure which of these cases is more real.
We're suggesting to disable EMC for cases where it's not effective.
>From the other side, if VM encapsulates all the incoming traffic and sends
it for further processing, there will be only few outcoming flows.

> 
> Best Regards,
> Wei Yanqin
> 
> -----Original Message-----
> From: Ilya Maximets <i.maxim...@samsung.com> 
> Sent: Monday, April 1, 2019 2:16 PM
> To: Yanqin Wei (Arm Technology China) <yanqin....@arm.com>; 
> d...@openvswitch.org; ian.sto...@intel.com
> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) <gavin...@arm.com>
> Subject: Re: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
> prefetching EMC entry.
> 
> On 29.03.2019 20:33, Ilya Maximets wrote:
>> Hi.
>> I made few tests on PVP with bonded PHY setup and found no significant 
>> difference in maximum performance with low and medium number of flows
>> (8 - 8192).
> 
> Correction: I see the a slight performance drop (~1%) for the cases with very 
> low number of flows (1-8), and a performance increase (~ 2-3%) for the medium 
> low number of flows (64 - 4096).
> In this scenario packets from VM has already calculated hash on 
> recirculation, so the prefetching optimization doesn't work for the first 
> pass through the datapath, but works for the second.
> Degradation on single or very low number of flows probably caused by 
> additional checks and instructions for prefetching the memory that is in 
> cache already.
> 
>> In case of big number of flows (64K - 256K) I see performance drop in 
>> about 2-3%. I think that because of prefetching of a second cacheline, 
>> which is unneeded because current_entry->key.hash likely != key->hash 
>> and we don't need to compare miniflows while emc_lookup. Changing the 
>> code to prefetch the first cacheline only decreases the drop to ~1% 
>> (However, this increases the consumed processing cycles for medium 
>> numbers of flows described below)
> 
> In general, this patch decreases the performance for cases where EMC is not 
> efficient and improves it for cases where EMC is efficient, except the cases 
> with very low numbers of flows, which could be the main concern.
> 
>>
>> OTOH, I see a slight decrease (~1%) of consumed cycles per packet for 
>> the thread that polls HW NICs and send packets to VM, which is good.
>> This improvement observed for a medium small number of flows: 512 - 8192.
>> For a low (1 - 256) and high (64K - 256K) numbers of flows value of 
>> consumed processing cycles per packet fro this thread was not affected by 
>> the patch.
>>
>> Tests made with average 512B packets, EMC enabled, SMC disabled, TX 
>> flushing interval 50us. Note that the bottleneck in this case is the 
>> VM --> bonded PHY part which is the case of 5tuple hash calculation.
>>
>> See review comments inline.
>>
>> Best regards, Ilya Maximets.
>>
>> On 22.03.2019 11:44, Yanqin Wei (Arm Technology China) wrote:
>>> Hi , OVS Maintainers,
>>>
>>> Could you help to have a look at this patch? Thanks a lot.
>>>
>>> Best Regards,
>>> Wei Yanqin
>>>
>>> -----Original Message-----
>>> From: Yanqin Wei <yanqin....@arm.com>
>>> Sent: Wednesday, March 13, 2019 1:28 PM
>>> To: d...@openvswitch.org
>>> Cc: nd <n...@arm.com>; Gavin Hu (Arm Technology China) 
>>> <gavin...@arm.com>; Yanqin Wei (Arm Technology China) 
>>> <yanqin....@arm.com>
>>> Subject: [ovs-dev][PATCH v3] dpif-netdev: dfc_process optimization by 
>>> prefetching EMC entry.
>>>
>>> It is observed that the throughput of multi-flow is worse than single-flow 
>>> in the EMC NIC to NIC cases. It is because CPU cache-miss increasing in EMC 
>>> lookup. Each flow need load at least one EMC entry to CPU cache(several 
>>> cache lines) and compare it with packet miniflow.
>>> This patch improve it by prefetching EMC entry in advance. Hash value 
>>> can be obtained from dpdk rss hash, so this step can be advanced 
>>> ahead of
>>> miniflow_extract() and prefetch EMC entry there. The prefetching size is 
>>> defined as ROUND_UP(128,CACHE_LINE_SIZE), which can cover majority traffic 
>>> including TCP/UDP protocol and need 2 cache lines in most modern CPU.
>>> Performance test was run in some arm platform. 1000/10000 flows NIC2NIC 
>>> test achieved around 10% throughput improvement in thunderX2(aarch64 
>>> platform).
>>>
>>> Signed-off-by: Yanqin Wei <yanqin....@arm.com>
>>> Reviewed-by: Gavin Hu <gavin...@arm.com>
>>> ---
>>>  lib/dpif-netdev.c | 80 
>>> ++++++++++++++++++++++++++++++++++++-------------------
>>>  1 file changed, 52 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c index 
>>> 4d6d0c3..982082c 100644
>>> --- a/lib/dpif-netdev.c
>>> +++ b/lib/dpif-netdev.c
>>> @@ -189,6 +189,10 @@ struct netdev_flow_key {
>>>  #define DEFAULT_EM_FLOW_INSERT_MIN (UINT32_MAX /                     \
>>>                                      DEFAULT_EM_FLOW_INSERT_INV_PROB)
>>>  
>>> +/* DEFAULT_EMC_PREFETCH_SIZE can cover majority traffic including 
>>> +TCP/UDP
>>> + * protocol. */
>>> +#define DEFAULT_EMC_PREFETCH_SIZE ROUND_UP(128,CACHE_LINE_SIZE)
>>
>> Space after the ',' needed.
>>
>>> +
>>>  struct emc_entry {
>>>      struct dp_netdev_flow *flow;
>>>      struct netdev_flow_key key;   /* key.hash used for emc hash value. */
>>> @@ -6166,15 +6170,20 @@ dp_netdev_upcall(struct dp_netdev_pmd_thread 
>>> *pmd, struct dp_packet *packet_,  }
>>>  
>>>  static inline uint32_t
>>> -dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,
>>> -                                const struct miniflow *mf)
>>> +dpif_netdev_packet_get_packet_rss_hash(struct dp_packet *packet,
>>> +                                bool md_is_valid)
>>
>> 1. 'packet' word appears twice in the function name.
>> How about 'dpif_netdev_packet_get_rss_hash' and 
>> 'dpif_netdev_packet_get_5tuple_hash' for the second function?
>>
>> 2.  Please align the second line to the level of next char after '('.
>>
>> 3. 'md_is_valid' is meaningless name inside the function.
>>     What about renaming to 'bool account_recirc_id'?
>>
>>>  {
>>> -    uint32_t hash;
>>> +    uint32_t hash,recirc_depth;
>>
>> missing space.
>>
>>>  
>>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>> -        hash = dp_packet_get_rss_hash(packet);
>>> -    } else {
>>> -        hash = miniflow_hash_5tuple(mf, 0);
>>> +    hash = dp_packet_get_rss_hash(packet);
>>> +
>>> +    if (md_is_valid) {
>>> +        /* The RSS hash must account for the recirculation depth to avoid
>>> +         * collisions in the exact match cache */
>>> +        recirc_depth = *recirc_depth_get_unsafe();
>>> +        if (OVS_UNLIKELY(recirc_depth)) {
>>
>> We're in recirculation if md is valid. So, 'OVS_UNLIKELY' is wrong here.
>> I think that we don't need to check the value at all.
>> 'OVS_UNLIKELY' should be moved to the upper 'if'.
>>
>>> +            hash = hash_finish(hash, recirc_depth);
>>> +        }
>>>          dp_packet_set_rss_hash(packet, hash);
>>
>> This could be moved inside the 'if', because we need to update rss 
>> hash only if it was changed.
>>
>>>      }
>>>  
>>> @@ -6182,24 +6191,23 @@ 
>>> dpif_netdev_packet_get_rss_hash_orig_pkt(struct dp_packet *packet,  }
>>>  
>>>  static inline uint32_t
>>> -dpif_netdev_packet_get_rss_hash(struct dp_packet *packet,
>>> -                                const struct miniflow *mf)
>>> +dpif_netdev_packet_get_hash_5tuple(struct dp_packet *packet,
>>> +                                const struct miniflow *mf,
>>> +                                bool md_is_valid)
>>
>> Alignments.
>>
>>>  {
>>> -    uint32_t hash, recirc_depth;
>>> +    uint32_t hash,recirc_depth;
>>
>> Space.
>>
>>>  
>>> -    if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>> -        hash = dp_packet_get_rss_hash(packet);
>>> -    } else {
>>> -        hash = miniflow_hash_5tuple(mf, 0);
>>> -        dp_packet_set_rss_hash(packet, hash);
>>> -    }
>>> +    hash = miniflow_hash_5tuple(mf, 0);
>>> +    dp_packet_set_rss_hash(packet, hash);
>>>  
>>> -    /* The RSS hash must account for the recirculation depth to avoid
>>> -     * collisions in the exact match cache */
>>> -    recirc_depth = *recirc_depth_get_unsafe();
>>> -    if (OVS_UNLIKELY(recirc_depth)) {
>>> -        hash = hash_finish(hash, recirc_depth);
>>> -        dp_packet_set_rss_hash(packet, hash);
>>> +    if (md_is_valid) {
>>> +        /* The RSS hash must account for the recirculation depth to avoid
>>> +         * collisions in the exact match cache */
>>> +        recirc_depth = *recirc_depth_get_unsafe();
>>> +        if (OVS_UNLIKELY(recirc_depth)) {
>>
>> Same as previous.
>>
>>> +            hash = hash_finish(hash, recirc_depth);
>>> +            dp_packet_set_rss_hash(packet, hash);
>>
>> This function sets the rss to dp_packet twice. You could avoid that by 
>> moving the 'dp_packet_set_rss_hash' call to the end of this function.
>>
>>> +        }
>>>      }
>>>      return hash;
>>>  }
>>> @@ -6390,6 +6398,7 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>      bool smc_enable_db;
>>>      size_t map_cnt = 0;
>>>      bool batch_enable = true;
>>> +    bool is_5tuple_hash_needed;
>>>  
>>>      atomic_read_relaxed(&pmd->dp->smc_enable_db, &smc_enable_db);
>>>      pmd_perf_update_counter(&pmd->perf_stats,
>>> @@ -6436,16 +6445,31 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>>              }
>>>          }
>>>  
>>> -        miniflow_extract(packet, &key->mf);
>>> -        key->len = 0; /* Not computed yet. */
>>>          /* If EMC and SMC disabled skip hash computation */
>>>          if (smc_enable_db == true || cur_min != 0) {
>>> -            if (!md_is_valid) {
>>> -                key->hash = 
>>> dpif_netdev_packet_get_rss_hash_orig_pkt(packet,
>>> -                        &key->mf);
>>> +            if (OVS_LIKELY(dp_packet_rss_valid(packet))) {
>>
>> I'd not put the 'LIKELY' macro here, because all the packets from 
>> virtual ports has no RSS and this is the common case.
>>
>>> +                is_5tuple_hash_needed = false;
>>> +                key->hash =
>>> +                   
>>> + dpif_netdev_packet_get_packet_rss_hash(packet,md_is_valid);
>>
>> Space.
>>
>>> +                if (cur_min) {
>>> +                    ovs_prefetch_range(
>>> +                      &cache->emc_cache.entries[key->hash & 
>>> EM_FLOW_HASH_MASK],
>>> +                      DEFAULT_EMC_PREFETCH_SIZE);
>>> +                }
>>>              } else {
>>> -                key->hash = dpif_netdev_packet_get_rss_hash(packet, 
>>> &key->mf);
>>> +                is_5tuple_hash_needed = true;
>>>              }
>>> +        } else {
>>> +            is_5tuple_hash_needed = false;
>>> +        }
>>> +
>>> +        miniflow_extract(packet, &key->mf);
>>> +        key->len = 0; /* Not computed yet. */
>>> +
>>> +        /* If 5tuple hash is needed */
>>
>> This comment is not that useful. IMHO, the variable is self-documenting.
>>
>>> +        if (is_5tuple_hash_needed) {
>>> +            key->hash = dpif_netdev_packet_get_hash_5tuple(packet, 
>>> + &key->mf,
>>> +                                                           
>>> + md_is_valid);
>>>          }
>>>          if (cur_min) {
>>>              flow = emc_lookup(&cache->emc_cache, key);
>>> --
>>> 2.7.4
>>>
>>>
>>>
>>
>>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH v3] dpif-netdev: dfc_process optimization by prefetching EMC entry.

Reply via email to