On 1/4/23 22:58, Ilya Maximets wrote:
> On 1/4/23 10:43, Eelco Chaudron wrote:
>>
>>
>> On 3 Jan 2023, at 21:03, Ilya Maximets wrote:
>>
>>> On 1/3/23 20:42, William Zhao wrote:
>>>> Hi Ilya and Eelco,
>>>>
>>>> I have seen udpif->dump_duration in revalidate_ukey() go as high as
>>>> ~2 seconds. But the dump duration is very random and sporadic, most
>>>> likely a function of how busy the CPU is.
>>>>
>>>> We have tried a max-revalidator time of 1 second and we saw
>>>> connectivity losses, presumably due to the flows being deleted in the
>>>> "else" case of the should_revalidate() call in the revalidate_ukey()
>>>> function.
>>>>
>>>> Taking the Nvidia MLX driver as an example:
>>>> https://elixir.bootlin.com/linux/v5.14/source/drivers/net/ethernet/mellanox/mlx5/core/fs_counters.c#L40
>>>> we can see that it polls for stats every 1 second, though the poll can
>>>> definitely be delayed. So hardware stats are updated at most once per
>>>> second for MLX NICs.
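>>>>
>>>> For illustration, a simplified model of such a driver-side counter
>>>> cache (a hypothetical sketch, not the actual mlx5 code;
>>>> monotonic_now_ms() and hw_read_packet_counter() are stand-ins for
>>>> driver internals):
>>>>
>>>>     #include <stdint.h>
>>>>
>>>>     #define POLL_INTERVAL_MS 1000
>>>>
>>>>     struct counter_cache {
>>>>         uint64_t packets;        /* Last value read from hardware. */
>>>>         long long last_poll_ms;  /* Time of the last hardware query. */
>>>>     };
>>>>
>>>>     /* Assumed helpers, not a real driver API. */
>>>>     extern long long monotonic_now_ms(void);
>>>>     extern uint64_t hw_read_packet_counter(void);
>>>>
>>>>     static uint64_t
>>>>     cached_packet_count(struct counter_cache *c)
>>>>     {
>>>>         long long now = monotonic_now_ms();
>>>>
>>>>         if (now - c->last_poll_ms >= POLL_INTERVAL_MS) {
>>>>             /* Refresh from hardware at most once per interval. */
>>>>             c->packets = hw_read_packet_counter();
>>>>             c->last_poll_ms = now;
>>>>         }
>>>>         /* Between refreshes, callers see statistics that can be up
>>>>          * to POLL_INTERVAL_MS (1 second) old. */
>>>>         return c->packets;
>>>>     }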
>>>
>>> What this tells me is that stats are not the problem here.
>>> The problem is a very long dump duration.  If we're talking about
>>> values above one second for a dump duration, we're very likely
>>> hitting the dynamic datapath flow limit.  This will affect the
>>> max-idle configuration or even trigger the 'kill_them_all' condition.
>>>
>>> How many datapath flows does that system have installed?  Does the
>>> 'function of how busy the CPU is' mean that something else is running
>>> on the same cores?
>>>
>>>
>>> And if it takes 2 seconds to just dump flows, setting max-revalidator
>>> to 4 seconds should not be a huge deal.
>>>
>>> A max-revalidator of 1 second is clearly not enough, as it should be
>>> at least twice as high as the expected dump duration.
>>
>> Hi Ilya,
>>
>> Statistics are definitely the problem in this case, and I verified that in
>> William's setup. They might have seen 1+ second dump times previously;
>> however, those were not responsible for the flow deletes. It was the
>> flow statistics not yet being synced.
>>
>> The part that calculates the pps rate has a flaw, as it assumes the
>> retrieved statistics are always accurate, which is not the case for
>> hw-offloaded flows. This is what this patch is trying to fix. So if we do
>> not like the static 2s delay, we could make it configurable, but we need a
>> fix for this specific problem.
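>>
>> For reference, a minimal standalone rendition of the check at issue
>> (simplified from should_revalidate(); pps_rate_ok() is a name made up
>> for this sketch, and min-revalidate-pps is at its default of 5, i.e. a
>> flow must average at least one packet per 200 ms to pass):
>>
>>     #include <stdbool.h>
>>     #include <stdint.h>
>>
>>     static int ofproto_min_revalidate_pps = 5;   /* Default value. */
>>
>>     static bool
>>     pps_rate_ok(long long int now, long long int used, uint64_t packets)
>>     {
>>         long long int duration = now - used; /* ms since stats moved. */
>>
>>         if (!packets) {
>>             return false;  /* No packets seen; flow looks idle. */
>>         }
>>
>>         /* If the hardware has not synced its stats yet, 'used' and
>>          * 'packets' are stale: 'duration' keeps growing and a busy
>>          * offloaded flow looks idle, failing this test. */
>>         return duration / packets < 1000 / ofproto_min_revalidate_pps;
>>     }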
>>
>> Right now, a small spike of more than 1/2 max-revalidator could cause all
>> HW-offloaded flows to be removed. And although we could change the
>> max-revalidator to 4 seconds, I do not see this as a final solution, as it
>> affects statistics accuracy, including that of non-offloaded flows.
>>
>> Note that in the field I've seen a lot of setups with occasional spikes
>> of more than 1/2 max-revalidator, and this should not be a problem.
>> Having to tweak OVS in all these cases just so HW-offloaded flows do not
>> time out seems a bit too much.
>>
>> I guess William can debug why he has the occasional 1+ second spikes (and
>> verify whether those spikes are responsible by dumping the last duration
>> values), but it does not address the root cause of the flow deletes, which
>> is the statistics not being used correctly.
>>
>> Cheers,
>>
>> Eelco
>>
>>>> On Tue, Jan 3, 2023 at 2:22 PM Ilya Maximets <i.maxim...@ovn.org> wrote:
>>>>>
>>>>> On 1/3/23 13:58, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 3 Jan 2023, at 13:36, Ilya Maximets wrote:
>>>>>>
>>>>>>> On 1/3/23 13:26, Eelco Chaudron wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3 Jan 2023, at 13:20, Ilya Maximets wrote:
>>>>>>>>
>>>>>>>>> On 1/3/23 12:33, Eelco Chaudron wrote:
>>>>>>>>>> Depending on the driver implementation, it can take up to 2 seconds
>>>>>>>>>> before offloaded flow statistics are updated. This causes a problem
>>>>>>>>>> with min-revalidate-pps, as old statistic values are used during
>>>>>>>>>> this period.
>>>>>>>>>>
>>>>>>>>>> This fix will wait for at least 2 seconds before assuming no packets
>>>>>>>>>> were received during this period.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Eelco Chaudron <echau...@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> Changes:
>>>>>>>>>>  - v2: Use existing ukey->offloaded so the revalidate_missed_dp_flow
>>>>>>>>>>        case is also covered.
>>>>>>>>>>
>>>>>>>>>>  ofproto/ofproto-dpif-upcall.c |   25 +++++++++++++++----------
>>>>>>>>>>  1 file changed, 15 insertions(+), 10 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>> index ad9635496..c395adbc6 100644
>>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>> @@ -2094,10 +2094,11 @@ ukey_delete(struct umap *umap, struct udpif_key *ukey)
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  static bool
>>>>>>>>>> -should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>> -                  long long int used)
>>>>>>>>>> +should_revalidate(const struct udpif *udpif, struct udpif_key *ukey,
>>>>>>>>>> +                  uint64_t packets)
>>>>>>>>>>  {
>>>>>>>>>>      long long int metric, now, duration;
>>>>>>>>>> +    long long int used = ukey->stats.used;
>>>>>>>>>>
>>>>>>>>>>      if (!used) {
>>>>>>>>>>          /* Always revalidate the first time a flow is dumped. */
>>>>>>>>>> @@ -2124,8 +2125,12 @@ should_revalidate(const struct udpif *udpif, uint64_t packets,
>>>>>>>>>>      duration = now - used;
>>>>>>>>>>      metric = duration / packets;
>>>>>>>>>>
>>>>>>>>>> -    if (metric < 1000 / ofproto_min_revalidate_pps) {
>>>>>>>>>> -        /* The flow is receiving more than min-revalidate-pps, so keep it. */
>>>>>>>>>> +    if (metric < 1000 / ofproto_min_revalidate_pps ||
>>>>>>>>>> +        (ukey->offloaded && duration < 2000)) {
>>>>>>>>>> +        /* The flow is receiving more than min-revalidate-pps, so keep it.
>>>>>>>>>> +         * Or it's a hardware offloaded flow that might take up to 2 seconds
>>>>>>>>>> +         * to update its statistics. Until we are sure the statistics had a
>>>>>>>>>> +         * chance to be updated, also keep it. */
>>>>>>>>>
>>>>>>>>> Hmm.  If you know that a flow dump is taking a long time on this
>>>>>>>>> setup, shouldn't we just bump the max-revalidator value in the
>>>>>>>>> database instead of hardcoding yet another magic constant?
>>>>>>>>
>>>>>>>> I don’t think increasing the max-revalidator value is a good solution,
>>>>>>>> as it affects the statistics gathering in general.
>>>>>>>
>>>>>>> It affects the judgement on the dump duration and the
>>>>>>> time when revalidators will wake up next.  If your
>>>>>>> dump duration is already high, your statistics gathering
>>>>>>> is already delayed.  And your hardware doesn't seem to
>>>>>>> provide statistics in a timely manner either.
>>>>>>
>>>>>> Well, this is true if the delay is constant, but this is from the last
>>>>>> run, and not all runs take the same amount of time :( Also, there is a
>>>>>> mix of hardware and kernel flows, where kernel flows do provide
>>>>>> real-time statistics vs. the up-to-2-seconds delay of HW offload.
>>>>>>
>>>>>> Take the following example: we just added some flows, which will
>>>>>> trigger multiple revalidation runs back to back. Let’s say it takes
>>>>>> 251ms to do a run (well under the ~1 second at which the hardware
>>>>>> counters are updated).
>>>>>>
>>>>>> The second iteration would purge all HW-offloaded flows, as no counter
>>>>>> was updated in the 251ms window :(
>>>>>>
>>>>>> To solve this we could set the max-revalidator time to 4 seconds, but
>>>>>> this would negatively affect all statistics gathering for flows, and
>>>>>> it might also have other side effects in existing configurations.
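>>>>>>
>>>>>> To make that concrete, a possible timeline under these assumptions
>>>>>> (251ms runs, hardware counters synced roughly once per second, and
>>>>>> the default max-revalidator of 500ms, so a 251ms run is already past
>>>>>> the '/ 2' threshold and the pps check applies; the numbers are
>>>>>> illustrative only):
>>>>>>
>>>>>>     t =    0ms  flows added and offloaded; HW counters last synced
>>>>>>     t =  251ms  revalidation run 1 completes (dump_duration = 251ms)
>>>>>>     t =  502ms  run 2 completes: HW stats still show the t = 0
>>>>>>                 values, so the flows appear to have passed 0 packets
>>>>>>                 and fall below min-revalidate-pps
>>>>>>                 -> all HW-offloaded flows are purged, roughly 500ms
>>>>>>                    before the driver would have refreshed counters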
> 
> 4 seconds is only needed if the dump takes 2 seconds.  If it's 251ms,
> 504ms will do.  But, anyway...

Please ignore the sentence above; I posted a reply in the wrong
part of the email. :/

> 
> I'm not saying that we don't have an issue with statistics.  We have
> 2 separate issues here - dump duration and statistics update frequency.
> I'm just trying to figure out which is the primary and which is the
> secondary contributor to the flow deletion issues that you observe.
> 
> My point is that if the dump duration wasn't that unreasonably high, we
> would not see flow deletions.  OTOH, if statistics updates weren't delayed,
> we would not see the issue either.  So, I propose a draw. :)
> 
> We know the root cause of the statistics delay (driver), and we don't
> know the root cause of the long dump duration.  So we need to figure
> that out.
> 
> Would still be great to know how many datapath flows we're dealing
> with here.
> 
> 'dump_duration' is a misleading name though, as it is the total
> revalidation time, including the sweep and all other phases.  So, the
> '/ 2' part of the check in should_revalidate() doesn't seem to make a
> lot of sense.  Removing it might help with the situation a bit.
> If the dump durations are too inconsistent, we could replace that metric
> with an exponential moving average of the last N runs (lib/mov-avg.h).
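> 
> As a standalone illustration of that idea (the actual lib/mov-avg.h
> helpers may differ in naming and detail from this sketch):
> 
>     /* Exponential moving average over roughly the last N samples. */
>     struct ema {
>         double mean;
>         double alpha;   /* Smoothing factor, e.g. 2.0 / (N + 1). */
>     };
> 
>     static void
>     ema_update(struct ema *e, double sample)
>     {
>         e->mean = e->alpha * sample + (1.0 - e->alpha) * e->mean;
>     }
> 
> Each run would feed its dump_duration into ema_update(), and
> should_revalidate() would compare the smoothed mean, instead of the
> last raw value, against max-revalidator.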
> 
> 
> Also, we could go the other way and just allow users to disable that
> "optimization" by setting min-revalidate-pps to zero (disabled).
> This is what Han proposed here:
>   https://patchwork.ozlabs.org/project/openvswitch/patch/20220809014100.3716850-1-hz...@ovn.org/
> I'm guessing that it's very similar in its roots, or even the exact
> same issue, that you both are trying to solve.
> 
> What do you think?
> 
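> For reference, both knobs discussed in this thread are other_config
> keys; the tuning being debated would look something like this (the
> zero value assumes the patch above is applied):
> 
>     # Give slow dumps more headroom (value in ms, default 500):
>     ovs-vsctl set Open_vSwitch . other_config:max-revalidator=4000
> 
>     # Disable the min-pps heuristic entirely (with Han's patch):
>     ovs-vsctl set Open_vSwitch . other_config:min-revalidate-pps=0
> 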
>>>>>
>>>>> If the dump takes 251ms, you only need to bump max-revalidator to 504ms,
>>>>> not 4 seconds.  We should not mix up the time it takes to update the
>>>>> statistics and the dump duration.
>>>>>
>>>>> A flow being revalidated doesn't mean it will be deleted.  The flow will
>>>>> be deleted during revalidation only if it is actually incorrect.
>>>>>
>>>>> If it is idle for 10 seconds, it will be deleted without even considering
>>>>> revalidation.  This is controlled by the max-idle config.
>>>>>
>>>>> In your case the flows didn't reach max-idle yet, but were removed
>>>>> *without* revalidation.  This happens because OVS thinks that it's too
>>>>> expensive to trigger revalidation, based on the flow dump duration.
>>>>> This is controlled by the max-revalidator config.
>>>>>
>>>>> If your actual dump takes 251ms, but max-revalidator is 504ms, the
>>>>> flow should not be removed even if it has zero stats for the next few
>>>>> seconds.  As long as the flow itself is correct.
>>>>>
>>>>> That's why we need to know what the actual dump duration is here.
>>>>>
>>>>>>
>>>>>>>> It also does not solve the problem, as any update in the flow
>>>>>>>> table/config can trigger a revalidation.
>>>>>>>
>>>>>>> It will, but there is a check right above the one that
>>>>>>> you're changing and it looks like this:
>>>>>>>
>>>>>>>     if (udpif->dump_duration < ofproto_max_revalidator / 2) {
>>>>>>>         /* We are likely to handle full revalidation for the flows. */
>>>>>>>         return true;
>>>>>>>     }
>>>>>>>
>>>>>>> So, by bumping the max-revalidator value we will avoid
>>>>>>> blind deletion and will properly revalidate the flow.
>>>>>>> The flow will not be deleted then, unless we hit the max-idle.
>>>>>>
>>>>>> See above.
>>>>>>
>>>>>>> Or am I missing something?
>>>>>>>
>>>>>>>>
>>>>>>>> If we do not like the magic static values, we can add a configuration
>>>>>>>> option, so it can be adjusted based on the driver implementation.
>>>>>>>>
>>>>>>>>> How long does the dump actually take?  And how many flows are in it?
>>>>>>>>
>>>>>>>> William, do you have these values? I did not capture any of this
>>>>>>>> when debugging your setup.
>>>>>>>>
>>>>>>>> //Eelco
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>
> 
