On 5 Jan 2023, at 14:39, Eelco Chaudron wrote:

<SNIP>

>>> I'm not saying that we don't have issue with statistics.  We do have
>>> 2 separate issues here - dump duration and statistics update frequency.
>>> Just trying to figure out which is a primary and which is a secondary
>>> contributor to the flow deletion issues that you observe.
>>>
>>> My point is that if dump durations wasn't that unreasonably high, we
>>> would not see flow deletions.  OTOH, if statistics updates wasn't delayed,
>>> we would not see the issue either.  So, I propose a draw. :)
>
> Well with the default config a run of 251ms in the default scenario could hit 
> the problem if the stats were not updated between the next and the previous 
> (too long) run.
>
>>> We know the root cause of the statistics delay (driver), and we don't
>>> know the root cause of the long dump duration.  So we need to figure
>>> that out.
>
> Yes, the excessive delay we need to figure out.
>
> William can you get some more accurate values (I think you have my reval 
> patch applied, 
> https://patchwork.ozlabs.org/project/openvswitch/patch/166367078734.325473.17899801287395410245.stgit@ebuild/,
>  so you could run the script with -q the get the values over time, maybe also 
> do -w for offline analysis).

William was able to run my revalidator script on the system for 2d and 17:16:14 
hours. The results are below, however one thing I did not realise is that the 
n_flows value is the actual dp flows, not including the hw offloaded ones.

So to determine the offloaded flows I used the n_reval_flows value, which is no 
revalidating needed to happening is reported as 0.

So here are some histograms for the entire period:

=> n_flows:
# NumSamples = 477616; Min = 0.00; Max = 20.00
# Mean = 2.564169; Variance = 2.860125; SD = 1.691190; Median 2.000000
# each ∎ represents a count of 1930
    0.0000 -     1.0000 [144762]: 
∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.0000 -     2.0000 [128073]: 
∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    2.0000 -     3.0000 [ 94831]: 
∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    3.0000 -     4.0000 [ 59869]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    4.0000 -     5.0000 [ 29455]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    5.0000 -     6.0000 [ 12025]: ∎∎∎∎∎∎
    6.0000 -     7.0000 [  4425]: ∎∎
    7.0000 -     8.0000 [  1461]:
    8.0000 -     9.0000 [   364]:
    9.0000 -    10.0000 [   123]:
   10.0000 -    11.0000 [    21]:
   11.0000 -    12.0000 [    40]:
   12.0000 -    13.0000 [   247]:
   13.0000 -    14.0000 [   569]:
   14.0000 -    15.0000 [   509]:
   15.0000 -    16.0000 [   393]:
   16.0000 -    17.0000 [   220]:
   17.0000 -    18.0000 [   169]:
   18.0000 -    19.0000 [    51]:
   19.0000 -    20.0000 [     9]:

=> n_reval_flows:
# NumSamples = 477616; Min = 0.00; Max = 148.00
# Mean = 4.193549; Variance = 336.919508; SD = 18.355367; Median 0.000000
# each ∎ represents a count of 6041
    0.0000 -     7.4000 [453117]: 
∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    7.4000 -    14.8000 [    72]:
   14.8000 -    22.2000 [   382]:
   22.2000 -    29.6000 [   117]:
   29.6000 -    37.0000 [    49]:
   37.0000 -    44.4000 [    25]:
   44.4000 -    51.8000 [    49]:
   51.8000 -    59.2000 [   239]:
   59.2000 -    66.6000 [  1020]:
   66.6000 -    74.0000 [  3640]:
   74.0000 -    81.4000 [  5819]:
   81.4000 -    88.8000 [  6191]: ∎
   88.8000 -    96.2000 [  4288]:
   96.2000 -   103.6000 [  1550]:
  103.6000 -   111.0000 [   603]:
  111.0000 -   118.4000 [   194]:
  118.4000 -   125.8000 [   126]:
  125.8000 -   133.2000 [    87]:
  133.2000 -   140.6000 [    34]:
  140.6000 -   148.0000 [    14]:

=> dump_duration:
# NumSamples = 477616; Min = 1.00; Max = 2396.00
# Mean = 20.440632; Variance = 4416.771835; SD = 66.458798; Median 2.000000
# each ∎ represents a count of 6078
    1.0000 -   120.7500 [455905]: 
∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  120.7500 -   240.5000 [ 12804]: ∎∎
  240.5000 -   360.2500 [  4972]:
  360.2500 -   480.0000 [  2455]:
  480.0000 -   599.7500 [   704]:
  599.7500 -   719.5000 [   276]:
  719.5000 -   839.2500 [   169]:
  839.2500 -   959.0000 [   137]:
  959.0000 -  1078.7500 [    65]:
 1078.7500 -  1198.5000 [    58]:
 1198.5000 -  1318.2500 [    27]:
 1318.2500 -  1438.0000 [    16]:
 1438.0000 -  1557.7500 [    11]:
 1557.7500 -  1677.5000 [     9]:
 1677.5000 -  1797.2500 [     2]:
 1797.2500 -  1917.0000 [     2]:
 1917.0000 -  2036.7500 [     1]:
 2036.7500 -  2156.5000 [     1]:
 2156.5000 -  2276.2500 [     0]:
 2276.2500 -  2396.0000 [     2]:


So it looks like the maximum number of existing flows in the system is 148 (tc 
+ kernel).
The dump duration goes from 1 ms to 2.3 seconds.

Looking at some records for +1.5s revalidate times (anything less give a lot of 
records:

Dump records where dump_duration > 1500:
  duration =  2072, n_reval_[flows] =    16[ 3]
  duration =  1534, n_reval_[flows] =    14[ 6]
  duration =  1686, n_reval_[flows] =    16[ 3]
  duration =  1599, n_reval_[flows] =    30[ 3]
  duration =  1619, n_reval_[flows] =    13[ 2]
  duration =  2313, n_reval_[flows] =    15[ 3]
  duration =  1523, n_reval_[flows] =    18[ 2]
  duration =  1611, n_reval_[flows] =    13[ 4]
  duration =  1603, n_reval_[flows] =    18[ 2]
  duration =  1626, n_reval_[flows] =    15[ 1]
  duration =  1807, n_reval_[flows] =    17[ 2]
  duration =  1694, n_reval_[flows] =    17[ 4]
  duration =  1558, n_reval_[flows] =    19[ 5]
  duration =  2396, n_reval_[flows] =    31[ 6]
  duration =  1996, n_reval_[flows] =    12[ 5]
  duration =  1866, n_reval_[flows] =    12[ 2]
  duration =  1656, n_reval_[flows] =    15[ 2]
  duration =  1659, n_reval_[flows] =    15[ 1]
  duration =  1544, n_reval_[flows] =     0[ 3]
  duration =  1559, n_reval_[flows] =    16[ 5]


Note that even with 12 flows we can still take 1.6 seconds.
However, taking some excessive records not showing issues:

duration =     3, n_reval_[flows] =    81[ 2]
duration =     2, n_reval_[flows] =   121[14]
duration =   108, n_reval_[flows] =   127[14]

It stays well under /2 max reval.

So to me, this looks like something outside of OVS is causing the revalidator 
runs to take long.
Or at least it’s not related to the number of active flows.


William, maybe you can get some system load stats and see if we can relate them 
to the long dump duration?


<SNIP>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to