>>>
>>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
>>>
>>> I had another look at this series and acked the remaining patches.  I
>>> just had some minor comments that can be easily fixed when applying the
>>> patches to the main branch.
>>>
>>> Thanks for all the work on this!  It was a very large change but it
>>> improves northd performance significantly.  I just hope we don't
>>> introduce too many bugs.  Hopefully the time we have until release will
>>> allow us to further test this change on the 24.03 branch.
>>>
>>> Regards,
>>> Dumitru
>>
>>
>>
>> Thanks a lot Dumitru and Han for the reviews and patience.
>>
>> I addressed the comments and applied the patches to main and also to
>> branch-24.03.
>>
>> @Han - I know you wanted to take another look into v6.  I didn't want to
>> delay further as branch-24.03 was created.  I'm more than happy to submit
>> follow-up patches if you have any comments to address.  Please let me know.
>>
> 
> Hi Numan,
> 
> I was writing a reply and saw your email just now. Thanks a lot for the
> huge effort to achieve this great optimization. I only left one comment,
> about the implicit dependency that remains between en_lrnat -> en_lflow.
> Feel free to address it with a follow-up; no need to block the branching.
> And take my Ack for the series with that addressed.
> 
> Acked-by: Han Zhou <hzhou at ovn.org>


Hi, Numan, Dumitru and Han.

I see a huge negative performance impact, most likely from this set, on
ovn-heater's cluster-density tests.  The memory consumption of northd
jumped about 4x, and it constantly recomputes due to failures of the
port_group handler (a rough sketch of this recompute fallback follows the
log excerpt below):

2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
...
2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
...
2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
...
2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
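
For context, here is a minimal, self-contained sketch (in C, not the actual
northd code) of the fallback behaviour those log messages describe: when a
change handler for an input node such as port_group cannot process an update
incrementally, the engine falls back to a full recompute of the dependent
node.  All structure and function names in the sketch are illustrative, not
OVN's real incremental-processing API.

/* Minimal sketch, not the actual northd code: how an incremental
 * processing node presumably falls back to a full recompute when the
 * change handler for one of its inputs fails.  All names below are
 * illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct engine_node {
    const char *name;                             /* e.g. "lflow" */
    const char *input_name;                       /* e.g. "port_group" */
    bool (*input_handler)(struct engine_node *);  /* incremental path */
    void (*full_recompute)(struct engine_node *); /* expensive path */
};

/* Pretend the incremental update cannot be applied. */
static bool
port_group_handler(struct engine_node *node)
{
    (void) node;
    return false;
}

static void
lflow_recompute(struct engine_node *node)
{
    printf("node: %s, recompute (failed handler for input %s)\n",
           node->name, node->input_name);
}

/* If the handler cannot apply the change incrementally, recompute the
 * whole node; with large databases this full recompute is what
 * dominates the long poll intervals shown in the log above. */
static void
engine_compute(struct engine_node *node)
{
    if (!node->input_handler || !node->input_handler(node)) {
        node->full_recompute(node);
    }
}

int
main(void)
{
    struct engine_node lflow = {
        .name = "lflow",
        .input_name = "port_group",
        .input_handler = port_group_handler,
        .full_recompute = lflow_recompute,
    };
    engine_compute(&lflow);
    return 0;
}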

That increases the 95th percentile ovn-installed latency in the 500-node
cluster-density test from 3.6 seconds last week to 21.5 seconds this week.

I think this should be a release blocker.

Memory usage is also very concerning, and unfortunately it is not tied to
the cluster-density test: the same 4-5x RSS jump is also seen in other tests
like density-heavy.  Last week the RSS of ovn-northd in the 500-node
cluster-density test was between 1.5 and 2.5 GB; this week it ranges between
5.5 and 8.5 GB.

I would consider this a release blocker as well.


I don't have direct evidence that this particular series is the culprit,
but it looks like the most likely candidate.  I can dig further into the
investigation on Monday.

Best regards, Ilya Maximets.