On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> >>>
> >>> >  35 files changed, 9681 insertions(+), 4645 deletions(-)
> >>>
> >>> I had another look at this series and acked the remaining patches.  I
> >>> just had some minor comments that can be easily fixed when applying
> >>> the patches to the main branch.
> >>>
> >>> Thanks for all the work on this!  It was a very large change but it
> >>> improves northd performance significantly.  I just hope we don't
> >>> introduce too many bugs.  Hopefully the time we have until release
> >>> will allow us to further test this change on the 24.03 branch.
> >>>
> >>> Regards,
> >>> Dumitru
> >>
> >>
> >>
> >> Thanks a lot Dumitru and Han for the reviews and patience.
> >>
> >> I addressed the comments and applied the patches to main and also to
> >> branch-24.03.
> >>
> >> @Han - I know you wanted to take another look into v6.  I didn't want
> >> to delay further as branch-24.03 was created.  I'm more than happy to
> >> submit follow-up patches if you have any comments to address.  Please
> >> let me know.
> >>
> >
> > Hi Numan,
> >
> > I was writing my reply and saw your email just now.  Thanks a lot for
> > the huge effort you put into achieving this great optimization.  I only
> > left one comment, on the implicit dependency left between en_lrnat and
> > en_lflow.  Feel free to address it with a follow-up; no need to block
> > the branching.  And take my Ack for the series with that addressed.
> >
> > Acked-by: Han Zhou <hzhou@ovn.org>
>
>
> Hi, Numan, Dumitru and Han.
>
> I see a huge negative performance impact, most likely from this set, on
> ovn-heater's cluster-density tests.  The memory consumption of northd
> jumped about 4x, and it constantly recomputes due to failures of the
> port_group handler:
>
> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
> ...
> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
> ...
> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
> ...
> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
>
> That increases the 95th percentile ovn-installed latency in the 500-node
> cluster-density test from 3.6 seconds last week to 21.5 seconds this week.
>
> I think this should be a release blocker.
>
> Memory usage is also very concerning.  Unfortunately, it is not tied to
> the cluster-density test.  The same 4-5x RSS jump is also seen in other
> tests, such as density-heavy.  Last week, the RSS of ovn-northd in the
> 500-node cluster-density test was between 1.5 and 2.5 GB; this week we
> see a range between 5.5 and 8.5 GB.
>
> I would consider this a release blocker as well.
>
>
> I don't have direct evidence that this particular series is the culprit,
> but it looks like the most likely candidate.  I can dig further into the
> investigation on Monday.
>
> Best regards, Ilya Maximets.
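
For anyone following along: a "recompute (failed handler for input
port_group)" line means the lflow node's incremental change handler for its
port_group input returned failure, so the I-P engine fell back to a full
recompute of that node.  The fallback pattern is roughly the sketch below;
the names are illustrative and simplified, not the actual lib/inc-proc-eng.c
code.

    /* Illustrative sketch of the I-P engine's fallback-to-recompute
     * behavior; names do not match the actual lib/inc-proc-eng.c. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct engine_node;

    struct engine_input {
        struct engine_node *node;          /* E.g. the port_group node. */
        /* Returns false when the change can't be handled incrementally. */
        bool (*change_handler)(struct engine_node *output);
    };

    struct engine_node {
        const char *name;                  /* E.g. "lflow". */
        struct engine_input *inputs;
        size_t n_inputs;
        void (*full_recompute)(struct engine_node *);
    };

    static void
    engine_compute(struct engine_node *node)
    {
        for (size_t i = 0; i < node->n_inputs; i++) {
            struct engine_input *in = &node->inputs[i];

            if (in->change_handler && in->change_handler(node)) {
                continue;                  /* Handled incrementally. */
            }
            /* One failed handler forces a full recompute of the node,
             * which is what dominates the long poll intervals above. */
            printf("node: %s, recompute (failed handler for input %s)\n",
                   node->name, in->node->name);
            node->full_recompute(node);
            return;
        }
    }

    static bool
    port_group_handler(struct engine_node *output)
    {
        (void) output;
        return false;    /* Simulate a change we can't handle. */
    }

    static void
    lflow_recompute(struct engine_node *node)
    {
        (void) node;     /* Rebuild all logical flows from scratch. */
    }

    int
    main(void)
    {
        struct engine_node port_group = { .name = "port_group" };
        struct engine_input input = { &port_group, port_group_handler };
        struct engine_node lflow = {
            .name = "lflow",
            .inputs = &input,
            .n_inputs = 1,
            .full_recompute = lflow_recompute,
        };

        engine_compute(&lflow);    /* Prints the recompute message. */
        return 0;
    }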

Thanks, Ilya, for reporting this.  The 95th percentile latency and the 4x
RSS increases are a little surprising to me.  I did test this series with
my scale test scripts for recompute performance regression, and it showed
a 10+% increase in latency.  I even dug into it a little and noticed a ~5%
increase caused by the hmap used to maintain the lflows in each lflow_ref
(sketched below).  This was discussed in the code review for an earlier
version (v2/v3).  Overall it didn't look too bad, given that we now handle
the most common scenarios incrementally, and it is reasonable to pay some
cost for maintaining the references/index needed for incremental
processing.

I wondered if my test scenario was too simple (it didn't include LBs) to
expose the problems, so today I ran another test that included an LB group
with 1k LBs applied to 100 node-LS & GR, plus another 1k LBs per node-LS &
GR (101k LBs in total).  I did see a bigger performance penalty, but still
within ~20%, and for memory I didn't notice a significant increase (<10%).
I believe I am missing the specific scenario that had the big impact in
the ovn-heater tests.  Please share if you dig up more clues.
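
The structure I mean looks roughly like the sketch below.  It leans on
OVS's hmap, but the names are illustrative and simplified rather than the
actual northd lflow-mgr code.  Each object that can be handled
incrementally (a port, an LB, ...) keeps an lflow_ref, and every logical
flow it produces costs one heap-allocated hmap entry, which is where the
extra cost I measured comes from.

    /* Illustrative sketch only -- not the actual northd code. */
    #include "hash.h"
    #include "openvswitch/hmap.h"
    #include "util.h"

    struct ovn_lflow;                  /* A logical flow owned by northd. */

    /* One entry per logical flow referenced by a given object. */
    struct lflow_ref_node {
        struct hmap_node hmap_node;    /* In struct lflow_ref's 'lflows'. */
        struct ovn_lflow *lflow;
    };

    /* Kept by every object that can be handled incrementally, so that
     * the object's flows can be updated without a full recompute. */
    struct lflow_ref {
        struct hmap lflows;            /* Of 'struct lflow_ref_node's. */
    };

    static void
    lflow_ref_track(struct lflow_ref *ref, struct ovn_lflow *lflow)
    {
        /* 'ref->lflows' must already have been hmap_init()'d. */
        struct lflow_ref_node *n = xmalloc(sizeof *n);

        n->lflow = lflow;
        hmap_insert(&ref->lflows, &n->hmap_node, hash_pointer(lflow, 0));
    }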

Thanks,
Han