On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 2/5/24 15:45, Ilya Maximets wrote:
> > On 2/5/24 11:34, Ilya Maximets wrote:
> >> On 2/5/24 09:23, Dumitru Ceara wrote:
> >>> On 2/5/24 08:13, Han Zhou wrote:
> >>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <num...@ovn.org> wrote:
> >>>>>
> >>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <hz...@ovn.org> wrote:
> >>>>>>
> >>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> I had another look at this series and acked the remaining
> >>>>>>>>>> patches.  I just had some minor comments that can be easily
> >>>>>>>>>> fixed when applying the patches to the main branch.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the work on this!  It was a very large change
> >>>>>>>>>> but it improves northd performance significantly.  I just hope
> >>>>>>>>>> we don't introduce too many bugs.  Hopefully the time we have
> >>>>>>>>>> until release will allow us to further test this change on the
> >>>>>>>>>> 24.03 branch.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Dumitru
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks a lot Dumitru and Han for the reviews and patience.
> >>>>>>>>>
> >>>>>>>>> I addressed the comments and applied the patches to main and
> >>>>>>>>> also to branch-24.03.
> >>>>>>>>>
> >>>>>>>>> @Han - I know you wanted to take another look into v6.  I didn't
> >>>>>>>>> want to delay further as branch-24.03 was created.  I'm more than
> >>>>>>>>> happy to submit follow-up patches if you have any comments to
> >>>>>>>>> address.  Please let me know.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Numan,
> >>>>>>>>
> >>>>>>>> I was writing a reply and saw your email just now. Thanks a lot
> >>>>>>>> for taking on the huge effort to achieve this great optimization.
> >>>>>>>> I only left one comment, on the implicit dependency left for the
> >>>>>>>> en_lrnat -> en_lflow. Feel free to address it with a follow-up; no
> >>>>>>>> need to block the branching. And take my Ack for the series with
> >>>>>>>> that addressed.
> >>>>>>>>
> >>>>>>>> Acked-by: Han Zhou <hzhou@ovn.org>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi, Numan, Dumitru and Han.
> >>>>>>>
> >>>>>>> I see a huge negative performance impact, most likely from this
> >>>>>>> set, on ovn-heater's cluster-density tests.  The memory consumption
> >>>>>>> on northd
> >>>
> >>> Thanks for reporting this, Ilya!
> >>>
> >>>>>>> jumped about 4x and it constantly recomputes due to failures of the
> >>>>>>> port_group handler:
> >>>>>>>
> >>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
> >>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
> >>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
> >>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
> >>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
> >>>>>>>
> >>>>>>> That increases the 95th percentile ovn-installed latency in the
> >>>>>>> 500-node cluster-density test from 3.6 seconds last week to 21.5
> >>>>>>> seconds this week.
> >>>>>>>
> >>>>>>> I think this should be a release blocker.
> >>>>>>>
> >>>>>>> Memory usage is also very concerning.  Unfortunately, it is not
> >>>>>>> tied to the cluster-density test.  The same 4-5x RSS jump is also
> >>>>>>> seen in other tests like density-heavy.  Last week the RSS of
> >>>>>>> ovn-northd in the 500-node cluster-density test was between 1.5 and
> >>>>>>> 2.5 GB; this week we have a range between 5.5 and 8.5 GB.
> >>>>>>>
> >>>>>>> I would consider this a release blocker as well.
> >>>>>>>
> >>>
> >>> I agree, we shouldn't release 24.03.0 unless these two issues are
> >>> (sufficiently) addressed.  We do have until March 1st (official
> >>> release date) to do that or to revert any patches that cause
> >>> regressions.
> >>>
> >>>>>>>
> >>>>>>> I don't have direct evidence that this particular series is the
> >>>>>>> culprit, but it looks like the most likely candidate.  I can dig
> >>>>>>> more into the investigation on Monday.
> >>>>>>>
> >>>>>>> Best regards, Ilya Maximets.
> >>>>>>
> >>>>>> Thanks Ilya for reporting this. The 95% latency and 4x RSS increases
> >>>>>> are a little surprising to me. I did test this series with my scale
> >>>>>> test scripts for recompute performance regression, and saw a 10+%
> >>>>>> increase in latency. I even dug a little into it, and noticed a ~5%
> >>>>>> increase caused by the hmap used to maintain the lflows in each
> >>>>>> lflow_ref. This was discussed in the code review for an earlier
> >>>>>> version (v2/v3). Overall it didn't look too bad, given that we now
> >>>>>> handle the most common scenarios incrementally, and it is reasonable
> >>>>>> to have some cost for maintaining the references/index for
> >>>>>> incremental processing. I wondered if my test scenario was too
> >>>>>> simple (it didn't include LBs) to find the problems, so today I did
> >>>>>> another test by including a LB group with 1k LBs applied to 100
> >>>>>> node-LS & GR, and another 1k LBs per node-LS & GR (101k LBs in
> >>>>>> total). I did see more performance penalty, but still within ~20%,
> >>>>>> while for memory I didn't notice a significant increase (<10%). I
> >>>>>> believe I am missing some specific scenario that had the big impact
> >>>>>> in the ovn-heater tests. Please share if you dig out more clues.
> >>>>>
> >>>>> Hi Ilya,
> >>>>>
> >>>>> Thanks for reporting these details.
> >>>>>
> >>>>> I had a look at this regression.  There is a significant increase in
> >>>>> the lflow recompute time (around 4x) in my local testing, and this is
> >>>>> definitely not acceptable.
> >>>>>
> >>>>> In this particular cluster-density test, whenever a port group is
> >>>>> created it results in a full recompute, and now that the recompute
> >>>>> time has increased, it has a cumulative effect on the latency of the
> >>>>> test.
> >>>>>
> >>>>> The dp reference counting [1] added in v4 of this series has
> >>>>> introduced this regression (both CPU and memory).  I'm working on a
> >>>>> fix and I think I should be able to address this soon.
> >>>>>
> >>>>> [1] - https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
> >>>>>
> >>>>>
> >>>> Indeed, the same concern I mentioned when reviewing v5:
> >>>> https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
> >>
> >> Yeah.  In this cluster-density 500-node test, dp_refcnt_use() allocates
> >> 120M references and puts them into hash maps.  That alone takes several
> >> GB of memory and is the largest consumer according to massif: it takes
> >> ~45% of total memory allocations.  Another ~25% is taken by the hash
> >> maps where the refcounts are stored.
> >>
> >> There is a simple optimization that can be made: do not allocate a
> >> refcount the first time we try to add one.  The original bitmap should
> >> be enough.  Only if we try to use it again while it is already set in
> >> the bitmap do we allocate.  That saves a lot of space, because 80% of
> >> all refcounts in that test are never used more than once.  However, it
> >> might not be enough; this needs testing.  Ideally, the whole structure
> >> needs to go or be replaced by something different.  Frequent iterations
> >> over the bitmaps and per-datapath*per-lflow memory allocations do not
> >> scale.  Even if we can make it work reasonably well for 500 nodes, it
> >> will cause problems at higher scale.
> >
> > FWIW, I have a small prototype for this partial solution that seems to
> > noticeably decrease the memory usage here:
> > https://github.com/igsilya/ovn/commit/5ac3a78e287708fd571344c74a021e418fde31a4
> >
> > I need to run this through the scale run to get actual results though.
>
> So, the results for the cluster-density 500-node test are here:
>
>           | Avg. Poll Intervals | Total test time |  northd RSS
> ----------+---------------------+-----------------+---------------
> Last week |     1.5 seconds     |  1005 seconds   |    2.5 GB
> This week |      6  seconds     |  2246 seconds   |    8.5 GB
> Patch     |     2.5 seconds     |  1170 seconds   |    3.1 GB
>
> So, the patch does help a lot.  But we still have a 66% increase
> in the poll intervals on northd, i.e. a 66% increase in recompute
> time, and about a 25% increase in memory usage.
>
> I know that Numan is working on an alternative solution that removes
> the reference counting altogether, so I'll hold off on sending an
> official patch.  Removing the refcounting, if possible, sounds like
> a better approach in general.
>

Thanks Ilya. I also tested your patch with my previous test case that had a
50% increase in latency and memory, and compared it with a patch that
completely removes the ref count logic but keeps the hmap in lflow (without
really using it after initialization). The difference is really small in my
test (~1%) in both memory and latency, so I think your patch is quite
effective. I believe the ref count will very rarely be larger than 1 in
reality, so the memory cost should be negligible after applying your patch.

I was actually hesitant about the approach, primarily concerning the
latency, because the bitmap scan still happens and for each set bit it will
perform another check. However, in my test the CPU cost also seemed
negligible. I checked a perf flamegraph and the bitmap operations didn't
show up as a significant part.
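
For the archive, here is how I read the lazy-allocation scheme in your
prototype. This is only a simplified sketch with hypothetical names and
layout (dp_refcnt_find() and the struct fields are made up for
illustration; it is not the actual lflow-mgr.c code, and it assumes the
usual OVS lib headers: hmap.h, bitmap.h, hash.h, util.h):

/* Refcount entry, allocated only for datapaths used more than once. */
struct dp_refcnt {
    struct hmap_node key_node;
    size_t dp_index;   /* Datapath index, used as the hmap key. */
    size_t refcnt;     /* Counts only the uses beyond the first one. */
};

static struct dp_refcnt *
dp_refcnt_find(struct hmap *dp_refcnts, size_t dp_index)
{
    struct dp_refcnt *rc;
    HMAP_FOR_EACH_WITH_HASH (rc, key_node, hash_int(dp_index, 0),
                             dp_refcnts) {
        if (rc->dp_index == dp_index) {
            return rc;
        }
    }
    return NULL;
}

static void
dp_refcnt_use(struct hmap *dp_refcnts, unsigned long *dpg_bitmap,
              size_t dp_index)
{
    if (!bitmap_is_set(dpg_bitmap, dp_index)) {
        /* First use: setting the bitmap bit is enough, no allocation. */
        bitmap_set1(dpg_bitmap, dp_index);
        return;
    }

    /* Second and later uses: fall back to an allocated refcount. */
    struct dp_refcnt *rc = dp_refcnt_find(dp_refcnts, dp_index);
    if (!rc) {
        rc = xzalloc(sizeof *rc);
        rc->dp_index = dp_index;
        hmap_insert(dp_refcnts, &rc->key_node, hash_int(dp_index, 0));
    }
    rc->refcnt++;
}

On release the inverse would apply: decrement and free the hmap entry if
one exists, otherwise just clear the bitmap bit. If ~80% of the refcounts
are never used more than once, as you measured, most datapaths never leave
the bitmap-only path, which would explain the memory savings.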

So, your patch seems to me to be a good candidate solution for this
particular problem. I am again surprised at the cluster-density 500-node
test result you shared above. Could it be that the majority of the 66%
increase in compute time and the 25% increase in memory comes from a
different part of the change, other than the ref count? Did you test with
the ref count completely removed (of course there is a correctness problem
when the ref count would need to be > 1, but just ignore that for now for
the performance evaluation)?

It would be great if Numan comes up with an approach that avoids the ref
count, and I am curious which direction it takes. Initially I had the
opinion that for I-P we should avoid merging lflows' DPs/DPGs and instead
directly create lflows for individual DPs, or for DPGs generated beforehand
(e.g. for LBs applied to multiple DPs), even if some redundant flows may be
generated. However, after reviewing Numan's patch, the merged approach
seemed more generic, and I think it is ok to go with it if the performance
cost is acceptable, although it does add more complexity.
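
To make the two directions concrete, here is a rough pseudo-C sketch; the
helpers lflow_add_for_datapath(), lflow_add_shared(), and
FOR_EACH_DP_IN_GROUP() are made up for illustration and are not the actual
northd API:

/* Direction 1: no merging during I-P.  Every datapath gets its own copy
 * of the logical flow: more (redundant) rows in the Southbound database,
 * but no shared state that would need reference counting. */
FOR_EACH_DP_IN_GROUP (dp, lb_dp_group) {
    lflow_add_for_datapath(lflow_table, dp, stage, priority,
                           match, actions);
}

/* Direction 2 (the series, in spirit): one logical flow shared by the
 * whole group, tracked through a per-lflow datapath bitmap.  Fewer rows,
 * but incremental removal now has to know how many users each bit has,
 * which is where the refcounting under discussion comes in. */
struct ovn_lflow *lflow = lflow_add_shared(lflow_table, stage, priority,
                                           match, actions);
FOR_EACH_DP_IN_GROUP (dp, lb_dp_group) {
    bitmap_set1(lflow->dpg_bitmap, dp->index);
}

The trade-off is essentially row count in the Southbound database versus
bookkeeping complexity in northd.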

For Dumitru's proposal:
>> Maybe it would be an idea to integrate some of Han's performance testing
>> scripts into the set of tests we already have in the upstream repo,
>> ovn-performance.at [0], and run those in GitHub actions too.

>> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at

>> Han, others, what do you think?

I've been thinking about this for a long time, but it wasn't prioritized.
It is not easy, for the reasons mentioned by Ilya, but I agree it is
something we need to figure out. For memory increases the check should be
pretty reliable, but for CPU/latency we might need to think about metrics
that take into account the performance of the node that executes the test
cases. Of course, hardware resources (CPU and memory of the test node) vs.
the scale we can test is another concern, but we may start with a scale
that is small enough to run in GitHub Actions while big enough to provide
good performance metrics.

Thanks,
Han

> Best regards, Ilya Maximets.