On Mon, Feb 5, 2024 at 9:41 PM Han Zhou <hz...@ovn.org> wrote:
>
> On Mon, Feb 5, 2024 at 4:12 PM Numan Siddique <num...@ovn.org> wrote:
> >
> > On Mon, Feb 5, 2024 at 5:54 PM Han Zhou <hz...@ovn.org> wrote:
> > >
> > > On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets <i.maxim...@ovn.org>
> wrote:
> > > >
> > > > On 2/5/24 15:45, Ilya Maximets wrote:
> > > > > On 2/5/24 11:34, Ilya Maximets wrote:
> > > > >> On 2/5/24 09:23, Dumitru Ceara wrote:
> > > > >>> On 2/5/24 08:13, Han Zhou wrote:
> > > > >>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <num...@ovn.org>
> wrote:
> > > > >>>>>
> > > > >>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <hz...@ovn.org> wrote:
> > > > >>>>>>
> > > > >>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <
> i.maxim...@ovn.org>
> > > wrote:
> > > > >>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>>  35 files changed, 9681 insertions(+), 4645 deletions(-)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> I had another look at this series and acked the remaining
> > > > >>>> patches.  I
> > > > >>>>>>>>>> just had some minor comments that can be easily fixed when
> > > > >>>> applying
> > > > >>>>>> the
> > > > >>>>>>>>>> patches to the main branch.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Thanks for all the work on this!  It was a very large
> change
> > > but
> > > > >>>> it
> > > > >>>>>>>>>> improves northd performance significantly.  I just hope we
> > > don't
> > > > >>>>>>>>>> introduce too many bugs.  Hopefully the time we have until
> > > release
> > > > >>>>>> will
> > > > >>>>>>>>>> allow us to further test this change on the 24.03 branch.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Regards,
> > > > >>>>>>>>>> Dumitru
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks a lot Dumitru and Han for the reviews and patience.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I addressed the comments and applied the patches to main and
> > > also
> > > > >>>> to
> > > > >>>>>>>> branch-24.03.
> > > > >>>>>>>>>
> > > > >>>>>>>>> @Han - I know you wanted to take another look into v6.  I
> > > didn't
> > > > >>>> want
> > > > >>>>>> to
> > > > >>>>>>>> delay further as branch-24.03 was created.  I'm more than
> happy
> > > to
> > > > >>>>>> submit
> > > > >>>>>>>> follow up patches if you have any comments to address.
> Please
> > > let
> > > > >>>> me
> > > > >>>>>> know.
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> Hi Numan,
> > > > >>>>>>>>
> > > > >>>>>>>> I was writing the reply and saw your email just now. Thanks
> a lot
> > > > >>>> for
> > > > >>>>>>>> taking a huge effort to achieve the great optimization. I
> only
> > > left
> > > > >>>> one
> > > > >>>>>>>> comment on the implicit dependency left for the en_lrnat ->
> > > > >>>> en_lflow.
> > > > >>>>>> Feel
> > > > >>>>>>>> free to address it with a followup and no need to block the
> > > > >>>> branching.
> > > > >>>>>> And
> > > > >>>>>>>> take my Ack for the series with that addressed.
> > > > >>>>>>>>
> > > > >>>>>>>> Acked-by: Han Zhou <hzhou at ovn.org>
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>>>> Hi, Numan, Dumitru and Han.
> > > > >>>>>>>
> > > > >>>>>>> I see a huge negative performance impact, most likely from
> this
> > > set,
> > > > >>>> on
> > > > >>>>>>> ovn-heater's cluster-density tests.  The memory consumption on
> > > northd
> > > > >>>
> > > > >>> Thanks for reporting this, Ilya!
> > > > >>>
> > > > >>>>>>> jumped about 4x and it constantly recomputes due to failures
> of
> > > > >>>> port_group
> > > > >>>>>>> handler:
> > > > >>>>>>>
> > > > >>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow,
> > > > >>>> recompute
> > > > >>>>>> (failed handler for input port_group) took 9762ms
> > > > >>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long
> > > 9898ms
> > > > >>>> poll
> > > > >>>>>> interval (5969ms user, 1786ms system)
> > > > >>>>>>> ...
> > > > >>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow,
> > > > >>>> recompute
> > > > >>>>>> (failed handler for input port_group) took 9014ms
> > > > >>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long
> > > 9118ms
> > > > >>>> poll
> > > > >>>>>> interval (5376ms user, 1515ms system)
> > > > >>>>>>> ...
> > > > >>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow,
> > > > >>>> recompute
> > > > >>>>>> (failed handler for input port_group) took 10695ms
> > > > >>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long
> > > 10890ms
> > > > >>>>>> poll interval (6085ms user, 2745ms system)
> > > > >>>>>>> ...
> > > > >>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow,
> > > > >>>> recompute
> > > > >>>>>> (failed handler for input port_group) took 9985ms
> > > > >>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long
> > > 10108ms
> > > > >>>>>> poll interval (5521ms user, 2440ms system)
> > > > >>>>>>>
> > > > >>>>>>> That increases the 95th percentile ovn-installed latency in 500-node
> > > cluster-density
> > > > >>>> from
> > > > >>>>>>> 3.6 seconds last week to 21.5 seconds this week.
> > > > >>>>>>>
> > > > >>>>>>> I think, this should be a release blocker.
> > > > >>>>>>>
> > > > >>>>>>> Memory usage is also very concerning.  Unfortunately it is not
> > > tied
> > > > >>>> to the
> > > > >>>>>>> cluster-density test.  The same 4-5x RSS jump is also seen in
> > > other
> > > > >>>> tests
> > > > >>>>>>> like density-heavy.  Last week RSS of ovn-northd in
> > > cluster-density
> > > > >>>> 500
> > > > >>>>>> node
> > > > >>>>>>> was between 1.5 and 2.5 GB, this week we have a range between
> 5.5
> > > and
> > > > >>>> 8.5
> > > > >>>>>> GB.
> > > > >>>>>>>
> > > > >>>>>>> I would consider this as a release blocker as well.
> > > > >>>>>>>
> > > > >>>
> > > > >>> I agree, we shouldn't release 24.03.0 unless these two issues are
> > > > >>> (sufficiently) addressed.  We do have until March 1st (official
> > > release
> > > > >>> date) to do that or to revert any patches that cause regressions.
> > > > >>>
> > > > >>>>>>>
> > > > >>>>>>> I don't have direct evidence that this particular series is a
> > > > >>>> culprit, but
> > > > >>>>>>> it looks like the most likely candidate.  I can dig more into
> > > > >>>>>> investigation
> > > > >>>>>>> on Monday.
> > > > >>>>>>>
> > > > >>>>>>> Best regards, Ilya Maximets.
> > > > >>>>>>
> > > > >>>>>> Thanks Ilya for reporting this. 95% latency and 4x RSS
> increase is
> > > a
> > > > >>>> little
> > > > >>>>>> surprising to me. I did test this series with my scale test
> > > scripts for
> > > > >>>>>> recompute performance regression. It was 10+% increase in
> latency.
> > > I
> > > > >>>> even
> > > > >>>>>> dug a little into it, and noticed ~5% increase caused by the
> > > hmap
> > > > >>>> used
> > > > >>>>>> to maintain the lflows in each lflow_ref. This was discussed in
> > > the code
> > > > >>>>>> review for an earlier version (v2/v3). Overall it looked not
> very
> > > bad,
> > > > >>>> if
> > > > >>>>>> we now handle most common scenarios incrementally, and it is
> > > reasonable
> > > > >>>> to
> > > > >>>>>> have some cost for maintaining the references/index for
> incremental
> > > > >>>>>> processing. I wonder if my test scenario was too simple (didn't
> > > have LBs
> > > > >>>>>> included) to find the problems, so today I did another test by
> > > > >>>> including a
> > > > >>>>>> LB group with 1k LBs applied to 100 node-LS & GR, and another
> 1K
> > > LBs per
> > > > >>>>>> node-LS & GR (101K LBs in total), and I did see more
> performance
> > > penalty
> > > > >>>>>> but still within ~20%. While for memory I didn't notice a
> > > significant
> > > > >>>>>> increase (<10%). I believe I am missing some specific scenario
> > > that had
> > > > >>>> the
> > > > >>>>>> big impact in the ovn-heater's tests. Please share if you dig
> out
> > > more
> > > > >>>>>> clues.
> > > > >>>>>
> > > > >>>>> Hi Ilya,
> > > > >>>>>
> > > > >>>>> Thanks for reporting these details.
> > > > >>>>>
> > > > >>>>> I had a look at this regression.   There is a significant
> increase
> > > in
> > > > >>>>> the lflow recompute
> > > > >>>>> time (around 4x) in my local testing and this is definitely not
> > > > >>>>> acceptable.
> > > > >>>>>
> > > > >>>>> In this particular cluster density test, whenever a port group
> is
> > > > >>>>> created it results in a full recompute
> > > > >>>>> and now since recompute time has increased, it has a cumulative
> > > effect
> > > > >>>>> on the latency of the test.
> > > > >>>>>
> > > > >>>>> The dp reference counting [1] added in the v4 of this series has
> > > > >>>>> introduced this regression (both CPU and memory).
> > > > >>>>> I'm working on this fix and I think I should be able to address
> > > this soon.
> > > > >>>>>
> > > > >>>>> [1] -
> > > https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
> > > > >>>>>
> > > > >>>>>
> > > > >>>> Indeed, the same concern I mentioned when reviewing v5:
> > > > >>>>
> > > https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
> > > > >>
> > > > >> Yeah.  In this cluster-density 500node test, dp_refcnt_use()
> allocates
> > > > >> 120 M references and puts them into hash maps.  That alone takes
> > > several
> > > > >> GB of memory.  And it is the largest consumer according to massif.
> > > > >> It takes ~45% of total memory allocations.  Another ~25% are taken
> by
> > > hash
> > > > >> maps where the refcounts are stored.
> > > > >>
> > > > >> There is a simple optimization that can be made - not allocate
> > > refcounts
> > > > >> if it is the first time we're trying to add them.  The original
> bitmap
> > > > >> should be enough.  And if we try to use it again while it is
> already in
> > > > >> a bitmap, then allocate.  That saves a lot of space, because 80%
> of all
> > > > >> refcounts in that test are never used more than once.  However, it
> might
> > > not
> > > > >> be enough.  Needs testing.  Ideally the whole structure needs to
> go or
> > > > >> be replaced by something different.  Frequent iterations over the
> > > bitmaps
> > > > >> and per-datapath*per-lflow memory allocations do not scale.  Even
> if we
> > > > >> can make it work reasonably well for 500 nodes, it will cause
> problems
> > > on
> > > > >> higher scale.
> > > > >
> > > > > FWIW, I have a small prototype for this partial solution that seems
> to
> > > > > noticeably decrease the memory usage here:
> > > > >
> > >
> https://github.com/igsilya/ovn/commit/5ac3a78e287708fd571344c74a021e418fde31a4
> > > > >
> > > > > I need to run this through the scale run to get actual results
> though.
> > > >
> > > > So, the results for the cluster-density 500 node test are here:
> > > >
> > > >           | Avg. Poll Intervals | Total test time |  northd RSS
> > > > --------------------------------+-----------------+---------------
> > > > Last week |     1.5 seconds     |  1005 seconds   |    2.5 GB
> > > > This week |      6  seconds     |  2246 seconds   |    8.5 GB
> > > > Patch     |     2.5 seconds     |  1170 seconds   |    3.1 GB
> > > >
> > > > So, the patch does help a lot.  But we still have 66%  increase
> > > > of the poll intervals on northd, i.e. 66% increase in recompute
> > > > time.  And about 25% increase in memory usage.
> > > >
> > > > I know that Numan is working on alternative solution to just
> > > > remove the reference counting altogether, so I'll hold on sending
> > > > an official patch.  Removing the refcounting, if possible, sounds
> > > > like a better approach in general.
> > > >
> > >
> > > Thanks Ilya. I also tested with your patch for my previous test case
> that
> > > had 50% increase on latency and memory, and compared with a patch that
> > > completely removes the ref count logic but only kept the hmap in lflow
> > > (without really using it after initializing).
> > > The difference is really small in my test (~1%) in both memory and
> latency.
> > > So I think your patch is quite effective. I believe the ref count will
> very
> > > rarely be larger than 1 in reality, so I think the memory cost should be
> > > negligible after applying your patch.
> > > I was actually hesitating about the approach, primarily concerning the
> > > latency, because the bitmap scan still happens and for each 1 it will
> > > perform another bit test. However, with my test it seemed that the CPU
> cost
> > > was also negligible. I did check perf flamegraph and the bitmap
> operation
> > > didn't show as a significant part.
> > >
> > > So, your patch seems to me a good candidate solution for this particular
> > > problem. I am again surprised at the cluster-density 500 node test
> result
> > > you shared above. Could it be that the majority of the 66% increase in
> > > compute time and 25% increase in memory is from a different part of the
> > > change other than the ref count? Did you test with ref count completely
> > > removed (of course there is a correctness problem when the ref count
> would
> > > need to be > 1, but just ignore that for now for the performance
> > > evaluation).
> > >
> > > It would be great if Numan comes up with an approach that avoids the ref
> > > count and I am curious which direction it is. Initially I had the
> opinion
> > > that for I-P, we should avoid merging lflows' DP/DPGs, but instead
> directly
> > > create lflows for individual DPs or for DPGs generated beforehand (e.g.
> for
> > > LBs applied to multiple DPs), even if there may be some redundant flows
> > > generated. However, later after reviewing Numan's patch it seemed more
> > > generic, and I think this more generic approach is OK if the
> > > performance cost is acceptable, although it does add more complexity.
> >
> > Let me just briefly explain why I had to add the dp ref count in each
> > ovn_lflow.
> >
> >   - When a logical flow L(M, A) is added more than once (i.e. via
> >     ovn_lflow_add()) for the same datapath, we need to keep this ref
> >     count so that we don't delete the logical flow from the SB DB when
> >     it is actually still required.  I noticed this behavior with this
> >     commit [1].
> >
> >   - There is also another possibility.  There can be 2 load balancers
> >     lb1 and lb2 with the same VIP, and both these lbs are associated
> >     with the same datapaths, e.g.:
> >       ovn-nbctl ls-lb-add sw0 lb1
> >       ovn-nbctl ls-lb-add sw0 lb2
> >     In this case, ovn-northd would generate the exact same logical
> >     flows for the common VIP.  We need reference counting, otherwise
> >     if lb2 is removed from sw0 we may delete the lflows from the SB DB
> >     whereas they are still needed for lb1.  This would get fixed when
> >     ovn-northd recomputes, but until then it's a problem.  I added this
> >     test case here [2] specifically to cover this scenario (in v4).
>
> The motivation is clear. But what I mentioned was a different approach. In
> that approach, we don't care about adding the same lflow multiple times,
> provided that it happens rarely. When that happens, there will be multiple
> equivalent lflows, but each referenced by different objects. For lflows
> added with 'od', even if the 'od' is the same, they are added without
> checking if there is an old lflow. When the object referencing one of
> the lflows is deleted, the linked lflow is deleted accordingly, and the
> other 'copies' are still kept in the lflow table. Similarly, for lflows
> added with DPG, we don't try to merge the DPGs, and add them as is, even if
> the DPG may contain the same DPs. There will be duplicated lflows in SB
> (with different row uuids), but I believe ovn-controller can handle it
> properly.

I think your approach seems reasonable to me.

I'd perhaps take this approach:
   - For the same flow added multiple times with the 'od' set, use the
     dp_refcnt.  This doesn't happen often, and even if it does, the cost
     of maintaining the dp_refcnt will be minimal.

   - For flows added with a dp_group (mainly for lb_datapaths), as you
     suggested, use a DPG per lb_datapaths.

We can explore that and target 24.09 perhaps.
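
To make the first point above concrete, here is a rough sketch of what
maintaining a per-datapath refcount for duplicate adds could look like.
This is illustrative only (it assumes the OVS lib headers and is not the
actual lflow-mgr.c code), and it also folds in Ilya's idea of allocating
an entry only once the same lflow is added a second time for a datapath:

#include <stdbool.h>
#include <stdlib.h>
#include "bitmap.h"
#include "hash.h"
#include "openvswitch/hmap.h"
#include "util.h"

/* One entry per datapath for which the same lflow was added more than
 * once.  The common single-add case is tracked only by the dp group
 * bitmap bit, so no entry is allocated for it. */
struct dp_refcnt {
    struct hmap_node key_node;
    size_t dp_index;   /* Index of the datapath in the dp group bitmap. */
    size_t refcnt;     /* Total adds of this lflow for this datapath. */
};

static struct dp_refcnt *
dp_refcnt_find(struct hmap *dp_refcnts, size_t dp_index)
{
    struct dp_refcnt *r;
    HMAP_FOR_EACH_WITH_HASH (r, key_node, hash_int(dp_index, 0),
                             dp_refcnts) {
        if (r->dp_index == dp_index) {
            return r;
        }
    }
    return NULL;
}

static void
dp_refcnt_use(struct hmap *dp_refcnts, unsigned long *dpg_bitmap,
              size_t dp_index)
{
    if (!bitmap_is_set(dpg_bitmap, dp_index)) {
        /* First add for this datapath: the bitmap bit is enough. */
        bitmap_set1(dpg_bitmap, dp_index);
        return;
    }

    /* Duplicate add: track the extra reference explicitly. */
    struct dp_refcnt *r = dp_refcnt_find(dp_refcnts, dp_index);
    if (!r) {
        r = xmalloc(sizeof *r);
        r->dp_index = dp_index;
        r->refcnt = 1;          /* The reference held by the bitmap bit. */
        hmap_insert(dp_refcnts, &r->key_node, hash_int(dp_index, 0));
    }
    r->refcnt++;
}

/* Returns true if the bitmap bit for 'dp_index' can be cleared, i.e. the
 * released reference was the last one for this datapath. */
static bool
dp_refcnt_release(struct hmap *dp_refcnts, size_t dp_index)
{
    struct dp_refcnt *r = dp_refcnt_find(dp_refcnts, dp_index);
    if (!r) {
        return true;
    }
    if (--r->refcnt == 1) {
        /* Only the bitmap bit reference remains; drop the entry. */
        hmap_remove(dp_refcnts, &r->key_node);
        free(r);
    }
    return false;
}

With something like this, the common case costs only a bit in the dp
group bitmap, and an hmap entry is allocated only for the rare duplicate
adds.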


>
> However, I am not concluding that the approach is better (or worse) than
> the current one. It may be more friendly to I-P handling, but having
> duplications (even in rare situations) is not ideal, so these are pros and
> cons. If the ref count performance problem is resolved I think the current
> approach is very reasonable.
>
> >
> > I tried taking the approach of removing the dp ref count and falling
> > back to recompute when a logical flow's dp count (in the dp group
> > bitmap) doesn't match a refcount variable.  Here's the WIP patch:
> > https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05
> > I still need to run cluster density tests.
> >
> > IMO, Ilya's approach seems cleaner.  What do you all think?
> > Once the tests are complete, I'll share the results.
> >
> > And the test results from both Ilya and Han have shown a positive impact.
>
> I am ok with Ilya's patch. My tests showed that it solved the ref count's
> performance problem, but it is better to also confirm by the ovn-heater
> test to prove the same by comparing with a patch that simply removes the
> ref count, in case my test coverage is not sufficient.
>

I did some testing with my patch and these are the findings:

                      | Avg. Poll Intervals | Total test time | northd RSS
----------------------+---------------------+-----------------+-----------
Last week             |     1.5 seconds     |  1005 seconds   |   2.5 GB
This week             |      6 seconds      |  2246 seconds   |   8.5 GB
Ilya's patch          |     2.5 seconds     |  1170 seconds   |   3.1 GB
Numan's patch (run 1) |     1.6 seconds     |   992 seconds   |   2.43 GB
Numan's patch (run 2) |     1.7 seconds     |  1022 seconds   |   2.43 GB

It seems like removing the dp ref cnt altogether is more efficient and very
close to last week's results.
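
For reference, the core idea of my patch is roughly the following.  This
is a simplified sketch with illustrative names (the actual WIP commit I
linked earlier differs in detail):

#include <stdbool.h>
#include <stddef.h>
#include "bitmap.h"

struct lflow_sketch {
    unsigned long *dpg_bitmap;  /* One bit per datapath using this lflow. */
    size_t n_adds;              /* How many times the lflow was added. */
};

/* Releases the reference that 'dp_index' holds on 'lflow'.  Returns false
 * if the caller should fall back to a full recompute instead. */
static bool
lflow_sketch_unlink_dp(struct lflow_sketch *lflow, size_t dp_index,
                       size_t n_datapaths)
{
    size_t n_bits = bitmap_count1(lflow->dpg_bitmap, n_datapaths);

    if (lflow->n_adds != n_bits) {
        /* Some datapath added this lflow more than once, so we can't tell
         * whether this reference is the last one for 'dp_index'. */
        return false;
    }

    bitmap_set0(lflow->dpg_bitmap, dp_index);
    lflow->n_adds--;
    return true;
}

The trade-off is that duplicate adds, which the numbers above suggest are
rare, cost a full recompute instead of extra per-lflow state.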

I'll submit the patch tomorrow after some more testing.

Feel free to provide any review comments or feedback if you have any.

Thanks
Numan

> >
> > Regarding the cluster density 500 node test with Ilya's fix, I think
> > this seems OK to me since
> > the test falls back to recompute a lot and there is some cost now with
> > the lflow I-P patches.
> > This can be improved by adding port group I-P.   I can start looking
> > into the port-group I-P if you
> > think it would be worth it.  Let me know.
> >
>
> Adding more I-P would be helpful, but it still doesn't explain the
> observation from Ilya regarding the 66% increase in recompute time and 25%
> increase in memory. My test cases couldn't reproduce such significant
> increases (it is ~10-20% in my test). I think we'd better figure out what
> caused such big increases - is it still related to ref count or is it
> something else. I think the lflow_ref might have contributed to it,
> especially the use of hmap instead of list. It is totally fine if the
> lflow_ref adds some cost, which is required for the I-P to work, but it is
> better to understand where the cost comes from and justify it (or optimize
> it).

From my patch results, it looks like dp_ref_cnt is contributing to this
increase.  Is it possible for you to trigger another test with my patch?
https://github.com/numansiddique/ovn/commit/92f6563d9e7b1e6c8f2b924dea7a220781e10a05

Thanks
Numan

>
> Thanks,
> Han
>
> >
> >
> > [1] -
> https://github.com/ovn-org/ovn/commit/fe1c5df98b6f38a59dc5bdbba658a25effe32667
> > [2] - https://github.com/ovn-org/ovn/blob/main/tests/ovn-northd.at#L11507
> >
> > Thanks
> > Numan
> >
> > >
> > > For Dumitru's proposal:
> > > >> Maybe it would be an idea to integrate some of Han's performance
> testing
> > > >> scripts into the set of tests we already have in the upstream repo,
> > > >> ovn-performance.at [0], and run those in GitHub actions too.
> > >
> > > >> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
> > >
> > > >> Han, others, what do you think?
> > >
> > > I've been thinking about this for a long time but it wasn't
> prioritized. It
> > > was not easy for the reasons mentioned by Ilya. But I agree it is
> something
> > > we need to figure out. For memory increase it should be pretty reliable,
> > > but for CPU/latency, we might need to think about metrics that take into
> > > account the performance of the node that executes the test cases. Of
> course
> > > hardware resources (CPU and memory of the test node) vs. the scale we
> can
> > > test is another concern, but we may start with a scale that is not too
> big
> > > to run in GitHub Actions while big enough to provide good performance
> metrics.
> > >
> > > Thanks,
> > > Han
> > >
> > > > Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
