On Mon, Feb 5, 2024 at 10:15 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
>
> On 2/5/24 15:45, Ilya Maximets wrote:
> > On 2/5/24 11:34, Ilya Maximets wrote:
> >> On 2/5/24 09:23, Dumitru Ceara wrote:
> >>> On 2/5/24 08:13, Han Zhou wrote:
> >>>> On Sun, Feb 4, 2024 at 9:26 PM Numan Siddique <num...@ovn.org> wrote:
> >>>>>
> >>>>> On Sun, Feb 4, 2024 at 9:53 PM Han Zhou <hz...@ovn.org> wrote:
> >>>>>>
> >>>>>> On Sun, Feb 4, 2024 at 5:46 AM Ilya Maximets <i.maxim...@ovn.org> wrote:
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> 35 files changed, 9681 insertions(+), 4645 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> I had another look at this series and acked the remaining
> >>>>>>>>>> patches. I just had some minor comments that can be easily
> >>>>>>>>>> fixed when applying the patches to the main branch.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the work on this! It was a very large change
> >>>>>>>>>> but it improves northd performance significantly. I just hope
> >>>>>>>>>> we don't introduce too many bugs. Hopefully the time we have
> >>>>>>>>>> until release will allow us to further test this change on the
> >>>>>>>>>> 24.03 branch.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Dumitru
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Thanks a lot Dumitru and Han for the reviews and patience.
> >>>>>>>>>
> >>>>>>>>> I addressed the comments and applied the patches to main and
> >>>>>>>>> also to branch-24.03.
> >>>>>>>>>
> >>>>>>>>> @Han - I know you wanted to take another look into v6. I didn't
> >>>>>>>>> want to delay further as branch-24.03 was created. I'm more
> >>>>>>>>> than happy to submit follow-up patches if you have any comments
> >>>>>>>>> to address. Please let me know.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hi Numan,
> >>>>>>>>
> >>>>>>>> I was writing the reply and saw your email just now. Thanks a
> >>>>>>>> lot for taking a huge effort to achieve the great optimization.
> >>>>>>>> I only left one comment, on the implicit dependency left for
> >>>>>>>> en_lrnat -> en_lflow. Feel free to address it with a follow-up;
> >>>>>>>> no need to block the branching. And take my Ack for the series
> >>>>>>>> with that addressed.
> >>>>>>>>
> >>>>>>>> Acked-by: Han Zhou <hzhou at ovn.org>
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi, Numan, Dumitru and Han.
> >>>>>>>
> >>>>>>> I see a huge negative performance impact, most likely from this
> >>>>>>> set, on ovn-heater's cluster-density tests. The memory
> >>>>>>> consumption on northd
> >>>
> >>> Thanks for reporting this, Ilya!
> >>>
> >>>>>>> jumped about 4x and it constantly recomputes due to failures of
> >>>>>>> the port_group handler:
> >>>>>>>
> >>>>>>> 2024-02-03T11:09:12.441Z|01680|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9762ms
> >>>>>>> 2024-02-03T11:09:12.444Z|01681|timeval|WARN|Unreasonably long 9898ms poll interval (5969ms user, 1786ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:23.770Z|01690|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9014ms
> >>>>>>> 2024-02-03T11:09:23.773Z|01691|timeval|WARN|Unreasonably long 9118ms poll interval (5376ms user, 1515ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:36.692Z|01699|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 10695ms
> >>>>>>> 2024-02-03T11:09:36.696Z|01700|timeval|WARN|Unreasonably long 10890ms poll interval (6085ms user, 2745ms system)
> >>>>>>> ...
> >>>>>>> 2024-02-03T11:09:49.133Z|01708|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input port_group) took 9985ms
> >>>>>>> 2024-02-03T11:09:49.137Z|01709|timeval|WARN|Unreasonably long 10108ms poll interval (5521ms user, 2440ms system)
> >>>>>>>
> >>>>>>> That increases the 95% ovn-installed latency in the 500-node
> >>>>>>> cluster-density test from 3.6 seconds last week to 21.5 seconds
> >>>>>>> this week.
> >>>>>>>
> >>>>>>> I think this should be a release blocker.
> >>>>>>>
> >>>>>>> Memory usage is also very concerning. Unfortunately it is not
> >>>>>>> tied to the cluster-density test. The same 4-5x RSS jump is also
> >>>>>>> seen in other tests like density-heavy. Last week the RSS of
> >>>>>>> ovn-northd in cluster-density 500 node was between 1.5 and
> >>>>>>> 2.5 GB; this week we have a range between 5.5 and 8.5 GB.
> >>>>>>>
> >>>>>>> I would consider this a release blocker as well.
> >>>
> >>> I agree, we shouldn't release 24.03.0 unless these two issues are
> >>> (sufficiently) addressed. We do have until March 1st (official
> >>> release date) to do that or to revert any patches that cause
> >>> regressions.
> >>>
> >>>>>>>
> >>>>>>> I don't have direct evidence that this particular series is the
> >>>>>>> culprit, but it looks like the most likely candidate. I can dig
> >>>>>>> more into the investigation on Monday.
> >>>>>>>
> >>>>>>> Best regards, Ilya Maximets.
> >>>>>>
> >>>>>> Thanks Ilya for reporting this. A 95% latency and 4x RSS increase
> >>>>>> is a little surprising to me. I did test this series with my scale
> >>>>>> test scripts for recompute performance regression; it was a 10+%
> >>>>>> increase in latency. I even dug a little into it, and noticed a
> >>>>>> ~5% increase caused by the hmap used to maintain the lflows in
> >>>>>> each lflow_ref. This was discussed in the code review for an
> >>>>>> earlier version (v2/v3).
> >>>>>> Overall it did not look very bad, if we now handle the most common
> >>>>>> scenarios incrementally, and it is reasonable to have some cost
> >>>>>> for maintaining the references/index for incremental processing.
> >>>>>> I wonder if my test scenario was too simple (it didn't include
> >>>>>> LBs) to find the problems, so today I did another test by
> >>>>>> including a LB group with 1k LBs applied to 100 node-LS & GR, and
> >>>>>> another 1k LBs per node-LS & GR (101k LBs in total), and I did see
> >>>>>> more performance penalty, but still within ~20%. For memory I
> >>>>>> didn't notice a significant increase (<10%). I believe I am
> >>>>>> missing some specific scenario that had the big impact in the
> >>>>>> ovn-heater tests. Please share if you dig out more clues.
> >>>>>
> >>>>> Hi Ilya,
> >>>>>
> >>>>> Thanks for reporting these details.
> >>>>>
> >>>>> I had a look at this regression. There is a significant increase
> >>>>> in the lflow recompute time (around 4x) in my local testing, and
> >>>>> this is definitely not acceptable.
> >>>>>
> >>>>> In this particular cluster-density test, whenever a port group is
> >>>>> created it results in a full recompute, and now that the recompute
> >>>>> time has increased, it has a cumulative effect on the latency of
> >>>>> the test.
> >>>>>
> >>>>> The dp reference counting [1] added in v4 of this series has
> >>>>> introduced this regression (both CPU and memory). I'm working on
> >>>>> the fix and I think I should be able to address this soon.
> >>>>>
> >>>>> [1] - https://github.com/ovn-org/ovn/blob/main/northd/lflow-mgr.c#L696
> >>>>>
> >>>>>
> >>>> Indeed, the same concern I mentioned when reviewing v5:
> >>>> https://mail.openvswitch.org/pipermail/ovs-dev/2024-January/411194.html
> >>
> >> Yeah. In this cluster-density 500-node test, dp_refcnt_use()
> >> allocates 120 M references and puts them into hash maps. That alone
> >> takes several GB of memory.
> >> And it is the largest consumer according to massif; it takes ~45% of
> >> total memory allocations. Another ~25% is taken by the hash maps
> >> where the refcounts are stored.
> >>
> >> There is a simple optimization that can be made: do not allocate
> >> refcounts the first time we're trying to add them; the original
> >> bitmap should be enough. Only if we try to add one again while it is
> >> already in the bitmap, then allocate. That saves a lot of space,
> >> because 80% of all refcounts in that test are never used more than
> >> once. However, it might not be enough. Needs testing. Ideally the
> >> whole structure needs to go or be replaced by something different.
> >> Frequent iterations over the bitmaps and per-datapath*per-lflow
> >> memory allocations do not scale. Even if we can make it work
> >> reasonably well for 500 nodes, it will cause problems at higher
> >> scale.
> >
> > FWIW, I have a small prototype for this partial solution that seems
> > to noticeably decrease the memory usage here:
> > https://github.com/igsilya/ovn/commit/5ac3a78e287708fd571344c74a021e418fde31a4
> >
> > I need to run this through the scale run to get actual results though.
>
> So, the results for the cluster-density 500-node test are here:
>
>           | Avg. Poll Intervals | Total test time | northd RSS
> ----------+---------------------+-----------------+-----------
> Last week |         1.5 seconds |    1005 seconds |     2.5 GB
> This week |           6 seconds |    2246 seconds |     8.5 GB
> Patch     |         2.5 seconds |    1170 seconds |     3.1 GB
>
> So, the patch does help a lot. But we still have a 66% increase in the
> poll intervals on northd, i.e. a 66% increase in recompute time, and
> about a 25% increase in memory usage.
>
> I know that Numan is working on an alternative solution that removes
> the reference counting altogether, so I'll hold off sending an
> official patch. Removing the refcounting, if possible, sounds like a
> better approach in general.
>
Thanks Ilya. I also tested with your patch for my previous test case that
had a 50% increase in latency and memory, and compared with a patch that
completely removes the refcount logic but keeps the hmap in the lflow
(without really using it after initializing). The difference is really
small in my test (~1%) in both memory and latency, so I think your patch
is quite effective. I believe the refcount will very rarely be larger
than 1 in reality, so the memory cost should be negligible after applying
your patch. I was actually hesitant about the approach, primarily
concerning the latency, because the bitmap scan still happens and for
each set bit it performs another bit test. However, in my test the CPU
cost also seemed negligible; I checked a perf flamegraph and the bitmap
operations didn't show up as a significant part. So your patch seems to
me a good candidate solution for this particular problem.

I am again surprised at the cluster-density 500-node test result you
shared above. Could it be that the majority of the 66% increase in
compute time and the 25% increase in memory comes from a different part
of the change, other than the refcount? Did you test with the refcount
completely removed? (Of course there is a correctness problem when the
refcount would need to be > 1, but ignore that for now for the
performance evaluation.)

It would be great if Numan comes up with an approach that avoids the
refcount, and I am curious which direction it is. Initially I was of the
opinion that for I-P we should avoid merging lflows' DPs/DPGs and instead
directly create lflows for individual DPs, or for DPGs generated
beforehand (e.g. for LBs applied to multiple DPs), even if some redundant
flows may be generated. However, after reviewing Numan's patch it seemed
more generic, and I think this more generic approach is fine if the
performance cost is acceptable, although it does add more complexity.
For Dumitru's proposal:

>> Maybe it would be an idea to integrate some of Han's performance testing
>> scripts into the set of tests we already have in the upstream repo,
>> ovn-performance.at [0], and run those in GitHub actions too.
>>
>> [0] https://github.com/ovn-org/ovn/blob/main/tests/ovn-performance.at
>>
>> Han, others, what do you think?

I've been thinking about this for a long time but it wasn't prioritized.
It was not easy, for the reasons mentioned by Ilya, but I agree it is
something we need to figure out. For the memory increase it should be
pretty reliable, but for CPU/latency we might need to think about metrics
that take into account the performance of the node that executes the test
cases. Of course, hardware resources (CPU and memory of the test node)
vs. the scale we can test is another concern, but we may start with a
scale that is not too big to run in GitHub Actions while big enough to
provide good performance metrics.

Thanks,
Han

> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev