On Mon, Aug 14, 2023 at 4:46 AM Dumitru Ceara <dce...@redhat.com> wrote: > > On 8/12/23 07:08, Han Zhou wrote: > > On Fri, Aug 11, 2023 at 6:07 AM Dumitru Ceara <dce...@redhat.com> wrote: > >> > >> On 8/10/23 18:38, Ilya Maximets wrote: > >>> On 8/10/23 17:34, Dumitru Ceara wrote: > >>>> On 8/10/23 17:20, Han Zhou wrote: > >>>>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> > > wrote: > >>>>>> > >>>>>> On 8/10/23 15:34, Han Zhou wrote: > >>>>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> > > wrote: > >>>>>>>> > >>>>>>>> On 8/10/23 08:12, Ales Musil wrote: > >>>>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson < mmich...@redhat.com > >> > >>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> Hi Ales, > >>>>>>>>>> > >>>>>>>>>> I have some high-level comments/questions about this patch. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> Hi Mark, > >>>>>>>>> > >>>>>>>> > >>>>>>>> Hi Ales, Mark, > >>>>>>>> > >>>>>>>>> thank you for the review. See my answers inline below. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> I have been privy to the conversations that led to this change. > > My > >>>>>>>>>> understanding is that by having ovn-northd wake up immediately, > > it > >>>>> can > >>>>>>>>>> be more CPU-intensive than waiting a bit for changes to > > accumulate > >>>>> and > >>>>>>>>>> handling all of those at once instead. However, nothing in > > either the > >>>>>>>>>> commit message or ovn-nb.xml explains what the purpose of this > > new > >>>>>>>>>> configuration option is. I think you should add a sentence or > > two to > >>>>>>>>>> explain why someone would want to enable this option. > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in > > the > >>>>>>> commit > >>>>>>>>> message > >>>>>>>>> together with results from scale run. > >>>>>>>>> > >>>>>>>> > >>>>>>>> +1 we really need to explain why this change is needed. 
> >>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Next, the algorithm used here strikes me as odd. We use the > > previous > >>>>>>> run > >>>>>>>>>> time of ovn-northd to determine how long to wait before running > >>>>> again. > >>>>>>>>>> This delay is capped by the configured backoff time. Let's say > > that > >>>>>>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd > >>>>> has a > >>>>>>>>>> super quick run and only takes 10 ms, then we will only delay the > >>>>> next > >>>>>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the > > original > >>>>>>>>>> issue by much, since we are only allowing a maximum of 20 ms (10 > > ms > >>>>> for > >>>>>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. > >>>>>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms > > to > >>>>>>>>>> complete, then we would delay the next run by 200ms. In this > > case, > >>>>>>>>>> delaying at all seems like it's not necessary since we > > potentially > >>>>> have > >>>>>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I > > would > >>>>>>>>>> have expected the opposite approach to be taken. If someone > >>>>> configures > >>>>>>>>>> 200ms as their backoff interval, I would expect us to always > > allow a > >>>>>>>>>> *minimum* of 200ms of NB changes to accumulate before running > > again. > >>>>> So > >>>>>>>>>> for instance, if northd runs quickly and is done in 10 ms, then > > we > >>>>>>> would > >>>>>>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd > >>>>> takes > >>>>>>>>>> a long time to recompute and takes 5000 ms, then we would not > > wait at > >>>>>>>>>> all before processing changes again. Was the algorithm chosen > > based > >>>>> on > >>>>>>>>>> experimentation? Is it a well-known method I'm just not familiar > >>>>> with? 
> >>>>>>>> > >>>>>>>> I think the main assumption (that should probably be made explicit > > in > >>>>>>>> the commit log and/or documentation) is that on average changes > > happen > >>>>>>>> in a uniform way. This might not always be accurate. > >>>>>>>> > >>>>>>>> However, if we're off with the estimate, in the worst case we'd be > >>>>>>>> adding the configured max delay to the latency of processing > > changes. > >>>>>>>> So, as long as the value is not extremely high, the impact is not > > that > >>>>>>>> high either. > >>>>>>>> > >>>>>>>> Last but not least, as this value would be configured by the CMS, > > we > >>>>>>>> assume the CMS knows what they're doing. :) > >>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> I'm not sure if the algorithm is well known. > >>>>>>>>> > >>>>>>>>> The thing is that at scale we almost always cap at the backoff so > > it > >>>>> has > >>>>>>>>> probably > >>>>>>>>> the same effect as your suggestion with the difference that we > >>>>> actually > >>>>>>>>> delay even > >>>>>>>>> after long runs. And that is actually desired, it's true that in > > the > >>>>>>> let's > >>>>>>>>> say 500 ms > >>>>>>>>> should be enough to accumulate more changes however that can lead > > to > >>>>>>> another > >>>>>>>>> 500 ms run and so on. That in the end means that northd will spin > > at > >>>>>>> 100% > >>>>>>>>> CPU > >>>>>>>>> anyway which is what we want to avoid. So from another point of > > view > >>>>> the > >>>>>>>>> accumulation > >>>>>>>>> of IDL changes is a secondary effect which is still desired, but > > not > >>>>> the > >>>>>>>>> main purpose. 
> >>>>>>>>> > >>>>>>>>> Also delaying by short time if the previous run was short is > > fine, you > >>>>>>> are > >>>>>>>>> right that we don't > >>>>>>>>> accumulate enough however during short running times there is a > > high > >>>>>>> chance > >>>>>>>>> that the > >>>>>>>>> northd would get to sleep anyway (We will help it to sleep at > > least a > >>>>>>> bit > >>>>>>>>> nevertheless). > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but > >>>>>>> haven't > >>>>>>>>>> changed the ovsdb_idl_loop_run() or > > ovsdb_idl_loop_commit_and_wait() > >>>>>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop > > of > >>>>>>>>>> sleeping and waking because of this? I don't think it should, > > since > >>>>>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions > > waited on > >>>>>>> by > >>>>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will > >>>>> process > >>>>>>>>> anything > >>>>>>>>> that it can and wait will be fine. The problem would be if we > > would > >>>>>>> skip the > >>>>>>>>> ovsdb_idl_loop_run() for some reason. > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> Next, does this have any negative impact on our ability to > > perform > >>>>>>>>>> incremental processing in ovn-northd? My concern is that since > > we are > >>>>>>>>>> still running the ovsdb IDL loop that if multiple NB changes > > occur > >>>>>>>>>> during our delay, then we might have to fall back to a full > > recompute > >>>>>>>>>> instead of being able to incrementally process the changes. Are > > my > >>>>>>>>>> concerns valid? > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> I suppose that can happen if there are changes that could result > > in > >>>>>>>>> "conflict" > >>>>>>>>> and full recompute. During the test we haven't seen anything like > >>>>> that. 
The odds of that happening are small because, as stated previously, we
> >>>>>>>>> are doing basically the same as if the engine was running for a long
> >>>>>>>>> time, always from the IDL point of view, except that we give the IDL
> >>>>>>>>> a chance to process whatever has piled up within the sleep period.
> >>>>>>>>>
> >>>>>>>>>> Next, has scale testing shown that this change has made a positive
> >>>>>>>>>> impact? If so, is there any recommendation for how to determine
> >>>>>>>>>> what to configure the value to?
> >>>>>>>>>
> >>>>>>>>> It has a huge impact actually; the value tested was 200 ms, the
> >>>>>>>>> recommendation
> >>>>>>>>
> >>>>>>>> This was chosen based on the historical data from similar tests which
> >>>>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
> >>>>>>>>
> >>>>>>>>> would be < 500 ms. After that point the latency on component creation
> >>>>>>>>> would be very noticeable. I will put the recommendation into
> >>>>>>>>> ovn-nb.xml with the latency comment. Before I post v2 (which has the
> >>>>>>>>> numbers in the commit message), these are the test results:
> >>>>>>>>>
> >>>>>>>>> Run without any backoff period:
> >>>>>>>>> northd aggregate CPU 9810% avg / 12765% max
> >>>>>>>>> northd was spinning at 100% CPU the entire second half of the test.
> >>>>>>>>>
> >>>>>>>>> Run with 200 ms max backoff period:
> >>>>>>>>> northd aggregate CPU 6066% avg / 7689% max
> >>>>>>>>> northd was around 60% for the second half of the test.
> >
> > Ales, could you clarify the version of OVN and the type of test for the
> > above result? Is it on the main branch or with Numan's patches for LB
> > I-P? Does the test include LB changes? Are there recomputes during the
> > test?
Was the 100% CPU during the second half of the test caused by recompute
> > or by the cost of I-P?
>
> I'll try to reply to these but Ales, please correct me if I'm wrong.
>
> Version used for testing: OVN main + Numan's "LB + lflow I-P v5".
>
> The test includes LB changes; it brings up a 120-node OpenShift cluster
> using the latest OVN-IC support that was added recently to
> ovn-kubernetes. It runs one "zone per node".
>
> The test then creates 14K services and 28K pods distributed across
> 80 namespaces as fast as possible. The test measures lots of
> things but we were interested in average and P99 time for pods to
> be "Ready" (networking ready) and CPU usage for ovn-northd on all
> nodes.
>
> As far as I understand there were still occasional recomputes
> triggered by missing/failing handlers, e.g. sync_to_sb_lb or lflow
> missing a handler for SB.LB changes. In any case, the CMS (ovn-kube)
> was continuously adding ports/LBs/resources to NB, as fast as
> possible; northd was waking up to process those.
> > An example of logs generated by northd I-P while at scale:
> >
> > 2023-08-08T19:12:27.394Z|1149658|inc_proc_eng|DBG|node: lflow, handler for input northd took 1ms
> > 2023-08-08T19:12:27.552Z|1149766|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms
> > 2023-08-08T19:12:27.553Z|1149779|inc_proc_eng|DBG|node: sync_to_sb_addr_set, recompute (missing handler for input northd) took 1ms
> > 2023-08-08T19:12:27.684Z|1149782|inc_proc_eng|DBG|node: sync_to_sb_lb, recompute (missing handler for input SB_load_balancer) took 131ms
> > 2023-08-08T19:12:27.685Z|1149794|inc_proc_eng|DBG|node: lflow, handler for input northd took 1ms
> > 2023-08-08T19:12:27.846Z|1149893|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms
> > 2023-08-08T19:12:27.848Z|1149907|inc_proc_eng|DBG|node: sync_to_sb_addr_set, recompute (missing handler for input northd) took 1ms
> >>
> >>>>>>>>>> Finally, is this change expected to be a long-term necessity? This
> >>>>>>>>>> option seems to be useful for cases where northd recomputes are
> >>>>>>>>>> required. Performing recomputes less frequently seems like it would
> >>>>>>>>>> lower the CPU usage of ovn-northd while still processing the same
> >>>>>>>>>> amount of changes. However, once northd can handle most changes
> >>>>>>>>>> incrementally, is there still a benefit to delaying running? If
> >>>>>>>>>> each run of northd handles all DB changes incrementally, then is
> >>>>>>>>>> there any point in putting delays between those incremental runs?
> >>>>>>>>>
> >>>>>>>>> Ideally we won't need it in the future. However, the assumption for
> >>>>>>>>> not needing anything like this is that northd will be fast enough
> >>>>>>>>> to process I-P changes and be able to sleep until the next batch
> >>>>>>>>> update arrives from the CMS.
> > That > >>>>>>>>> doesn't > >>>>>>>>> seem to happen in very near future, one thing to keep in mind is > > that > >>>>>>>>> testing > >>>>>>>>> happened with Numan's I-P for LBs and lflows which make a huge > >>>>>>> difference, > >>>>>>>>> but > >>>>>>>>> still not enough to achieve the mentioned northd state. So from my > >>>>>>>>> perspective > >>>>>>>>> it will be relevant for a few releases. And as stated above the > > point > >>>>>>> is to > >>>>>>>>> prevent > >>>>>>>>> northd to spin at 100% CPU all the time. > >>>>>>>>> > >>>>>>>> > >>>>>>>> +1 it's not the prettiest feature (and some might rightfully call > > it a > >>>>>>>> hack) but it seems to me like the cleanest alternative for now, > > until > >>>>>>>> northd processing is fully incremental. > >>>>>>> > >>>>>>> In most cases it may be fine, but it might be a problem for a worst > > case > >>>>>>> scenario: > >>>>>>> > >>>>>>> Assume all the changes coming in NB can be incrementally processed > > but > >>>>> at > >>>>>>> a very very high rate, and ovn-northd keeps processing the changes > >>>>>>> incrementally. Since the change rate is so high, ovn-northd barely > >>>>> keeps up > >>>>>>> with the changes with 99% CPU load. For example, I-P for each object > >>>>> takes > >>>>>>> 10ms, and the change rate is 99 objects/sec. According to this > >>>>> algorithm, > >>>>>>> ovn-northd will always sleep for the maximum 200ms between each IDL > > run, > >>>>>>> and then ovn-northd would never keep up with the changes any more - > > the > >>>>>>> backlog will become longer and longer because of the wasted idle > > time. > >>>>>>> > >>>>>> > >>>>>> IDL runs are not skipped. Just I-P engine runs. So I think this > > won't > >>>>>> be a problem, or am I missing something? > >>>>> > >>>>> Sorry about the typo, I meant to say between each "engine run" > > instead of > >>>>> "IDL run". IDL run is not skipped, but the backlog (accumulated > > changes in > >>>>> IDL) becomes longer and longer. 
E.g.:
> >>>>>
> >>>>> (assume change rate is 100 object/sec)
> >>>>>
> >>>>> run-1: handles 1 object, takes 10ms, sleep 10ms
> >>>>> run-2: handles 2 objects, takes 20ms, sleep 20ms
> >>>>> run-3: handles 4 objects, takes 40ms, sleep 40ms
> >>>>> run-4: handles 8 objects, takes 80ms, sleep 80ms
> >>>>> run-5: handles 16 objects, takes 160ms, sleep 160ms
> >>>>> run-6: handles 32 objects, takes 320ms, sleep 200ms
> >>>>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep 200ms
> >>>>> run-8: handles 72 objects, takes 720ms, sleep 200ms
> >>>>> run-9: handles 92 objects, takes 920ms, sleep 200ms
> >>>>> ...
> >>>>> As we can see, the backlog grows indefinitely if the input keeps
> >>>>> changing at the rate of 100 obj/s.
> >>>>
> >>>> I see now, thanks for the explanation. But isn't this possible today
> >>>> too? Maybe less probable though.
> >>>>
> >>>> Also, we'll probably hit other issues (e.g., timeouts on CMS side)
> >>>> because of the backlog which will (likely?) throttle the CMS.
> >>>>
> >>>> What would be a good solution? It seems hard to define what a "large
> >>>> backlog" is. If we could do that, maybe triggering a full recompute
> >>>> when the backlog is large enough might be OK.
> >>>
> >>> Hmm. That's an interesting problem.
> >>> From my understanding, the cost of an incremental processing run
> >>> consists of two parts: some more or less constant engine run cost E and
> >>> a cost of processing the actual changes P. Unless the CMS is changing
> >>> the same value over and over again, we can't optimize P by accumulating
> >>> more changes per run. But we can amortize the more or less constant
> >>> cost E.
> >>>
> >>> We had a similar problem with the full recompute in the past. If we
> >>> assume the cost of a recompute R to be more or less constant regardless
> >>> of the amount of changes, then we can reduce the total cost by
> >>> aggregating more changes per run.
That's what run_idl_loop() function is trying > >>> to do. It runs IDL for as long as there are changes (capped at 500ms). > >>> > >>> We can't really avoid paying the cost P and it will compound with the > >>> increased number of changes. The only solution here, AFAICT, is to > >>> fall back to full recompute once P > R. Maybe that can be done by > >>> somehow calculating the number of tracked vs real rows in IDL. > >>> Current ovn-northd has this problem and introduction of delays will > >>> amplify it. > >>> > >> > >> That is still hard to do unfortunately. The ratio of IDL tracked vs > >> "real" rows is still not a good enough indication. It's possible that > >> northd can quickly incrementally process row updates for table T1 but > >> that it has to do intensive computations for row updates for table T2. > >> > >> A simpler (dumb?) approach is to just periodically recompute. But that > >> shouldn't happen too often because then we start using a lot of CPU > >> again. :) > >> > >>> > >>> The problem this patch is trying to address, IIUC, is that the cost E > >>> is still high. So, I'm wondering if the solution can be similar to > >>> what we already have for a full recompute. > > > > My assumption was that the cost E is relatively small in the I-P. I thought > > the patch was trying to help when there were still many recomputes > > triggered. Please correct me if I was wrong. I posted some questions > > regarding Ales's test result, to get a better understanding. > > I thought E was the "constant" non-incremental cost. In any case, > P (cost for I-P to successfully run) is also not negligible, e.g., > even a successful I-P execution of the lb_data handler for NB.LB_Group > changes took 13ms at scale: > > 2023-08-08T19:12:27.552Z|1149766|inc_proc_eng|DBG|node: lb_data, handler for input NB_load_balancer_group took 13ms
My point is, the constant cost in the I-P handlers is relatively small. The constant cost mainly comes from the recomputes or partial recomputes, e.g. when "northd" node is I-P but "lflow" or "sync_to_sb_lb" recomputes. If the definition of E includes the cost of recompute of nodes during the engine execution, I think we are on the same page. That part is what the delay approach would help the most. But if E only includes the constant part of an I-P handler, e.g. initializing/destroying some data structures, those shouldn't be a significant cost. And yes, the cost of some I-P handlers is not negligible, but I don't think the delay approach would really help in that case. Does this make sense? Thanks, Han > > > Regards, > Dumitru > > > > > Thanks, > > Han > > > >>> Current run_idl_loop() > >>> runs the IDL only while there are changes and it will not continue > >>> running it if the next update is delayed even by a very short time > >>> interval. What we could potentially do here is to wait for some > >>> very short time interval (maybe configurable or maybe not) and check > >>> the IDL for changes again. This will allow us to better batch updates > >>> and amortize constant engine run cost E without need to sleep for > >>> arbitrary 200ms. (We will sleep for arbitrary 1-10ms at a time, but > >>> that seems less arbitrary. :D ). > >> > >> I think we still need to set a max limit for this sequence of short > >> sleeps and that max limit needs to be configurable by the user because > >> it may increase end-to-end latency for propagating NB changes to > >> ovn-controller. > >> > >> In that case I think I prefer an explicit "arbitrary" max backoff set by > >> the user like this patch initially proposed. But I'm not completely > >> against other options. 
> >>
> >>> Something like (for illustration purposes; not tested):
> >>>
> >>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> >>> index 4fa1b039e..54e4ecb5f 100644
> >>> --- a/northd/ovn-northd.c
> >>> +++ b/northd/ovn-northd.c
> >>> @@ -689,22 +689,39 @@ run_idl_loop(struct ovsdb_idl_loop *idl_loop, const char *name)
> >>>      unsigned long long duration, start = time_msec();
> >>>      unsigned int seqno = UINT_MAX;
> >>>      struct ovsdb_idl_txn *txn;
> >>> +    int n_before_sleep = -1;
> >>>      int n = 0;
> >>>
> >>>      /* Accumulate database changes as long as there are some,
> >>>       * but no longer than half a second. */
> >>> -    while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> >>> -           && time_msec() - start < 500) {
> >>> -        seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> >>> -        ovsdb_idl_run(idl_loop->idl);
> >>> -        n++;
> >>> +    for (;;) {
> >>> +        while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> >>> +               && time_msec() - start < 500) {
> >>> +            seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> >>> +            ovsdb_idl_run(idl_loop->idl);
> >>> +            n++;
> >>> +        }
> >>> +        /* If we're not out of time yet, try to sleep for a short 10ms
> >>> +         * in case we'll have more updates. Don't sleep again if there
> >>> +         * were no updates after the previous short sleep. */
> >>> +        if (n > n_before_sleep + 1 && time_msec() - start < 500) {
> >>> +            n_before_sleep = n;
> >>> +            poll_timer_wait(10);
> >>> +            ovsdb_idl_wait(idl_loop->idl);
> >>> +            poll_block();
> >>> +            /* Reset seqno, so we try to run IDL at least one more time. */
> >>> +            seqno = UINT_MAX;
> >>> +        } else {
> >>> +            /* Out of time or no updates since the last sleep. */
> >>> +            break;
> >>> +        }
> >>>      }
> >>>
> >>>      txn = ovsdb_idl_loop_run(idl_loop);
> >>>
> >>>      duration = time_msec() - start;
> >>> -    /* ovsdb_idl_run() is called at least 2 times. Once directly and
> >>> -     * once in the ovsdb_idl_loop_run(). n > 2 means that we received
> >>> +    /* ovsdb_idl_run() is called at least 3 times. Once directly and
> >>> +     * once in the ovsdb_idl_loop_run(). n > 3 means that we received
> >>>       * data on at least 2 subsequent calls. */
> >>>      if (n > 2 || duration > 100) {
> >>>          VLOG(duration > 500 ? VLL_INFO : VLL_DBG,
> >>> ---
> >>>
> >>> I'm not sure how something like this will impact the total CPU usage.
> >>> Hopefully, these short sleeps can allow accumulating more changes and
> >>> amortize the constant costs better while also reducing perceived CPU
> >>> usage. Needs testing, for sure. One other advantage of such a solution
> >>> is the code locality, i.e. all the logic being in one place.
> >>>
> >>> Thoughts?
> >>>
> >>> Best regards, Ilya Maximets.

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev