On Fri, Aug 11, 2023 at 6:07 AM Dumitru Ceara <dce...@redhat.com> wrote: > > On 8/10/23 18:38, Ilya Maximets wrote: > > On 8/10/23 17:34, Dumitru Ceara wrote: > >> On 8/10/23 17:20, Han Zhou wrote: > >>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote: > >>>> > >>>> On 8/10/23 15:34, Han Zhou wrote: > >>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote: > >>>>>> > >>>>>> On 8/10/23 08:12, Ales Musil wrote: > >>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com > > >>>>> wrote: > >>>>>>> > >>>>>>>> Hi Ales, > >>>>>>>> > >>>>>>>> I have some high-level comments/questions about this patch. > >>>>>>>> > >>>>>>> > >>>>>>> Hi Mark, > >>>>>>> > >>>>>> > >>>>>> Hi Ales, Mark, > >>>>>> > >>>>>>> thank you for the review. See my answers inline below. > >>>>>>> > >>>>>>> > >>>>>>>> I have been privy to the conversations that led to this change. My > >>>>>>>> understanding is that by having ovn-northd wake up immediately, it > >>> can > >>>>>>>> be more CPU-intensive than waiting a bit for changes to accumulate > >>> and > >>>>>>>> handling all of those at once instead. However, nothing in either the > >>>>>>>> commit message or ovn-nb.xml explains what the purpose of this new > >>>>>>>> configuration option is. I think you should add a sentence or two to > >>>>>>>> explain why someone would want to enable this option. > >>>>>>>> > >>>>>>>> > >>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in the > >>>>> commit > >>>>>>> message > >>>>>>> together with results from scale run. > >>>>>>> > >>>>>> > >>>>>> +1 we really need to explain why this change is needed. > >>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Next, the algorithm used here strikes me as odd. We use the previous > >>>>> run > >>>>>>>> time of ovn-northd to determine how long to wait before running > >>> again. > >>>>>>>> This delay is capped by the configured backoff time. Let's say that > >>>>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd > >>> has a > >>>>>>>> super quick run and only takes 10 ms, then we will only delay the > >>> next > >>>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the original > >>>>>>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms > >>> for > >>>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. > >>>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms to > >>>>>>>> complete, then we would delay the next run by 200ms. In this case, > >>>>>>>> delaying at all seems like it's not necessary since we potentially > >>> have > >>>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I would > >>>>>>>> have expected the opposite approach to be taken. If someone > >>> configures > >>>>>>>> 200ms as their backoff interval, I would expect us to always allow a > >>>>>>>> *minimum* of 200ms of NB changes to accumulate before running again. > >>> So > >>>>>>>> for instance, if northd runs quickly and is done in 10 ms, then we > >>>>> would > >>>>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd > >>> takes > >>>>>>>> a long time to recompute and takes 5000 ms, then we would not wait at > >>>>>>>> all before processing changes again. Was the algorithm chosen based > >>> on > >>>>>>>> experimentation? Is it a well-known method I'm just not familiar > >>> with? > >>>>>> > >>>>>> I think the main assumption (that should probably be made explicit in > >>>>>> the commit log and/or documentation) is that on average changes happen > >>>>>> in a uniform way. This might not always be accurate. > >>>>>> > >>>>>> However, if we're off with the estimate, in the worst case we'd be > >>>>>> adding the configured max delay to the latency of processing changes. > >>>>>> So, as long as the value is not extremely high, the impact is not that > >>>>>> high either. > >>>>>> > >>>>>> Last but not least, as this value would be configured by the CMS, we > >>>>>> assume the CMS knows what they're doing. :) > >>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> I'm not sure if the algorithm is well known. > >>>>>>> > >>>>>>> The thing is that at scale we almost always cap at the backoff so it > >>> has > >>>>>>> probably > >>>>>>> the same effect as your suggestion with the difference that we > >>> actually > >>>>>>> delay even > >>>>>>> after long runs. And that is actually desired, it's true that in the > >>>>> let's > >>>>>>> say 500 ms > >>>>>>> should be enough to accumulate more changes however that can lead to > >>>>> another > >>>>>>> 500 ms run and so on. That in the end means that northd will spin at > >>>>> 100% > >>>>>>> CPU > >>>>>>> anyway which is what we want to avoid. So from another point of view > >>> the > >>>>>>> accumulation > >>>>>>> of IDL changes is a secondary effect which is still desired, but not > >>> the > >>>>>>> main purpose. > >>>>>>> > >>>>>>> Also delaying by short time if the previous run was short is fine, you > >>>>> are > >>>>>>> right that we don't > >>>>>>> accumulate enough however during short running times there is a high > >>>>> chance > >>>>>>> that the > >>>>>>> northd would get to sleep anyway (We will help it to sleep at least a > >>>>> bit > >>>>>>> nevertheless). > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but > >>>>> haven't > >>>>>>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() > >>>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop of > >>>>>>>> sleeping and waking because of this? I don't think it should, since > >>>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on > >>>>> by > >>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. > >>>>>>>> > >>>>>>> > >>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will > >>> process > >>>>>>> anything > >>>>>>> that it can and wait will be fine. The problem would be if we would > >>>>> skip the > >>>>>>> ovsdb_idl_loop_run() for some reason. > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Next, does this have any negative impact on our ability to perform > >>>>>>>> incremental processing in ovn-northd? My concern is that since we are > >>>>>>>> still running the ovsdb IDL loop that if multiple NB changes occur > >>>>>>>> during our delay, then we might have to fall back to a full recompute > >>>>>>>> instead of being able to incrementally process the changes. Are my > >>>>>>>> concerns valid? > >>>>>>>> > >>>>>>> > >>>>>>> I suppose that can happen if there are changes that could result in > >>>>>>> "conflict" > >>>>>>> and full recompute. During the test we haven't seen anything like > >>> that. > >>>>>>> The odds of that happening are small because as stated previously we > >>> are > >>>>>>> doing > >>>>>>> basically the same as if the engine was running for a long time always > >>>>> from > >>>>>>> the IDL > >>>>>>> point of view except that we give IDL a chance to process whatever has > >>>>>>> pilled up > >>>>>>> within the sleep period. > >>>>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> Next, has scale testing shown that this change has made a positive > >>>>>>>> impact? If so, is there any recommendation for how to determine what > >>> to > >>>>>>>> configure the value to? > >>>>>>>> > >>>>>>>> > >>>>>>> It has a huge impact actually the value tested was 200 ms, the > >>>>>>> recommendation > >>>>>> > >>>>>> This was chosen based on the historical data from similar tests which > >>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale. > >>>>>> > >>>>>>> would be < 500 ms. After that point the latency on components creation > >>>>>>> would be > >>>>>>> very noticable. I will put the recommendation into the ovn-nb.xml with > >>>>> the > >>>>>>> latency > >>>>>>> comment. Before I'll post v2 (which has the numbers in commit message) > >>>>> those > >>>>>>> are the test results: > >>>>>>> > >>>>>>> Run without any backoff period: > >>>>>>> northd aggregate CPU 9810% avg / 12765% max > >>>>>>> northd was spinning at 100% CPU the entire second half of the test. > >>>>>>> > >>>>>>> Run with 200 ms max backoff period: > >>>>>>> northd aggregate CPU 6066% avg / 7689% max > >>>>>>> northd was around 60% for the second half of the test > >>>>>>>
Ales, could you clarify the version of OVN and the type of test for the above result? Is it on main branch or with Numans patches for LB I-P? Does the test include LB changes? Are there recomputes during the test? Was the 100% CPU during the second half of the test caused by recompute or by cost of I-P? > > >>>>>>> > >>>>>>>> > >>>>>>>> Finally, is this change expected to be a long-term necessity? This > >>>>>>>> option seems to be useful for cases where northd recomputes are > >>>>>>>> required. Performing recomputes less frequently seems like it would > >>>>>>>> lower the CPU usage of ovn-northd while still processing the same > >>>>> amount > >>>>>>>> of changes. However, once northd can handle most changes > >>> incrementally, > >>>>>>>> is there still a benefit to delaying running? If each run of northd > >>>>>>>> handles all DB changes incrementally, then is there any point in > >>>>> putting > >>>>>>>> delays between those incremental runs? > >>>>>>>> > >>>>>>>> > >>>>>>> Ideally we won't need it in the future. However, the assumption for > >>> not > >>>>>>> needing > >>>>>>> anything like this is that northd will be fast enough to process I-P > >>>>>>> changes and > >>>>>>> be able to sleep between the next batch update arrives from CMS. That > >>>>>>> doesn't > >>>>>>> seem to happen in very near future, one thing to keep in mind is that > >>>>>>> testing > >>>>>>> happened with Numan's I-P for LBs and lflows which make a huge > >>>>> difference, > >>>>>>> but > >>>>>>> still not enough to achieve the mentioned northd state. So from my > >>>>>>> perspective > >>>>>>> it will be relevant for a few releases. And as stated above the point > >>>>> is to > >>>>>>> prevent > >>>>>>> northd to spin at 100% CPU all the time. > >>>>>>> > >>>>>> > >>>>>> +1 it's not the prettiest feature (and some might rightfully call it a > >>>>>> hack) but it seems to me like the cleanest alternative for now, until > >>>>>> northd processing is fully incremental. > >>>>> > >>>>> In most cases it may be fine, but it might be a problem for a worst case > >>>>> scenario: > >>>>> > >>>>> Assume all the changes coming in NB can be incrementally processed but > >>> at > >>>>> a very very high rate, and ovn-northd keeps processing the changes > >>>>> incrementally. Since the change rate is so high, ovn-northd barely > >>> keeps up > >>>>> with the changes with 99% CPU load. For example, I-P for each object > >>> takes > >>>>> 10ms, and the change rate is 99 objects/sec. According to this > >>> algorithm, > >>>>> ovn-northd will always sleep for the maximum 200ms between each IDL run, > >>>>> and then ovn-northd would never keep up with the changes any more - the > >>>>> backlog will become longer and longer because of the wasted idle time. > >>>>> > >>>> > >>>> IDL runs are not skipped. Just I-P engine runs. So I think this won't > >>>> be a problem, or am I missing something? > >>> > >>> Sorry about the typo, I meant to say between each "engine run" instead of > >>> "IDL run". IDL run is not skipped, but the backlog (accumulated changes in > >>> IDL) becomes longer and longer. E.g.: > >>> > >>> (assume change rate is 100 object/sec) > >>> > >>> run-1: handles 1 object, takes 10ms, sleep 10ms > >>> run-2: handles 2 objects, takes 20ms, sleep 20ms > >>> run-3: handles 4 objects, takes 40ms, sleep 40ms > >>> run-4: handles 8 objects, takes 80ms, sleep 80ms > >>> run-5: handles 16 objects, takes 160ms, sleep 160ms > >>> run-6: handles 32 objects, takes 320ms, sleep 200ms > >>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep > >>> 200ms > >>> run-8: handles 72 objects, takes 720ms, sleep 200ms > >>> run-9: handles 92 objects, takes 920ms, sleep 200ms > >>> ... > >>> As we can see the backlog grows indefinitely if the input keeps changing at > >>> the rate of 100 obj/s. > >>> > >> > >> I see now, thanks for the explanation. But isn't this possible today > >> too? Maybe less probable though. > >> > >> Also, we'll probably hit other issues (e.g., timeouts on CMS side) > >> because of the backlog which will (likely?) throttle the CMS. > >> > >> What would be a good solution? It seems hard to define what a "large > >> backlog" is. If we could do that, maybe triggering a full recompute > >> when the backlog is large enough might be OK. > > > > Hmm. That's an interesting problem. > > From my understanding, the cost of incremental processing run consists > > of two parts: some more or less constant engine run cost E and a cost > > of processing actual changes P. Unless CMS is changing the same value > > over and over again, we can't optimize P by accumulating more changes > > per run. But we can amortize the more or less constant cost E. > > > > We had a similar problem with the full recompute in the past. If we > > assume the cost of a recompute R to be more or less constant regardless > > of amount of changes, then we can reduce the total cost by aggregating > > more changes per run. That's what run_idl_loop() function is trying > > to do. It runs IDL for as long as there are changes (capped at 500ms). > > > > We can't really avoid paying the cost P and it will compound with the > > increased number of changes. The only solution here, AFAICT, is to > > fall back to full recompute once P > R. Maybe that can be done by > > somehow calculating the number of tracked vs real rows in IDL. > > Current ovn-northd has this problem and introduction of delays will > > amplify it. > > > > That is still hard to do unfortunately. The ratio of IDL tracked vs > "real" rows is still not a good enough indication. It's possible that > northd can quickly incrementally process row updates for table T1 but > that it has to do intensive computations for row updates for table T2. > > A simpler (dumb?) approach is to just periodically recompute. But that > shouldn't happen too often because then we start using a lot of CPU > again. :) > > > > > The problem this patch is trying to address, IIUC, is that the cost E > > is still high. So, I'm wondering if the solution can be similar to > > what we already have for a full recompute. My assumption was that the cost E is relatively small in the I-P. I thought the patch was trying to help when there were still many recomputes triggered. Please correct me if I was wrong. I posted some questions regarding Ales's test result, to get a better understanding. Thanks, Han >> Current run_idl_loop() > > runs the IDL only while there are changes and it will not continue > > running it if the next update is delayed even by a very short time > > interval. What we could potentially do here is to wait for some > > very short time interval (maybe configurable or maybe not) and check > > the IDL for changes again. This will allow us to better batch updates > > and amortize constant engine run cost E without need to sleep for > > arbitrary 200ms. (We will sleep for arbitrary 1-10ms at a time, but > > that seems less arbitrary. :D ). > > I think we still need to set a max limit for this sequence of short > sleeps and that max limit needs to be configurable by the user because > it may increase end-to-end latency for propagating NB changes to > ovn-controller. > > In that case I think I prefer an explicit "arbitrary" max backoff set by > the user like this patch initially proposed. But I'm not completely > against other options. > > > > > Something like (for illustration purposes; not tested): > > > > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > > index 4fa1b039e..54e4ecb5f 100644 > > --- a/northd/ovn-northd.c > > +++ b/northd/ovn-northd.c > > @@ -689,22 +689,39 @@ run_idl_loop(struct ovsdb_idl_loop *idl_loop, const char *name) > > unsigned long long duration, start = time_msec(); > > unsigned int seqno = UINT_MAX; > > struct ovsdb_idl_txn *txn; > > + int n_before_sleep = -1; > > int n = 0; > > > > /* Accumulate database changes as long as there are some, > > * but no longer than half a second. */ > > - while (seqno != ovsdb_idl_get_seqno(idl_loop->idl) > > - && time_msec() - start < 500) { > > - seqno = ovsdb_idl_get_seqno(idl_loop->idl); > > - ovsdb_idl_run(idl_loop->idl); > > - n++; > > + for (;;) { > > + while (seqno != ovsdb_idl_get_seqno(idl_loop->idl) > > + && time_msec() - start < 500) { > > + seqno = ovsdb_idl_get_seqno(idl_loop->idl); > > + ovsdb_idl_run(idl_loop->idl); > > + n++; > > + } > > + /* If we're not out of time yet, try to sleep for a short 10ms > > + * in case we'll have more updates. Don't sleep again if there > > + * were no updates after the previous short sleep. */ > > + if (n > n_before_sleep + 1 && time_msec() - start < 500) { > > + n_before_sleep = n; > > + poll_timer_wait(10); > > + ovsdb_idl_wait(idl_loop->idl); > > + poll_block(); > > + /* Reset seqno, so we try to run IDL at least one more time. */ > > + seqno = UINT_MAX; > > + } else { > > + /* Out of time or no updates since the last sleep. */ > > + break; > > + } > > } > > > > txn = ovsdb_idl_loop_run(idl_loop); > > > > duration = time_msec() - start; > > - /* ovsdb_idl_run() is called at least 2 times. Once directly and > > - * once in the ovsdb_idl_loop_run(). n > 2 means that we received > > + /* ovsdb_idl_run() is called at least 3 times. Once directly and > > + * once in the ovsdb_idl_loop_run(). n > 3 means that we received > > * data on at least 2 subsequent calls. */ > > if (n > 2 || duration > 100) { > > VLOG(duration > 500 ? VLL_INFO : VLL_DBG, > > --- > > > > I'm not sure how something like this will impact the total CPU usage. > > Hopefully, these short sleeps can allow accumulating more changes and > > amortize the constant costs better while also reducing perceived CPU > > usage. Needs testing, for sure. One other advantage of such a solution > > is the code locality, i.e. all the logic being in one place. > > > > Thoughts? > > > > Best regards, Ilya Maximets. > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev