On Thu, Aug 10, 2023 at 8:34 AM Dumitru Ceara <dce...@redhat.com> wrote: > > On 8/10/23 17:20, Han Zhou wrote: > > On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote: > >> > >> On 8/10/23 15:34, Han Zhou wrote: > >>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote: > >>>> > >>>> On 8/10/23 08:12, Ales Musil wrote: > >>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> > >>> wrote: > >>>>> > >>>>>> Hi Ales, > >>>>>> > >>>>>> I have some high-level comments/questions about this patch. > >>>>>> > >>>>> > >>>>> Hi Mark, > >>>>> > >>>> > >>>> Hi Ales, Mark, > >>>> > >>>>> thank you for the review. See my answers inline below. > >>>>> > >>>>> > >>>>>> I have been privy to the conversations that led to this change. My > >>>>>> understanding is that by having ovn-northd wake up immediately, it > > can > >>>>>> be more CPU-intensive than waiting a bit for changes to accumulate > > and > >>>>>> handling all of those at once instead. However, nothing in either the > >>>>>> commit message or ovn-nb.xml explains what the purpose of this new > >>>>>> configuration option is. I think you should add a sentence or two to > >>>>>> explain why someone would want to enable this option. > >>>>>> > >>>>>> > >>>>> Yeah that's my bad, I have v2 prepared with some explanation in the > >>> commit > >>>>> message > >>>>> together with results from scale run. > >>>>> > >>>> > >>>> +1 we really need to explain why this change is needed. > >>>> > >>>>> > >>>>>> > >>>>>> Next, the algorithm used here strikes me as odd. We use the previous > >>> run > >>>>>> time of ovn-northd to determine how long to wait before running > > again. > >>>>>> This delay is capped by the configured backoff time. Let's say that > >>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd > > has a > >>>>>> super quick run and only takes 10 ms, then we will only delay the > > next > >>>>>> run by 10 ms. IMO, this seems like it would not mitigate the original > >>>>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms > > for > >>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. > >>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms to > >>>>>> complete, then we would delay the next run by 200ms. In this case, > >>>>>> delaying at all seems like it's not necessary since we potentially > > have > >>>>>> 5000 ms worth of NB DB updates that have not been addressed. I would > >>>>>> have expected the opposite approach to be taken. If someone > > configures > >>>>>> 200ms as their backoff interval, I would expect us to always allow a > >>>>>> *minimum* of 200ms of NB changes to accumulate before running again. > > So > >>>>>> for instance, if northd runs quickly and is done in 10 ms, then we > >>> would > >>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd > > takes > >>>>>> a long time to recompute and takes 5000 ms, then we would not wait at > >>>>>> all before processing changes again. Was the algorithm chosen based > > on > >>>>>> experimentation? Is it a well-known method I'm just not familiar > > with? > >>>> > >>>> I think the main assumption (that should probably be made explicit in > >>>> the commit log and/or documentation) is that on average changes happen > >>>> in a uniform way. This might not always be accurate. > >>>> > >>>> However, if we're off with the estimate, in the worst case we'd be > >>>> adding the configured max delay to the latency of processing changes. > >>>> So, as long as the value is not extremely high, the impact is not that > >>>> high either. > >>>> > >>>> Last but not least, as this value would be configured by the CMS, we > >>>> assume the CMS knows what they're doing. :) > >>>> > >>>>>> > >>>>> > >>>>> I'm not sure if the algorithm is well known. > >>>>> > >>>>> The thing is that at scale we almost always cap at the backoff so it > > has > >>>>> probably > >>>>> the same effect as your suggestion with the difference that we > > actually > >>>>> delay even > >>>>> after long runs. And that is actually desired, it's true that in the > >>> let's > >>>>> say 500 ms > >>>>> should be enough to accumulate more changes however that can lead to > >>> another > >>>>> 500 ms run and so on. That in the end means that northd will spin at > >>> 100% > >>>>> CPU > >>>>> anyway which is what we want to avoid. So from another point of view > > the > >>>>> accumulation > >>>>> of IDL changes is a secondary effect which is still desired, but not > > the > >>>>> main purpose. > >>>>> > >>>>> Also delaying by short time if the previous run was short is fine, you > >>> are > >>>>> right that we don't > >>>>> accumulate enough however during short running times there is a high > >>> chance > >>>>> that the > >>>>> northd would get to sleep anyway (We will help it to sleep at least a > >>> bit > >>>>> nevertheless). > >>>>> > >>>>> > >>>>>> > >>>>>> Next, I notice that you've added new poll_timer_wait() calls but > >>> haven't > >>>>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() > >>>>>> calls. Is there any danger of ovn-northd getting in a busy loop of > >>>>>> sleeping and waking because of this? I don't think it should, since > >>>>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on > >>> by > >>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. > >>>>>> > >>>>> > >>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will > > process > >>>>> anything > >>>>> that it can and wait will be fine. The problem would be if we would > >>> skip the > >>>>> ovsdb_idl_loop_run() for some reason. > >>>>> > >>>>> > >>>>>> > >>>>>> Next, does this have any negative impact on our ability to perform > >>>>>> incremental processing in ovn-northd? My concern is that since we are > >>>>>> still running the ovsdb IDL loop that if multiple NB changes occur > >>>>>> during our delay, then we might have to fall back to a full recompute > >>>>>> instead of being able to incrementally process the changes. Are my > >>>>>> concerns valid? > >>>>>> > >>>>> > >>>>> I suppose that can happen if there are changes that could result in > >>>>> "conflict" > >>>>> and full recompute. During the test we haven't seen anything like > > that. > >>>>> The odds of that happening are small because as stated previously we > > are > >>>>> doing > >>>>> basically the same as if the engine was running for a long time always > >>> from > >>>>> the IDL > >>>>> point of view except that we give IDL a chance to process whatever has > >>>>> pilled up > >>>>> within the sleep period. > >>>>> > >>>>> > >>>>>> > >>>>>> Next, has scale testing shown that this change has made a positive > >>>>>> impact? If so, is there any recommendation for how to determine what > > to > >>>>>> configure the value to? > >>>>>> > >>>>>> > >>>>> It has a huge impact actually the value tested was 200 ms, the > >>>>> recommendation > >>>> > >>>> This was chosen based on the historical data from similar tests which > >>>> showed that the I-P engine was taking ~180-200 ms to run at scale. > >>>> > >>>>> would be < 500 ms. After that point the latency on components creation > >>>>> would be > >>>>> very noticable. I will put the recommendation into the ovn-nb.xml with > >>> the > >>>>> latency > >>>>> comment. Before I'll post v2 (which has the numbers in commit message) > >>> those > >>>>> are the test results: > >>>>> > >>>>> Run without any backoff period: > >>>>> northd aggregate CPU 9810% avg / 12765% max > >>>>> northd was spinning at 100% CPU the entire second half of the test. > >>>>> > >>>>> Run with 200 ms max backoff period: > >>>>> northd aggregate CPU 6066% avg / 7689% max > >>>>> northd was around 60% for the second half of the test > >>>>> > >>>>> > >>>>>> > >>>>>> Finally, is this change expected to be a long-term necessity? This > >>>>>> option seems to be useful for cases where northd recomputes are > >>>>>> required. Performing recomputes less frequently seems like it would > >>>>>> lower the CPU usage of ovn-northd while still processing the same > >>> amount > >>>>>> of changes. However, once northd can handle most changes > > incrementally, > >>>>>> is there still a benefit to delaying running? If each run of northd > >>>>>> handles all DB changes incrementally, then is there any point in > >>> putting > >>>>>> delays between those incremental runs? > >>>>>> > >>>>>> > >>>>> Ideally we won't need it in the future. However, the assumption for > > not > >>>>> needing > >>>>> anything like this is that northd will be fast enough to process I-P > >>>>> changes and > >>>>> be able to sleep between the next batch update arrives from CMS. That > >>>>> doesn't > >>>>> seem to happen in very near future, one thing to keep in mind is that > >>>>> testing > >>>>> happened with Numan's I-P for LBs and lflows which make a huge > >>> difference, > >>>>> but > >>>>> still not enough to achieve the mentioned northd state. So from my > >>>>> perspective > >>>>> it will be relevant for a few releases. And as stated above the point > >>> is to > >>>>> prevent > >>>>> northd to spin at 100% CPU all the time. > >>>>> > >>>> > >>>> +1 it's not the prettiest feature (and some might rightfully call it a > >>>> hack) but it seems to me like the cleanest alternative for now, until > >>>> northd processing is fully incremental. > >>> > >>> In most cases it may be fine, but it might be a problem for a worst case > >>> scenario: > >>> > >>> Assume all the changes coming in NB can be incrementally processed but > > at > >>> a very very high rate, and ovn-northd keeps processing the changes > >>> incrementally. Since the change rate is so high, ovn-northd barely > > keeps up > >>> with the changes with 99% CPU load. For example, I-P for each object > > takes > >>> 10ms, and the change rate is 99 objects/sec. According to this > > algorithm, > >>> ovn-northd will always sleep for the maximum 200ms between each IDL run, > >>> and then ovn-northd would never keep up with the changes any more - the > >>> backlog will become longer and longer because of the wasted idle time. > >>> > >> > >> IDL runs are not skipped. Just I-P engine runs. So I think this won't > >> be a problem, or am I missing something? > > > > Sorry about the typo, I meant to say between each "engine run" instead of > > "IDL run". IDL run is not skipped, but the backlog (accumulated changes in > > IDL) becomes longer and longer. E.g.: > > > > (assume change rate is 100 object/sec) > > > > run-1: handles 1 object, takes 10ms, sleep 10ms > > run-2: handles 2 objects, takes 20ms, sleep 20ms > > run-3: handles 4 objects, takes 40ms, sleep 40ms > > run-4: handles 8 objects, takes 80ms, sleep 80ms > > run-5: handles 16 objects, takes 160ms, sleep 160ms > > run-6: handles 32 objects, takes 320ms, sleep 200ms > > run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep > > 200ms > > run-8: handles 72 objects, takes 720ms, sleep 200ms > > run-9: handles 92 objects, takes 920ms, sleep 200ms > > ... > > As we can see the backlog grows indefinitely if the input keeps changing at > > the rate of 100 obj/s. > > > > I see now, thanks for the explanation. But isn't this possible today > too? Maybe less probable though. >
It is about performance penalty. Today let's say the throughput of ovn-northd can handle is 100 obj/s, with the 200ms max delay configured, it can only handle 80 obj/s. > Also, we'll probably hit other issues (e.g., timeouts on CMS side) > because of the backlog which will (likely?) throttle the CMS. > If there is back-pressure to CMS then the CMS throughput is degraded in the same way. > What would be a good solution? It seems hard to define what a "large > backlog" is. If we could do that, maybe triggering a full recompute > when the backlog is large enough might be OK. > Yes, in such a scenario, full recompute is a better solution because it is guaranteed to keep up (with certain latency). But I think it is tricky to make the decision when we should fallback to recompute. I think our goal is to implement a really fast I-P, so that with any reasonable input change rate it is still faster than full recompute. However it is not the case today (but I believe it can be improved). So I think this patch is for a temporary solution before the goal is implemented, to make a tradeoff between latency and throughput v.s. CPU load. With this in mind, I think it is ok to just call out the side-effects in the documentation for both latency and throughput, so that CMS/user can make a better judgement. I am not against the "triggering a full recompute when backlog is large enough" strategy, but not sure how practical it is to implement. Thanks, Han > What do you think? > > Thanks, > Dumitru > > > Thanks, > > Han > > > >> > >> Regards, > >> Dumitru > >> > >>> In practice this means decreased max throughput supported by ovn-northd, > >>> which probably needs to be called out as a tradeoff (in addition to the > >>> latency tradeoff). > >>> > >>> Thanks, > >>> Han > >>> > >>>> > >>>> Regards, > >>>> Dumitru > >>>> > >>>>> > >>>>>> > >>>>>> On 8/9/23 01:29, Ales Musil wrote: > >>>>>>> Add config option called "northd-backoff-interval-ms" that allows > >>>>>>> to delay northd engine runs capped by the config option. > >>>>>>> When the config option is set to 0 or unspecified, the engine > >>>>>>> will run without any restrictions. If the value >0 we will delay > >>>>>>> northd engine run by the previous run time capped by the > >>>>>>> config option. > >>>>>>> > >>>>>>> Signed-off-by: Ales Musil <amu...@redhat.com> > >>>>>>> --- > >>>>>>> NEWS | 2 ++ > >>>>>>> northd/inc-proc-northd.c | 23 ++++++++++++++++++++++- > >>>>>>> northd/inc-proc-northd.h | 3 ++- > >>>>>>> northd/ovn-northd.c | 9 +++++++-- > >>>>>>> ovn-nb.xml | 7 +++++++ > >>>>>>> 5 files changed, 40 insertions(+), 4 deletions(-) > >>>>>>> > >>>>>>> diff --git a/NEWS b/NEWS > >>>>>>> index 8275877f9..6109f13a2 100644 > >>>>>>> --- a/NEWS > >>>>>>> +++ b/NEWS > >>>>>>> @@ -10,6 +10,8 @@ Post v23.06.0 > >>>>>>> - To allow optimizing ovn-controller's monitor conditions for > > the > >>>>>> regular > >>>>>>> VIF case, ovn-controller now unconditionally monitors all > >>> sub-ports > >>>>>>> (ports with parent_port set). > >>>>>>> + - Add "northd-backoff-interval-ms" config option to delay northd > >>>>>> engine > >>>>>>> + runs capped at the set value. > >>>>>>> > >>>>>>> OVN v23.06.0 - 01 Jun 2023 > >>>>>>> -------------------------- > >>>>>>> diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c > >>>>>>> index d328deb22..87db50ad1 100644 > >>>>>>> --- a/northd/inc-proc-northd.c > >>>>>>> +++ b/northd/inc-proc-northd.c > >>>>>>> @@ -42,6 +42,8 @@ VLOG_DEFINE_THIS_MODULE(inc_proc_northd); > >>>>>>> > >>>>>>> static unixctl_cb_func chassis_features_list; > >>>>>>> > >>>>>>> +static int64_t next_northd_run_ms = 0; > >>>>>>> + > >>>>>>> #define NB_NODES \ > >>>>>>> NB_NODE(nb_global, "nb_global") \ > >>>>>>> NB_NODE(logical_switch, "logical_switch") \ > >>>>>>> @@ -295,8 +297,10 @@ void inc_proc_northd_init(struct ovsdb_idl_loop > >>> *nb, > >>>>>>> /* Returns true if the incremental processing ended up updating > >>> nodes. > >>>>>> */ > >>>>>>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, > >>>>>>> struct ovsdb_idl_txn *ovnsb_txn, > >>>>>>> - bool recompute) { > >>>>>>> + bool recompute, uint32_t backoff_ms) { > >>>>>>> ovs_assert(ovnnb_txn && ovnsb_txn); > >>>>>>> + > >>>>>>> + int64_t start = time_msec(); > >>>>>>> engine_init_run(); > >>>>>>> > >>>>>>> /* Force a full recompute if instructed to, for example, > > after a > >>>>>> NB/SB > >>>>>>> @@ -330,6 +334,12 @@ bool inc_proc_northd_run(struct ovsdb_idl_txn > >>>>>> *ovnnb_txn, > >>>>>>> } else { > >>>>>>> engine_set_force_recompute(false); > >>>>>>> } > >>>>>>> + > >>>>>>> + int64_t now = time_msec(); > >>>>>>> + /* Postpone the next run by length of current run with maximum > >>>>>> capped > >>>>>>> + * by "northd-backoff-interval-ms" interval. */ > >>>>>>> + next_northd_run_ms = now + MIN(now - start, backoff_ms); > >>>>>>> + > >>>>>>> return engine_has_updated(); > >>>>>>> } > >>>>>>> > >>>>>>> @@ -339,6 +349,17 @@ void inc_proc_northd_cleanup(void) > >>>>>>> engine_set_context(NULL); > >>>>>>> } > >>>>>>> > >>>>>>> +bool > >>>>>>> +inc_proc_northd_can_run(bool recompute) > >>>>>>> +{ > >>>>>>> + if (recompute || time_msec() >= next_northd_run_ms) { > >>>>>>> + return true; > >>>>>>> + } > >>>>>>> + > >>>>>>> + poll_timer_wait_until(next_northd_run_ms); > >>>>>>> + return false; > >>>>>>> +} > >>>>>>> + > >>>>>>> static void > >>>>>>> chassis_features_list(struct unixctl_conn *conn, int argc > >>> OVS_UNUSED, > >>>>>>> const char *argv[] OVS_UNUSED, void > >>> *features_) > >>>>>>> diff --git a/northd/inc-proc-northd.h b/northd/inc-proc-northd.h > >>>>>>> index 9b81c7ee0..af418d7d7 100644 > >>>>>>> --- a/northd/inc-proc-northd.h > >>>>>>> +++ b/northd/inc-proc-northd.h > >>>>>>> @@ -10,7 +10,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop > > *nb, > >>>>>>> struct ovsdb_idl_loop *sb); > >>>>>>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, > >>>>>>> struct ovsdb_idl_txn *ovnsb_txn, > >>>>>>> - bool recompute); > >>>>>>> + bool recompute, uint32_t backoff_ms); > >>>>>>> void inc_proc_northd_cleanup(void); > >>>>>>> +bool inc_proc_northd_can_run(bool recompute); > >>>>>>> > >>>>>>> #endif /* INC_PROC_NORTHD */ > >>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > >>>>>>> index 4fa1b039e..3202b50a1 100644 > >>>>>>> --- a/northd/ovn-northd.c > >>>>>>> +++ b/northd/ovn-northd.c > >>>>>>> @@ -868,6 +868,7 @@ main(int argc, char *argv[]) > >>>>>>> /* Main loop. */ > >>>>>>> exiting = false; > >>>>>>> > >>>>>>> + uint32_t northd_backoff_ms = 0; > >>>>>>> bool recompute = false; > >>>>>>> while (!exiting) { > >>>>>>> update_ssl_config(); > >>>>>>> @@ -932,10 +933,12 @@ main(int argc, char *argv[]) > >>>>>>> > >>>>>>> if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) { > >>>>>>> bool activity = false; > >>>>>>> - if (ovnnb_txn && ovnsb_txn) { > >>>>>>> + if (ovnnb_txn && ovnsb_txn && > >>>>>>> + inc_proc_northd_can_run(recompute)) { > >>>>>>> int64_t loop_start_time = time_wall_msec(); > >>>>>>> activity = inc_proc_northd_run(ovnnb_txn, > >>>>>> ovnsb_txn, > >>>>>>> - recompute); > >>>>>>> + recompute, > >>>>>>> + > >>> northd_backoff_ms); > >>>>>>> recompute = false; > >>>>>>> check_and_add_supported_dhcp_opts_to_sb_db( > >>>>>>> ovnsb_txn, ovnsb_idl_loop.idl); > >>>>>>> @@ -1019,6 +1022,8 @@ main(int argc, char *argv[]) > >>>>>>> if (nb) { > >>>>>>> interval = smap_get_int(&nb->options, > >>>>>> "northd_probe_interval", > >>>>>>> interval); > >>>>>>> + northd_backoff_ms = smap_get_uint(&nb->options, > >>>>>>> + > >>>>>> "northd-backoff-interval-ms", 0); > >>>>>>> } > >>>>>>> set_idl_probe_interval(ovnnb_idl_loop.idl, ovnnb_db, > >>> interval); > >>>>>>> set_idl_probe_interval(ovnsb_idl_loop.idl, ovnsb_db, > >>> interval); > >>>>>>> diff --git a/ovn-nb.xml b/ovn-nb.xml > >>>>>>> index 4fbf4f7e5..115dfd536 100644 > >>>>>>> --- a/ovn-nb.xml > >>>>>>> +++ b/ovn-nb.xml > >>>>>>> @@ -349,6 +349,13 @@ > >>>>>>> of HWOL compatibility with GDP. > >>>>>>> </column> > >>>>>>> > >>>>>>> + <column name="options" key="northd-backoff-interval-ms"> > >>>>>>> + Maximum interval that the northd incremental engine is > >>> delayed > >>>>>> by > >>>>>>> + in milliseconds. Setting the value to nonzero delays the > > next > >>>>>> northd > >>>>>>> + engine run by the previous run time, capped by the > > specified > >>>>>> value. > >>>>>>> + If the value is zero the engine won't be delayed at all. > >>>>>>> + </column> > >>>>>>> + > >>>>>>> <group title="Options for configuring interconnection route > >>>>>> advertisement"> > >>>>>>> <p> > >>>>>>> These options control how routes are advertised between > >>> OVN > >>>>>> > >>>>>> > >>>>> Thanks, > >>>>> Ales > >>>>> > >>>> > >>>> _______________________________________________ > >>>> dev mailing list > >>>> d...@openvswitch.org > >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > >>> > >> > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev