On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote: > > On 8/10/23 08:12, Ales Musil wrote: > > On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote: > > > >> Hi Ales, > >> > >> I have some high-level comments/questions about this patch. > >> > > > > Hi Mark, > > > > Hi Ales, Mark, > > > thank you for the review. See my answers inline below. > > > > > >> I have been privy to the conversations that led to this change. My > >> understanding is that by having ovn-northd wake up immediately, it can > >> be more CPU-intensive than waiting a bit for changes to accumulate and > >> handling all of those at once instead. However, nothing in either the > >> commit message or ovn-nb.xml explains what the purpose of this new > >> configuration option is. I think you should add a sentence or two to > >> explain why someone would want to enable this option. > >> > >> > > Yeah that's my bad, I have v2 prepared with some explanation in the commit > > message > > together with results from scale run. > > > > +1 we really need to explain why this change is needed. > > > > >> > >> Next, the algorithm used here strikes me as odd. We use the previous run > >> time of ovn-northd to determine how long to wait before running again. > >> This delay is capped by the configured backoff time. Let's say that > >> we've configured the backoff interval to be 200 ms. If ovn-northd has a > >> super quick run and only takes 10 ms, then we will only delay the next > >> run by 10 ms. IMO, this seems like it would not mitigate the original > >> issue by much, since we are only allowing a maximum of 20 ms (10 ms for > >> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. > >> Conversely, if northd has a huge recompute and it takes 5000 ms to > >> complete, then we would delay the next run by 200ms. In this case, > >> delaying at all seems like it's not necessary since we potentially have > >> 5000 ms worth of NB DB updates that have not been addressed. I would > >> have expected the opposite approach to be taken. If someone configures > >> 200ms as their backoff interval, I would expect us to always allow a > >> *minimum* of 200ms of NB changes to accumulate before running again. So > >> for instance, if northd runs quickly and is done in 10 ms, then we would > >> wait 200 - 10 = 190 ms before processing changes again. If northd takes > >> a long time to recompute and takes 5000 ms, then we would not wait at > >> all before processing changes again. Was the algorithm chosen based on > >> experimentation? Is it a well-known method I'm just not familiar with? > > I think the main assumption (that should probably be made explicit in > the commit log and/or documentation) is that on average changes happen > in a uniform way. This might not always be accurate. > > However, if we're off with the estimate, in the worst case we'd be > adding the configured max delay to the latency of processing changes. > So, as long as the value is not extremely high, the impact is not that > high either. > > Last but not least, as this value would be configured by the CMS, we > assume the CMS knows what they're doing. :) > > >> > > > > I'm not sure if the algorithm is well known. > > > > The thing is that at scale we almost always cap at the backoff so it has > > probably > > the same effect as your suggestion with the difference that we actually > > delay even > > after long runs. And that is actually desired, it's true that in the let's > > say 500 ms > > should be enough to accumulate more changes however that can lead to another > > 500 ms run and so on. That in the end means that northd will spin at 100% > > CPU > > anyway which is what we want to avoid. So from another point of view the > > accumulation > > of IDL changes is a secondary effect which is still desired, but not the > > main purpose. > > > > Also delaying by short time if the previous run was short is fine, you are > > right that we don't > > accumulate enough however during short running times there is a high chance > > that the > > northd would get to sleep anyway (We will help it to sleep at least a bit > > nevertheless). > > > > > >> > >> Next, I notice that you've added new poll_timer_wait() calls but haven't > >> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() > >> calls. Is there any danger of ovn-northd getting in a busy loop of > >> sleeping and waking because of this? I don't think it should, since > >> presumably ovsdb_idl_loop_run() should clear the conditions waited on by > >> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. > >> > > > > AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will process > > anything > > that it can and wait will be fine. The problem would be if we would skip the > > ovsdb_idl_loop_run() for some reason. > > > > > >> > >> Next, does this have any negative impact on our ability to perform > >> incremental processing in ovn-northd? My concern is that since we are > >> still running the ovsdb IDL loop that if multiple NB changes occur > >> during our delay, then we might have to fall back to a full recompute > >> instead of being able to incrementally process the changes. Are my > >> concerns valid? > >> > > > > I suppose that can happen if there are changes that could result in > > "conflict" > > and full recompute. During the test we haven't seen anything like that. > > The odds of that happening are small because as stated previously we are > > doing > > basically the same as if the engine was running for a long time always from > > the IDL > > point of view except that we give IDL a chance to process whatever has > > pilled up > > within the sleep period. > > > > > >> > >> Next, has scale testing shown that this change has made a positive > >> impact? If so, is there any recommendation for how to determine what to > >> configure the value to? > >> > >> > > It has a huge impact actually the value tested was 200 ms, the > > recommendation > > This was chosen based on the historical data from similar tests which > showed that the I-P engine was taking ~180-200 ms to run at scale. > > > would be < 500 ms. After that point the latency on components creation > > would be > > very noticable. I will put the recommendation into the ovn-nb.xml with the > > latency > > comment. Before I'll post v2 (which has the numbers in commit message) those > > are the test results: > > > > Run without any backoff period: > > northd aggregate CPU 9810% avg / 12765% max > > northd was spinning at 100% CPU the entire second half of the test. > > > > Run with 200 ms max backoff period: > > northd aggregate CPU 6066% avg / 7689% max > > northd was around 60% for the second half of the test > > > > > >> > >> Finally, is this change expected to be a long-term necessity? This > >> option seems to be useful for cases where northd recomputes are > >> required. Performing recomputes less frequently seems like it would > >> lower the CPU usage of ovn-northd while still processing the same amount > >> of changes. However, once northd can handle most changes incrementally, > >> is there still a benefit to delaying running? If each run of northd > >> handles all DB changes incrementally, then is there any point in putting > >> delays between those incremental runs? > >> > >> > > Ideally we won't need it in the future. However, the assumption for not > > needing > > anything like this is that northd will be fast enough to process I-P > > changes and > > be able to sleep between the next batch update arrives from CMS. That > > doesn't > > seem to happen in very near future, one thing to keep in mind is that > > testing > > happened with Numan's I-P for LBs and lflows which make a huge difference, > > but > > still not enough to achieve the mentioned northd state. So from my > > perspective > > it will be relevant for a few releases. And as stated above the point is to > > prevent > > northd to spin at 100% CPU all the time. > > > > +1 it's not the prettiest feature (and some might rightfully call it a > hack) but it seems to me like the cleanest alternative for now, until > northd processing is fully incremental.
In most cases it may be fine, but it might be a problem for a worst case scenario: Assume all the changes coming in NB can be incrementally processed but at a very very high rate, and ovn-northd keeps processing the changes incrementally. Since the change rate is so high, ovn-northd barely keeps up with the changes with 99% CPU load. For example, I-P for each object takes 10ms, and the change rate is 99 objects/sec. According to this algorithm, ovn-northd will always sleep for the maximum 200ms between each IDL run, and then ovn-northd would never keep up with the changes any more - the backlog will become longer and longer because of the wasted idle time. In practice this means decreased max throughput supported by ovn-northd, which probably needs to be called out as a tradeoff (in addition to the latency tradeoff). Thanks, Han > > Regards, > Dumitru > > > > >> > >> On 8/9/23 01:29, Ales Musil wrote: > >>> Add config option called "northd-backoff-interval-ms" that allows > >>> to delay northd engine runs capped by the config option. > >>> When the config option is set to 0 or unspecified, the engine > >>> will run without any restrictions. If the value >0 we will delay > >>> northd engine run by the previous run time capped by the > >>> config option. > >>> > >>> Signed-off-by: Ales Musil <amu...@redhat.com> > >>> --- > >>> NEWS | 2 ++ > >>> northd/inc-proc-northd.c | 23 ++++++++++++++++++++++- > >>> northd/inc-proc-northd.h | 3 ++- > >>> northd/ovn-northd.c | 9 +++++++-- > >>> ovn-nb.xml | 7 +++++++ > >>> 5 files changed, 40 insertions(+), 4 deletions(-) > >>> > >>> diff --git a/NEWS b/NEWS > >>> index 8275877f9..6109f13a2 100644 > >>> --- a/NEWS > >>> +++ b/NEWS > >>> @@ -10,6 +10,8 @@ Post v23.06.0 > >>> - To allow optimizing ovn-controller's monitor conditions for the > >> regular > >>> VIF case, ovn-controller now unconditionally monitors all sub-ports > >>> (ports with parent_port set). > >>> + - Add "northd-backoff-interval-ms" config option to delay northd > >> engine > >>> + runs capped at the set value. > >>> > >>> OVN v23.06.0 - 01 Jun 2023 > >>> -------------------------- > >>> diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c > >>> index d328deb22..87db50ad1 100644 > >>> --- a/northd/inc-proc-northd.c > >>> +++ b/northd/inc-proc-northd.c > >>> @@ -42,6 +42,8 @@ VLOG_DEFINE_THIS_MODULE(inc_proc_northd); > >>> > >>> static unixctl_cb_func chassis_features_list; > >>> > >>> +static int64_t next_northd_run_ms = 0; > >>> + > >>> #define NB_NODES \ > >>> NB_NODE(nb_global, "nb_global") \ > >>> NB_NODE(logical_switch, "logical_switch") \ > >>> @@ -295,8 +297,10 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb, > >>> /* Returns true if the incremental processing ended up updating nodes. > >> */ > >>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, > >>> struct ovsdb_idl_txn *ovnsb_txn, > >>> - bool recompute) { > >>> + bool recompute, uint32_t backoff_ms) { > >>> ovs_assert(ovnnb_txn && ovnsb_txn); > >>> + > >>> + int64_t start = time_msec(); > >>> engine_init_run(); > >>> > >>> /* Force a full recompute if instructed to, for example, after a > >> NB/SB > >>> @@ -330,6 +334,12 @@ bool inc_proc_northd_run(struct ovsdb_idl_txn > >> *ovnnb_txn, > >>> } else { > >>> engine_set_force_recompute(false); > >>> } > >>> + > >>> + int64_t now = time_msec(); > >>> + /* Postpone the next run by length of current run with maximum > >> capped > >>> + * by "northd-backoff-interval-ms" interval. */ > >>> + next_northd_run_ms = now + MIN(now - start, backoff_ms); > >>> + > >>> return engine_has_updated(); > >>> } > >>> > >>> @@ -339,6 +349,17 @@ void inc_proc_northd_cleanup(void) > >>> engine_set_context(NULL); > >>> } > >>> > >>> +bool > >>> +inc_proc_northd_can_run(bool recompute) > >>> +{ > >>> + if (recompute || time_msec() >= next_northd_run_ms) { > >>> + return true; > >>> + } > >>> + > >>> + poll_timer_wait_until(next_northd_run_ms); > >>> + return false; > >>> +} > >>> + > >>> static void > >>> chassis_features_list(struct unixctl_conn *conn, int argc OVS_UNUSED, > >>> const char *argv[] OVS_UNUSED, void *features_) > >>> diff --git a/northd/inc-proc-northd.h b/northd/inc-proc-northd.h > >>> index 9b81c7ee0..af418d7d7 100644 > >>> --- a/northd/inc-proc-northd.h > >>> +++ b/northd/inc-proc-northd.h > >>> @@ -10,7 +10,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb, > >>> struct ovsdb_idl_loop *sb); > >>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, > >>> struct ovsdb_idl_txn *ovnsb_txn, > >>> - bool recompute); > >>> + bool recompute, uint32_t backoff_ms); > >>> void inc_proc_northd_cleanup(void); > >>> +bool inc_proc_northd_can_run(bool recompute); > >>> > >>> #endif /* INC_PROC_NORTHD */ > >>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c > >>> index 4fa1b039e..3202b50a1 100644 > >>> --- a/northd/ovn-northd.c > >>> +++ b/northd/ovn-northd.c > >>> @@ -868,6 +868,7 @@ main(int argc, char *argv[]) > >>> /* Main loop. */ > >>> exiting = false; > >>> > >>> + uint32_t northd_backoff_ms = 0; > >>> bool recompute = false; > >>> while (!exiting) { > >>> update_ssl_config(); > >>> @@ -932,10 +933,12 @@ main(int argc, char *argv[]) > >>> > >>> if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) { > >>> bool activity = false; > >>> - if (ovnnb_txn && ovnsb_txn) { > >>> + if (ovnnb_txn && ovnsb_txn && > >>> + inc_proc_northd_can_run(recompute)) { > >>> int64_t loop_start_time = time_wall_msec(); > >>> activity = inc_proc_northd_run(ovnnb_txn, > >> ovnsb_txn, > >>> - recompute); > >>> + recompute, > >>> + northd_backoff_ms); > >>> recompute = false; > >>> check_and_add_supported_dhcp_opts_to_sb_db( > >>> ovnsb_txn, ovnsb_idl_loop.idl); > >>> @@ -1019,6 +1022,8 @@ main(int argc, char *argv[]) > >>> if (nb) { > >>> interval = smap_get_int(&nb->options, > >> "northd_probe_interval", > >>> interval); > >>> + northd_backoff_ms = smap_get_uint(&nb->options, > >>> + > >> "northd-backoff-interval-ms", 0); > >>> } > >>> set_idl_probe_interval(ovnnb_idl_loop.idl, ovnnb_db, interval); > >>> set_idl_probe_interval(ovnsb_idl_loop.idl, ovnsb_db, interval); > >>> diff --git a/ovn-nb.xml b/ovn-nb.xml > >>> index 4fbf4f7e5..115dfd536 100644 > >>> --- a/ovn-nb.xml > >>> +++ b/ovn-nb.xml > >>> @@ -349,6 +349,13 @@ > >>> of HWOL compatibility with GDP. > >>> </column> > >>> > >>> + <column name="options" key="northd-backoff-interval-ms"> > >>> + Maximum interval that the northd incremental engine is delayed > >> by > >>> + in milliseconds. Setting the value to nonzero delays the next > >> northd > >>> + engine run by the previous run time, capped by the specified > >> value. > >>> + If the value is zero the engine won't be delayed at all. > >>> + </column> > >>> + > >>> <group title="Options for configuring interconnection route > >> advertisement"> > >>> <p> > >>> These options control how routes are advertised between OVN > >> > >> > > Thanks, > > Ales > > > > _______________________________________________ > dev mailing list > d...@openvswitch.org > https://mail.openvswitch.org/mailman/listinfo/ovs-dev _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev