On 8/10/23 15:34, Han Zhou wrote: > On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote: >> >> On 8/10/23 08:12, Ales Musil wrote: >>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> > wrote: >>> >>>> Hi Ales, >>>> >>>> I have some high-level comments/questions about this patch. >>>> >>> >>> Hi Mark, >>> >> >> Hi Ales, Mark, >> >>> thank you for the review. See my answers inline below. >>> >>> >>>> I have been privy to the conversations that led to this change. My >>>> understanding is that by having ovn-northd wake up immediately, it can >>>> be more CPU-intensive than waiting a bit for changes to accumulate and >>>> handling all of those at once instead. However, nothing in either the >>>> commit message or ovn-nb.xml explains what the purpose of this new >>>> configuration option is. I think you should add a sentence or two to >>>> explain why someone would want to enable this option. >>>> >>>> >>> Yeah that's my bad, I have v2 prepared with some explanation in the > commit >>> message >>> together with results from scale run. >>> >> >> +1 we really need to explain why this change is needed. >> >>> >>>> >>>> Next, the algorithm used here strikes me as odd. We use the previous > run >>>> time of ovn-northd to determine how long to wait before running again. >>>> This delay is capped by the configured backoff time. Let's say that >>>> we've configured the backoff interval to be 200 ms. If ovn-northd has a >>>> super quick run and only takes 10 ms, then we will only delay the next >>>> run by 10 ms. IMO, this seems like it would not mitigate the original >>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms for >>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate. >>>> Conversely, if northd has a huge recompute and it takes 5000 ms to >>>> complete, then we would delay the next run by 200ms. In this case, >>>> delaying at all seems like it's not necessary since we potentially have >>>> 5000 ms worth of NB DB updates that have not been addressed. I would >>>> have expected the opposite approach to be taken. If someone configures >>>> 200ms as their backoff interval, I would expect us to always allow a >>>> *minimum* of 200ms of NB changes to accumulate before running again. So >>>> for instance, if northd runs quickly and is done in 10 ms, then we > would >>>> wait 200 - 10 = 190 ms before processing changes again. If northd takes >>>> a long time to recompute and takes 5000 ms, then we would not wait at >>>> all before processing changes again. Was the algorithm chosen based on >>>> experimentation? Is it a well-known method I'm just not familiar with? >> >> I think the main assumption (that should probably be made explicit in >> the commit log and/or documentation) is that on average changes happen >> in a uniform way. This might not always be accurate. >> >> However, if we're off with the estimate, in the worst case we'd be >> adding the configured max delay to the latency of processing changes. >> So, as long as the value is not extremely high, the impact is not that >> high either. >> >> Last but not least, as this value would be configured by the CMS, we >> assume the CMS knows what they're doing. :) >> >>>> >>> >>> I'm not sure if the algorithm is well known. >>> >>> The thing is that at scale we almost always cap at the backoff so it has >>> probably >>> the same effect as your suggestion with the difference that we actually >>> delay even >>> after long runs. And that is actually desired, it's true that in the > let's >>> say 500 ms >>> should be enough to accumulate more changes however that can lead to > another >>> 500 ms run and so on. That in the end means that northd will spin at > 100% >>> CPU >>> anyway which is what we want to avoid. So from another point of view the >>> accumulation >>> of IDL changes is a secondary effect which is still desired, but not the >>> main purpose. >>> >>> Also delaying by short time if the previous run was short is fine, you > are >>> right that we don't >>> accumulate enough however during short running times there is a high > chance >>> that the >>> northd would get to sleep anyway (We will help it to sleep at least a > bit >>> nevertheless). >>> >>> >>>> >>>> Next, I notice that you've added new poll_timer_wait() calls but > haven't >>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait() >>>> calls. Is there any danger of ovn-northd getting in a busy loop of >>>> sleeping and waking because of this? I don't think it should, since >>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on > by >>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check. >>>> >>> >>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will process >>> anything >>> that it can and wait will be fine. The problem would be if we would > skip the >>> ovsdb_idl_loop_run() for some reason. >>> >>> >>>> >>>> Next, does this have any negative impact on our ability to perform >>>> incremental processing in ovn-northd? My concern is that since we are >>>> still running the ovsdb IDL loop that if multiple NB changes occur >>>> during our delay, then we might have to fall back to a full recompute >>>> instead of being able to incrementally process the changes. Are my >>>> concerns valid? >>>> >>> >>> I suppose that can happen if there are changes that could result in >>> "conflict" >>> and full recompute. During the test we haven't seen anything like that. >>> The odds of that happening are small because as stated previously we are >>> doing >>> basically the same as if the engine was running for a long time always > from >>> the IDL >>> point of view except that we give IDL a chance to process whatever has >>> pilled up >>> within the sleep period. >>> >>> >>>> >>>> Next, has scale testing shown that this change has made a positive >>>> impact? If so, is there any recommendation for how to determine what to >>>> configure the value to? >>>> >>>> >>> It has a huge impact actually the value tested was 200 ms, the >>> recommendation >> >> This was chosen based on the historical data from similar tests which >> showed that the I-P engine was taking ~180-200 ms to run at scale. >> >>> would be < 500 ms. After that point the latency on components creation >>> would be >>> very noticable. I will put the recommendation into the ovn-nb.xml with > the >>> latency >>> comment. Before I'll post v2 (which has the numbers in commit message) > those >>> are the test results: >>> >>> Run without any backoff period: >>> northd aggregate CPU 9810% avg / 12765% max >>> northd was spinning at 100% CPU the entire second half of the test. >>> >>> Run with 200 ms max backoff period: >>> northd aggregate CPU 6066% avg / 7689% max >>> northd was around 60% for the second half of the test >>> >>> >>>> >>>> Finally, is this change expected to be a long-term necessity? This >>>> option seems to be useful for cases where northd recomputes are >>>> required. Performing recomputes less frequently seems like it would >>>> lower the CPU usage of ovn-northd while still processing the same > amount >>>> of changes. However, once northd can handle most changes incrementally, >>>> is there still a benefit to delaying running? If each run of northd >>>> handles all DB changes incrementally, then is there any point in > putting >>>> delays between those incremental runs? >>>> >>>> >>> Ideally we won't need it in the future. However, the assumption for not >>> needing >>> anything like this is that northd will be fast enough to process I-P >>> changes and >>> be able to sleep between the next batch update arrives from CMS. That >>> doesn't >>> seem to happen in very near future, one thing to keep in mind is that >>> testing >>> happened with Numan's I-P for LBs and lflows which make a huge > difference, >>> but >>> still not enough to achieve the mentioned northd state. So from my >>> perspective >>> it will be relevant for a few releases. And as stated above the point > is to >>> prevent >>> northd to spin at 100% CPU all the time. >>> >> >> +1 it's not the prettiest feature (and some might rightfully call it a >> hack) but it seems to me like the cleanest alternative for now, until >> northd processing is fully incremental. > > In most cases it may be fine, but it might be a problem for a worst case > scenario: > > Assume all the changes coming in NB can be incrementally processed but at > a very very high rate, and ovn-northd keeps processing the changes > incrementally. Since the change rate is so high, ovn-northd barely keeps up > with the changes with 99% CPU load. For example, I-P for each object takes > 10ms, and the change rate is 99 objects/sec. According to this algorithm, > ovn-northd will always sleep for the maximum 200ms between each IDL run, > and then ovn-northd would never keep up with the changes any more - the > backlog will become longer and longer because of the wasted idle time. >
IDL runs are not skipped. Just I-P engine runs. So I think this won't be a problem, or am I missing something? Regards, Dumitru > In practice this means decreased max throughput supported by ovn-northd, > which probably needs to be called out as a tradeoff (in addition to the > latency tradeoff). > > Thanks, > Han > >> >> Regards, >> Dumitru >> >>> >>>> >>>> On 8/9/23 01:29, Ales Musil wrote: >>>>> Add config option called "northd-backoff-interval-ms" that allows >>>>> to delay northd engine runs capped by the config option. >>>>> When the config option is set to 0 or unspecified, the engine >>>>> will run without any restrictions. If the value >0 we will delay >>>>> northd engine run by the previous run time capped by the >>>>> config option. >>>>> >>>>> Signed-off-by: Ales Musil <amu...@redhat.com> >>>>> --- >>>>> NEWS | 2 ++ >>>>> northd/inc-proc-northd.c | 23 ++++++++++++++++++++++- >>>>> northd/inc-proc-northd.h | 3 ++- >>>>> northd/ovn-northd.c | 9 +++++++-- >>>>> ovn-nb.xml | 7 +++++++ >>>>> 5 files changed, 40 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/NEWS b/NEWS >>>>> index 8275877f9..6109f13a2 100644 >>>>> --- a/NEWS >>>>> +++ b/NEWS >>>>> @@ -10,6 +10,8 @@ Post v23.06.0 >>>>> - To allow optimizing ovn-controller's monitor conditions for the >>>> regular >>>>> VIF case, ovn-controller now unconditionally monitors all > sub-ports >>>>> (ports with parent_port set). >>>>> + - Add "northd-backoff-interval-ms" config option to delay northd >>>> engine >>>>> + runs capped at the set value. >>>>> >>>>> OVN v23.06.0 - 01 Jun 2023 >>>>> -------------------------- >>>>> diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c >>>>> index d328deb22..87db50ad1 100644 >>>>> --- a/northd/inc-proc-northd.c >>>>> +++ b/northd/inc-proc-northd.c >>>>> @@ -42,6 +42,8 @@ VLOG_DEFINE_THIS_MODULE(inc_proc_northd); >>>>> >>>>> static unixctl_cb_func chassis_features_list; >>>>> >>>>> +static int64_t next_northd_run_ms = 0; >>>>> + >>>>> #define NB_NODES \ >>>>> NB_NODE(nb_global, "nb_global") \ >>>>> NB_NODE(logical_switch, "logical_switch") \ >>>>> @@ -295,8 +297,10 @@ void inc_proc_northd_init(struct ovsdb_idl_loop > *nb, >>>>> /* Returns true if the incremental processing ended up updating > nodes. >>>> */ >>>>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, >>>>> struct ovsdb_idl_txn *ovnsb_txn, >>>>> - bool recompute) { >>>>> + bool recompute, uint32_t backoff_ms) { >>>>> ovs_assert(ovnnb_txn && ovnsb_txn); >>>>> + >>>>> + int64_t start = time_msec(); >>>>> engine_init_run(); >>>>> >>>>> /* Force a full recompute if instructed to, for example, after a >>>> NB/SB >>>>> @@ -330,6 +334,12 @@ bool inc_proc_northd_run(struct ovsdb_idl_txn >>>> *ovnnb_txn, >>>>> } else { >>>>> engine_set_force_recompute(false); >>>>> } >>>>> + >>>>> + int64_t now = time_msec(); >>>>> + /* Postpone the next run by length of current run with maximum >>>> capped >>>>> + * by "northd-backoff-interval-ms" interval. */ >>>>> + next_northd_run_ms = now + MIN(now - start, backoff_ms); >>>>> + >>>>> return engine_has_updated(); >>>>> } >>>>> >>>>> @@ -339,6 +349,17 @@ void inc_proc_northd_cleanup(void) >>>>> engine_set_context(NULL); >>>>> } >>>>> >>>>> +bool >>>>> +inc_proc_northd_can_run(bool recompute) >>>>> +{ >>>>> + if (recompute || time_msec() >= next_northd_run_ms) { >>>>> + return true; >>>>> + } >>>>> + >>>>> + poll_timer_wait_until(next_northd_run_ms); >>>>> + return false; >>>>> +} >>>>> + >>>>> static void >>>>> chassis_features_list(struct unixctl_conn *conn, int argc > OVS_UNUSED, >>>>> const char *argv[] OVS_UNUSED, void > *features_) >>>>> diff --git a/northd/inc-proc-northd.h b/northd/inc-proc-northd.h >>>>> index 9b81c7ee0..af418d7d7 100644 >>>>> --- a/northd/inc-proc-northd.h >>>>> +++ b/northd/inc-proc-northd.h >>>>> @@ -10,7 +10,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb, >>>>> struct ovsdb_idl_loop *sb); >>>>> bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn, >>>>> struct ovsdb_idl_txn *ovnsb_txn, >>>>> - bool recompute); >>>>> + bool recompute, uint32_t backoff_ms); >>>>> void inc_proc_northd_cleanup(void); >>>>> +bool inc_proc_northd_can_run(bool recompute); >>>>> >>>>> #endif /* INC_PROC_NORTHD */ >>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c >>>>> index 4fa1b039e..3202b50a1 100644 >>>>> --- a/northd/ovn-northd.c >>>>> +++ b/northd/ovn-northd.c >>>>> @@ -868,6 +868,7 @@ main(int argc, char *argv[]) >>>>> /* Main loop. */ >>>>> exiting = false; >>>>> >>>>> + uint32_t northd_backoff_ms = 0; >>>>> bool recompute = false; >>>>> while (!exiting) { >>>>> update_ssl_config(); >>>>> @@ -932,10 +933,12 @@ main(int argc, char *argv[]) >>>>> >>>>> if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) { >>>>> bool activity = false; >>>>> - if (ovnnb_txn && ovnsb_txn) { >>>>> + if (ovnnb_txn && ovnsb_txn && >>>>> + inc_proc_northd_can_run(recompute)) { >>>>> int64_t loop_start_time = time_wall_msec(); >>>>> activity = inc_proc_northd_run(ovnnb_txn, >>>> ovnsb_txn, >>>>> - recompute); >>>>> + recompute, >>>>> + > northd_backoff_ms); >>>>> recompute = false; >>>>> check_and_add_supported_dhcp_opts_to_sb_db( >>>>> ovnsb_txn, ovnsb_idl_loop.idl); >>>>> @@ -1019,6 +1022,8 @@ main(int argc, char *argv[]) >>>>> if (nb) { >>>>> interval = smap_get_int(&nb->options, >>>> "northd_probe_interval", >>>>> interval); >>>>> + northd_backoff_ms = smap_get_uint(&nb->options, >>>>> + >>>> "northd-backoff-interval-ms", 0); >>>>> } >>>>> set_idl_probe_interval(ovnnb_idl_loop.idl, ovnnb_db, > interval); >>>>> set_idl_probe_interval(ovnsb_idl_loop.idl, ovnsb_db, > interval); >>>>> diff --git a/ovn-nb.xml b/ovn-nb.xml >>>>> index 4fbf4f7e5..115dfd536 100644 >>>>> --- a/ovn-nb.xml >>>>> +++ b/ovn-nb.xml >>>>> @@ -349,6 +349,13 @@ >>>>> of HWOL compatibility with GDP. >>>>> </column> >>>>> >>>>> + <column name="options" key="northd-backoff-interval-ms"> >>>>> + Maximum interval that the northd incremental engine is > delayed >>>> by >>>>> + in milliseconds. Setting the value to nonzero delays the next >>>> northd >>>>> + engine run by the previous run time, capped by the specified >>>> value. >>>>> + If the value is zero the engine won't be delayed at all. >>>>> + </column> >>>>> + >>>>> <group title="Options for configuring interconnection route >>>> advertisement"> >>>>> <p> >>>>> These options control how routes are advertised between > OVN >>>> >>>> >>> Thanks, >>> Ales >>> >> >> _______________________________________________ >> dev mailing list >> d...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-dev > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev