On 8/10/23 15:34, Han Zhou wrote:
> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> On 8/10/23 08:12, Ales Musil wrote:
>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com> wrote:
>>>
>>>> Hi Ales,
>>>>
>>>> I have some high-level comments/questions about this patch.
>>>>
>>>
>>> Hi Mark,
>>>
>>
>> Hi Ales, Mark,
>>
>>> thank you for the review. See my answers inline below.
>>>
>>>
>>>> I have been privy to the conversations that led to this change. My
>>>> understanding is that by having ovn-northd wake up immediately, it can
>>>> be more CPU-intensive than waiting a bit for changes to accumulate and
>>>> handling all of those at once instead. However, nothing in either the
>>>> commit message or ovn-nb.xml explains what the purpose of this new
>>>> configuration option is. I think you should add a sentence or two to
>>>> explain why someone would want to enable this option.
>>>>
>>>>
>>> Yeah, that's my bad. I have v2 prepared with some explanation in the
>>> commit message together with results from the scale run.
>>>
>>
>> +1 we really need to explain why this change is needed.
>>
>>>
>>>>
>>>> Next, the algorithm used here strikes me as odd. We use the previous run
>>>> time of ovn-northd to determine how long to wait before running again.
>>>> This delay is capped by the configured backoff time. Let's say that
>>>> we've configured the backoff interval to be 200 ms. If ovn-northd has a
>>>> super quick run and only takes 10 ms, then we will only delay the next
>>>> run by 10 ms. IMO, this seems like it would not mitigate the original
>>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms for
>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate.
>>>> Conversely, if northd has a huge recompute and it takes 5000 ms to
>>>> complete, then we would delay the next run by 200ms. In this case,
>>>> delaying at all seems like it's not necessary since we potentially have
>>>> 5000 ms worth of NB DB updates that have not been addressed. I would
>>>> have expected the opposite approach to be taken. If someone configures
>>>> 200ms as their backoff interval, I would expect us to always allow a
>>>> *minimum* of 200ms of NB changes to accumulate before running again. So
>>>> for instance, if northd runs quickly and is done in 10 ms, then we would
>>>> wait 200 - 10 = 190 ms before processing changes again. If northd takes
>>>> a long time to recompute and takes 5000 ms, then we would not wait at
>>>> all before processing changes again. Was the algorithm chosen based on
>>>> experimentation? Is it a well-known method I'm just not familiar with?
>>
>> I think the main assumption (that should probably be made explicit in
>> the commit log and/or documentation) is that on average changes happen
>> in a uniform way.  This might not always be accurate.
>>
>> However, if we're off with the estimate, in the worst case we'd be
>> adding the configured max delay to the latency of processing changes.
>> So, as long as the value is not extremely high, the impact is not that
>> high either.
>>
>> Last but not least, as this value would be configured by the CMS, we
>> assume the CMS knows what they're doing. :)
>>
>>>>
>>>
>>> I'm not sure if the algorithm is well known.
>>>
>>> The thing is that at scale we almost always cap at the backoff, so it has
>>> probably the same effect as your suggestion, with the difference that we
>>> actually delay even after long runs. And that is actually desired: it's
>>> true that, let's say, 500 ms should be enough to accumulate more changes,
>>> but that can lead to another 500 ms run, and so on. In the end that means
>>> northd will spin at 100% CPU anyway, which is what we want to avoid. So
>>> from another point of view, the accumulation of IDL changes is a
>>> secondary effect which is still desired, but not the main purpose.
>>>
>>> Also, delaying by a short time if the previous run was short is fine. You
>>> are right that we don't accumulate enough, but during short runs there is
>>> a high chance that northd would get to sleep anyway (we will help it to
>>> sleep at least a bit nevertheless).
>>>
>>>
>>>>
>>>> Next, I notice that you've added new poll_timer_wait() calls but haven't
>>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait()
>>>> calls. Is there any danger of ovn-northd getting in a busy loop of
>>>> sleeping and waking because of this? I don't think it should, since
>>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on by
>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
>>>>
>>>
>>> AFAIK it shouldn't cause any issues, as ovsdb_idl_loop_run() will process
>>> anything that it can and the wait will be fine. The problem would be if
>>> we skipped the ovsdb_idl_loop_run() for some reason.
>>>
>>>
>>>>
>>>> Next, does this have any negative impact on our ability to perform
>>>> incremental processing in ovn-northd? My concern is that since we are
>>>> still running the ovsdb IDL loop that if multiple NB changes occur
>>>> during our delay, then we might have to fall back to a full recompute
>>>> instead of being able to incrementally process the changes. Are my
>>>> concerns valid?
>>>>
>>>
>>> I suppose that can happen if there are changes that could result in a
>>> "conflict" and a full recompute. During the test we haven't seen anything
>>> like that. The odds of that happening are small because, as stated
>>> previously, from the IDL point of view we are doing basically the same as
>>> if the engine was running for a long time, except that we give the IDL a
>>> chance to process whatever has piled up within the sleep period.
>>>
>>>
>>>>
>>>> Next, has scale testing shown that this change has made a positive
>>>> impact? If so, is there any recommendation for how to determine what to
>>>> configure the value to?
>>>>
>>>>
>>> It has a huge impact, actually. The value tested was 200 ms; the
>>> recommendation
>>
>> This was chosen based on the historical data from similar tests which
>> showed that the I-P engine was taking ~180-200 ms to run at scale.
>>
>>> would be < 500 ms. After that point, the latency of component creation
>>> would be very noticeable. I will put the recommendation into ovn-nb.xml
>>> together with the latency comment. Before I post v2 (which has the
>>> numbers in the commit message), these are the test results:
>>>
>>> Run without any backoff period:
>>> northd aggregate CPU 9810% avg / 12765% max
>>> northd was spinning at 100% CPU the entire second half of the test.
>>>
>>> Run with 200 ms max backoff period:
>>> northd aggregate CPU 6066% avg / 7689% max
>>> northd was around 60% for the second half of the test
>>>
>>>
>>>>
>>>> Finally, is this change expected to be a long-term necessity? This
>>>> option seems to be useful for cases where northd recomputes are
>>>> required. Performing recomputes less frequently seems like it would
>>>> lower the CPU usage of ovn-northd while still processing the same amount
>>>> of changes. However, once northd can handle most changes incrementally,
>>>> is there still a benefit to delaying running? If each run of northd
>>>> handles all DB changes incrementally, then is there any point in putting
>>>> delays between those incremental runs?
>>>>
>>>>
>>> Ideally we won't need it in the future. However, the assumption for not
>>> needing anything like this is that northd will be fast enough to process
>>> I-P changes and be able to sleep until the next batch of updates arrives
>>> from the CMS. That doesn't seem likely to happen in the very near future.
>>> One thing to keep in mind is that the testing happened with Numan's I-P
>>> for LBs and lflows, which makes a huge difference but is still not enough
>>> to achieve the mentioned northd state. So from my perspective it will be
>>> relevant for a few releases. And, as stated above, the point is to
>>> prevent northd from spinning at 100% CPU all the time.
>>>
>>
>> +1 it's not the prettiest feature (and some might rightfully call it a
>> hack) but it seems to me like the cleanest alternative for now, until
>> northd processing is fully incremental.
> 
> In most cases it may be fine, but it might be a problem in a worst-case
> scenario:
> 
> Assume all the changes coming in NB can be incrementally processed but at
> a very very high rate, and ovn-northd keeps processing the changes
> incrementally. Since the change rate is so high, ovn-northd barely keeps up
> with the changes with 99% CPU load. For example, I-P for each object takes
> 10ms, and the change rate is 99 objects/sec. According to this algorithm,
> ovn-northd will always sleep for the maximum 200ms between each IDL run,
> and then ovn-northd would never keep up with the changes any more - the
> backlog will become longer and longer because of the wasted idle time.
> 

IDL runs are not skipped.  Just I-P engine runs.  So I think this won't
be a problem, or am I missing something?

Regards,
Dumitru

> In practice this means decreased max throughput supported by ovn-northd,
> which probably needs to be called out as a tradeoff (in addition to the
> latency tradeoff).
> 
> Thanks,
> Han
> 
>>
>> Regards,
>> Dumitru
>>
>>>
>>>>
>>>> On 8/9/23 01:29, Ales Musil wrote:
>>>>> Add a config option called "northd-backoff-interval-ms" that allows
>>>>> delaying northd engine runs, capped by the config option.
>>>>> When the config option is set to 0 or unspecified, the engine
>>>>> will run without any restrictions. If the value is > 0, we will
>>>>> delay the northd engine run by the previous run time, capped by
>>>>> the config option.
>>>>>
>>>>> Signed-off-by: Ales Musil <amu...@redhat.com>
>>>>> ---
>>>>>   NEWS                     |  2 ++
>>>>>   northd/inc-proc-northd.c | 23 ++++++++++++++++++++++-
>>>>>   northd/inc-proc-northd.h |  3 ++-
>>>>>   northd/ovn-northd.c      |  9 +++++++--
>>>>>   ovn-nb.xml               |  7 +++++++
>>>>>   5 files changed, 40 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/NEWS b/NEWS
>>>>> index 8275877f9..6109f13a2 100644
>>>>> --- a/NEWS
>>>>> +++ b/NEWS
>>>>> @@ -10,6 +10,8 @@ Post v23.06.0
>>>>>     - To allow optimizing ovn-controller's monitor conditions for the regular
>>>>>       VIF case, ovn-controller now unconditionally monitors all sub-ports
>>>>>       (ports with parent_port set).
>>>>> +  - Add "northd-backoff-interval-ms" config option to delay northd engine
>>>>> +    runs capped at the set value.
>>>>>
>>>>>   OVN v23.06.0 - 01 Jun 2023
>>>>>   --------------------------
>>>>> diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c
>>>>> index d328deb22..87db50ad1 100644
>>>>> --- a/northd/inc-proc-northd.c
>>>>> +++ b/northd/inc-proc-northd.c
>>>>> @@ -42,6 +42,8 @@ VLOG_DEFINE_THIS_MODULE(inc_proc_northd);
>>>>>
>>>>>   static unixctl_cb_func chassis_features_list;
>>>>>
>>>>> +static int64_t next_northd_run_ms = 0;
>>>>> +
>>>>>   #define NB_NODES \
>>>>>       NB_NODE(nb_global, "nb_global") \
>>>>>       NB_NODE(logical_switch, "logical_switch") \
>>>>> @@ -295,8 +297,10 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
>>>>>   /* Returns true if the incremental processing ended up updating nodes. */
>>>>>   bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
>>>>>                            struct ovsdb_idl_txn *ovnsb_txn,
>>>>> -                         bool recompute) {
>>>>> +                         bool recompute, uint32_t backoff_ms) {
>>>>>       ovs_assert(ovnnb_txn && ovnsb_txn);
>>>>> +
>>>>> +    int64_t start = time_msec();
>>>>>       engine_init_run();
>>>>>
>>>>>       /* Force a full recompute if instructed to, for example, after a NB/SB
>>>>> @@ -330,6 +334,12 @@ bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
>>>>>       } else {
>>>>>           engine_set_force_recompute(false);
>>>>>       }
>>>>> +
>>>>> +    int64_t now = time_msec();
>>>>> +    /* Postpone the next run by length of current run with maximum capped
>>>>> +     * by "northd-backoff-interval-ms" interval. */
>>>>> +    next_northd_run_ms = now + MIN(now - start, backoff_ms);
>>>>> +
>>>>>       return engine_has_updated();
>>>>>   }
>>>>>
>>>>> @@ -339,6 +349,17 @@ void inc_proc_northd_cleanup(void)
>>>>>       engine_set_context(NULL);
>>>>>   }
>>>>>
>>>>> +bool
>>>>> +inc_proc_northd_can_run(bool recompute)
>>>>> +{
>>>>> +    if (recompute || time_msec() >= next_northd_run_ms) {
>>>>> +        return true;
>>>>> +    }
>>>>> +
>>>>> +    poll_timer_wait_until(next_northd_run_ms);
>>>>> +    return false;
>>>>> +}
>>>>> +
>>>>>   static void
>>>>>   chassis_features_list(struct unixctl_conn *conn, int argc OVS_UNUSED,
>>>>>                         const char *argv[] OVS_UNUSED, void *features_)
>>>>> diff --git a/northd/inc-proc-northd.h b/northd/inc-proc-northd.h
>>>>> index 9b81c7ee0..af418d7d7 100644
>>>>> --- a/northd/inc-proc-northd.h
>>>>> +++ b/northd/inc-proc-northd.h
>>>>> @@ -10,7 +10,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop *nb,
>>>>>                             struct ovsdb_idl_loop *sb);
>>>>>   bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
>>>>>                            struct ovsdb_idl_txn *ovnsb_txn,
>>>>> -                         bool recompute);
>>>>> +                         bool recompute, uint32_t backoff_ms);
>>>>>   void inc_proc_northd_cleanup(void);
>>>>> +bool inc_proc_northd_can_run(bool recompute);
>>>>>
>>>>>   #endif /* INC_PROC_NORTHD */
>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>>>>> index 4fa1b039e..3202b50a1 100644
>>>>> --- a/northd/ovn-northd.c
>>>>> +++ b/northd/ovn-northd.c
>>>>> @@ -868,6 +868,7 @@ main(int argc, char *argv[])
>>>>>       /* Main loop. */
>>>>>       exiting = false;
>>>>>
>>>>> +    uint32_t northd_backoff_ms = 0;
>>>>>       bool recompute = false;
>>>>>       while (!exiting) {
>>>>>           update_ssl_config();
>>>>> @@ -932,10 +933,12 @@ main(int argc, char *argv[])
>>>>>
>>>>>               if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) {
>>>>>                   bool activity = false;
>>>>> -                if (ovnnb_txn && ovnsb_txn) {
>>>>> +                if (ovnnb_txn && ovnsb_txn &&
>>>>> +                    inc_proc_northd_can_run(recompute)) {
>>>>>                       int64_t loop_start_time = time_wall_msec();
>>>>>                       activity = inc_proc_northd_run(ovnnb_txn, ovnsb_txn,
>>>>> -                                                        recompute);
>>>>> +                                                   recompute,
>>>>> +                                                   northd_backoff_ms);
>>>>>                       recompute = false;
>>>>>                       check_and_add_supported_dhcp_opts_to_sb_db(
>>>>>                                    ovnsb_txn, ovnsb_idl_loop.idl);
>>>>> @@ -1019,6 +1022,8 @@ main(int argc, char *argv[])
>>>>>           if (nb) {
>>>>>               interval = smap_get_int(&nb->options, "northd_probe_interval",
>>>>>                                       interval);
>>>>> +            northd_backoff_ms = smap_get_uint(&nb->options,
>>>>> +                                              "northd-backoff-interval-ms", 0);
>>>>>           }
>>>>>           set_idl_probe_interval(ovnnb_idl_loop.idl, ovnnb_db, interval);
>>>>>           set_idl_probe_interval(ovnsb_idl_loop.idl, ovnsb_db, interval);
>>>>> diff --git a/ovn-nb.xml b/ovn-nb.xml
>>>>> index 4fbf4f7e5..115dfd536 100644
>>>>> --- a/ovn-nb.xml
>>>>> +++ b/ovn-nb.xml
>>>>> @@ -349,6 +349,13 @@
>>>>>           of HWOL compatibility with GDP.
>>>>>         </column>
>>>>>
>>>>> +      <column name="options" key="northd-backoff-interval-ms">
>>>>> +        Maximum interval, in milliseconds, that the northd incremental
>>>>> +        engine is delayed by. Setting the value to nonzero delays the next
>>>>> +        northd engine run by the previous run time, capped by the specified
>>>>> +        value. If the value is zero, the engine won't be delayed at all.
>>>>> +      </column>
>>>>> +
>>>>>         <group title="Options for configuring interconnection route advertisement">
>>>>>           <p>
>>>>>             These options control how routes are advertised between OVN
>>>>
>>>>
>>> Thanks,
>>> Ales
>>>
>>
>> _______________________________________________
>> dev mailing list
>> d...@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
> 
