On 8/10/23 17:20, Han Zhou wrote:
> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>
>> On 8/10/23 15:34, Han Zhou wrote:
>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com> wrote:
>>>>
>>>> On 8/10/23 08:12, Ales Musil wrote:
>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com>
>>> wrote:
>>>>>
>>>>>> Hi Ales,
>>>>>>
>>>>>> I have some high-level comments/questions about this patch.
>>>>>>
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>
>>>> Hi Ales, Mark,
>>>>
>>>>> thank you for the review. See my answers inline below.
>>>>>
>>>>>
>>>>>> I have been privy to the conversations that led to this change. My
>>>>>> understanding is that by having ovn-northd wake up immediately, it
> can
>>>>>> be more CPU-intensive than waiting a bit for changes to accumulate
> and
>>>>>> handling all of those at once instead. However, nothing in either the
>>>>>> commit message or ovn-nb.xml explains what the purpose of this new
>>>>>> configuration option is. I think you should add a sentence or two to
>>>>>> explain why someone would want to enable this option.
>>>>>>
>>>>>>
>>>>> Yeah that's my bad, I have v2 prepared with some explanation in the
>>> commit
>>>>> message
>>>>> together with results from scale run.
>>>>>
>>>>
>>>> +1 we really need to explain why this change is needed.
>>>>
>>>>>
>>>>>>
>>>>>> Next, the algorithm used here strikes me as odd. We use the previous
>>> run
>>>>>> time of ovn-northd to determine how long to wait before running
> again.
>>>>>> This delay is capped by the configured backoff time. Let's say that
>>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd
> has a
>>>>>> super quick run and only takes 10 ms, then we will only delay the
> next
>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the original
>>>>>> issue by much, since we are only allowing a maximum of 20 ms (10 ms
> for
>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate.
>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms to
>>>>>> complete, then we would delay the next run by 200ms. In this case,
>>>>>> delaying at all seems like it's not necessary since we potentially
> have
>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I would
>>>>>> have expected the opposite approach to be taken. If someone
> configures
>>>>>> 200ms as their backoff interval, I would expect us to always allow a
>>>>>> *minimum* of 200ms of NB changes to accumulate before running again.
> So
>>>>>> for instance, if northd runs quickly and is done in 10 ms, then we
>>> would
>>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd
> takes
>>>>>> a long time to recompute and takes 5000 ms, then we would not wait at
>>>>>> all before processing changes again. Was the algorithm chosen based
> on
>>>>>> experimentation? Is it a well-known method I'm just not familiar
> with?
>>>>
>>>> I think the main assumption (that should probably be made explicit in
>>>> the commit log and/or documentation) is that on average changes happen
>>>> in a uniform way.  This might not always be accurate.
>>>>
>>>> However, if we're off with the estimate, in the worst case we'd be
>>>> adding the configured max delay to the latency of processing changes.
>>>> So, as long as the value is not extremely high, the impact is not that
>>>> high either.
>>>>
>>>> Last but not least, as this value would be configured by the CMS, we
>>>> assume the CMS knows what they're doing. :)
>>>>
>>>>>>
>>>>>
>>>>> I'm not sure if the algorithm is well known.
>>>>>
>>>>> The thing is that at scale we almost always cap at the backoff so it
> has
>>>>> probably
>>>>> the same effect as your suggestion with the difference that we
> actually
>>>>> delay even
>>>>> after long runs. And that is actually desired, it's true that in the
>>> let's
>>>>> say 500 ms
>>>>> should be enough to accumulate more changes however that can lead to
>>> another
>>>>> 500 ms run and so on. That in the end means that northd will spin at
>>> 100%
>>>>> CPU
>>>>> anyway which is what we want to avoid. So from another point of view
> the
>>>>> accumulation
>>>>> of IDL changes is a secondary effect which is still desired, but not
> the
>>>>> main purpose.
>>>>>
>>>>> Also delaying by short time if the previous run was short is fine, you
>>> are
>>>>> right that we don't
>>>>> accumulate enough however during short running times there is a high
>>> chance
>>>>> that the
>>>>> northd would get to sleep anyway (We will help it to sleep at least a
>>> bit
>>>>> nevertheless).
>>>>>
>>>>>
>>>>>>
>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
>>> haven't
>>>>>> changed the ovsdb_idl_loop_run() or ovsdb_idl_loop_commit_and_wait()
>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop of
>>>>>> sleeping and waking because of this? I don't think it should, since
>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions waited on
>>> by
>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
>>>>>>
>>>>>
>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will
> process
>>>>> anything
>>>>> that it can and wait will be fine. The problem would be if we would
>>> skip the
>>>>> ovsdb_idl_loop_run() for some reason.
>>>>>
>>>>>
>>>>>>
>>>>>> Next, does this have any negative impact on our ability to perform
>>>>>> incremental processing in ovn-northd? My concern is that since we are
>>>>>> still running the ovsdb IDL loop that if multiple NB changes occur
>>>>>> during our delay, then we might have to fall back to a full recompute
>>>>>> instead of being able to incrementally process the changes. Are my
>>>>>> concerns valid?
>>>>>>
>>>>>
>>>>> I suppose that can happen if there are changes that could result in
>>>>> "conflict"
>>>>> and full recompute. During the test we haven't seen anything like
> that.
>>>>> The odds of that happening are small because as stated previously we
> are
>>>>> doing
>>>>> basically the same as if the engine was running for a long time always
>>> from
>>>>> the IDL
>>>>> point of view except that we give IDL a chance to process whatever has
>>>>> pilled up
>>>>> within the sleep period.
>>>>>
>>>>>
>>>>>>
>>>>>> Next, has scale testing shown that this change has made a positive
>>>>>> impact? If so, is there any recommendation for how to determine what
> to
>>>>>> configure the value to?
>>>>>>
>>>>>>
>>>>> It has a huge impact actually the value tested was 200 ms, the
>>>>> recommendation
>>>>
>>>> This was chosen based on the historical data from similar tests which
>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
>>>>
>>>>> would be < 500 ms. After that point the latency on components creation
>>>>> would be
>>>>> very noticable. I will put the recommendation into the ovn-nb.xml with
>>> the
>>>>> latency
>>>>> comment. Before I'll post v2 (which has the numbers in commit message)
>>> those
>>>>> are the test results:
>>>>>
>>>>> Run without any backoff period:
>>>>> northd aggregate CPU 9810% avg / 12765% max
>>>>> northd was spinning at 100% CPU the entire second half of the test.
>>>>>
>>>>> Run with 200 ms max backoff period:
>>>>> northd aggregate CPU 6066% avg / 7689% max
>>>>> northd was around 60% for the second half of the test
>>>>>
>>>>>
>>>>>>
>>>>>> Finally, is this change expected to be a long-term necessity? This
>>>>>> option seems to be useful for cases where northd recomputes are
>>>>>> required. Performing recomputes less frequently seems like it would
>>>>>> lower the CPU usage of ovn-northd while still processing the same
>>> amount
>>>>>> of changes. However, once northd can handle most changes
> incrementally,
>>>>>> is there still a benefit to delaying running? If each run of northd
>>>>>> handles all DB changes incrementally, then is there any point in
>>> putting
>>>>>> delays between those incremental runs?
>>>>>>
>>>>>>
>>>>> Ideally we won't need it in the future. However, the assumption for
> not
>>>>> needing
>>>>> anything like this is that northd will be fast enough to process I-P
>>>>> changes and
>>>>> be able to sleep between the next batch update arrives from CMS. That
>>>>> doesn't
>>>>> seem to happen in very near future, one thing to keep in mind is that
>>>>> testing
>>>>> happened with Numan's I-P for LBs and lflows which make a huge
>>> difference,
>>>>> but
>>>>> still not enough to achieve the mentioned northd state. So from my
>>>>> perspective
>>>>> it will be relevant for a few releases. And as stated above the point
>>> is to
>>>>> prevent
>>>>> northd to spin at 100% CPU all the time.
>>>>>
>>>>
>>>> +1 it's not the prettiest feature (and some might rightfully call it a
>>>> hack) but it seems to me like the cleanest alternative for now, until
>>>> northd processing is fully incremental.
>>>
>>> In most cases it may be fine, but it might be a problem for a worst case
>>> scenario:
>>>
>>> Assume all the changes coming in NB can be incrementally processed but
> at
>>> a very very high rate, and ovn-northd keeps processing the changes
>>> incrementally. Since the change rate is so high, ovn-northd barely
> keeps up
>>> with the changes with 99% CPU load. For example, I-P for each object
> takes
>>> 10ms, and the change rate is 99 objects/sec. According to this
> algorithm,
>>> ovn-northd will always sleep for the maximum 200ms between each IDL run,
>>> and then ovn-northd would never keep up with the changes any more - the
>>> backlog will become longer and longer because of the wasted idle time.
>>>
>>
>> IDL runs are not skipped.  Just I-P engine runs.  So I think this won't
>> be a problem, or am I missing something?
> 
> Sorry about the typo, I meant to say between each "engine run" instead of
> "IDL run". IDL run is not skipped, but the backlog (accumulated changes in
> IDL) becomes longer and longer. E.g.:
> 
> (assume change rate is 100 object/sec)
> 
> run-1: handles 1 object, takes 10ms, sleep 10ms
> run-2: handles 2 objects, takes 20ms, sleep 20ms
> run-3: handles 4 objects, takes 40ms, sleep 40ms
> run-4: handles 8 objects, takes 80ms, sleep 80ms
> run-5: handles 16 objects, takes 160ms, sleep 160ms
> run-6: handles 32 objects, takes 320ms, sleep 200ms
> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms, sleep
> 200ms
> run-8: handles 72 objects, takes 720ms, sleep 200ms
> run-9: handles 92 objects, takes 920ms, sleep 200ms
> ...
> As we can see the backlog grows indefinitely if the input keeps changing at
> the rate of 100 obj/s.
> 

I see now, thanks for the explanation.  But isn't this possible today
too?  Maybe less probable though.

Also, we'll probably hit other issues (e.g., timeouts on CMS side)
because of the backlog which will (likely?) throttle the CMS.

What would be a good solution?  It seems hard to define what a "large
backlog" is.  If we could do that, maybe triggering a full recompute
when the backlog is large enough might be OK.

What do you think?

Thanks,
Dumitru

> Thanks,
> Han
> 
>>
>> Regards,
>> Dumitru
>>
>>> In practice this means decreased max throughput supported by ovn-northd,
>>> which probably needs to be called out as a tradeoff (in addition to the
>>> latency tradeoff).
>>>
>>> Thanks,
>>> Han
>>>
>>>>
>>>> Regards,
>>>> Dumitru
>>>>
>>>>>
>>>>>>
>>>>>> On 8/9/23 01:29, Ales Musil wrote:
>>>>>>> Add config option called "northd-backoff-interval-ms" that allows
>>>>>>> to delay northd engine runs capped by the config option.
>>>>>>> When the config option is set to 0 or unspecified, the engine
>>>>>>> will run without any restrictions. If the value >0 we will delay
>>>>>>> northd engine run by the previous run time capped by the
>>>>>>> config option.
>>>>>>>
>>>>>>> Signed-off-by: Ales Musil <amu...@redhat.com>
>>>>>>> ---
>>>>>>>   NEWS                     |  2 ++
>>>>>>>   northd/inc-proc-northd.c | 23 ++++++++++++++++++++++-
>>>>>>>   northd/inc-proc-northd.h |  3 ++-
>>>>>>>   northd/ovn-northd.c      |  9 +++++++--
>>>>>>>   ovn-nb.xml               |  7 +++++++
>>>>>>>   5 files changed, 40 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/NEWS b/NEWS
>>>>>>> index 8275877f9..6109f13a2 100644
>>>>>>> --- a/NEWS
>>>>>>> +++ b/NEWS
>>>>>>> @@ -10,6 +10,8 @@ Post v23.06.0
>>>>>>>     - To allow optimizing ovn-controller's monitor conditions for
> the
>>>>>> regular
>>>>>>>       VIF case, ovn-controller now unconditionally monitors all
>>> sub-ports
>>>>>>>       (ports with parent_port set).
>>>>>>> +  - Add "northd-backoff-interval-ms" config option to delay northd
>>>>>> engine
>>>>>>> +    runs capped at the set value.
>>>>>>>
>>>>>>>   OVN v23.06.0 - 01 Jun 2023
>>>>>>>   --------------------------
>>>>>>> diff --git a/northd/inc-proc-northd.c b/northd/inc-proc-northd.c
>>>>>>> index d328deb22..87db50ad1 100644
>>>>>>> --- a/northd/inc-proc-northd.c
>>>>>>> +++ b/northd/inc-proc-northd.c
>>>>>>> @@ -42,6 +42,8 @@ VLOG_DEFINE_THIS_MODULE(inc_proc_northd);
>>>>>>>
>>>>>>>   static unixctl_cb_func chassis_features_list;
>>>>>>>
>>>>>>> +static int64_t next_northd_run_ms = 0;
>>>>>>> +
>>>>>>>   #define NB_NODES \
>>>>>>>       NB_NODE(nb_global, "nb_global") \
>>>>>>>       NB_NODE(logical_switch, "logical_switch") \
>>>>>>> @@ -295,8 +297,10 @@ void inc_proc_northd_init(struct ovsdb_idl_loop
>>> *nb,
>>>>>>>   /* Returns true if the incremental processing ended up updating
>>> nodes.
>>>>>> */
>>>>>>>   bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
>>>>>>>                            struct ovsdb_idl_txn *ovnsb_txn,
>>>>>>> -                         bool recompute) {
>>>>>>> +                         bool recompute, uint32_t backoff_ms) {
>>>>>>>       ovs_assert(ovnnb_txn && ovnsb_txn);
>>>>>>> +
>>>>>>> +    int64_t start = time_msec();
>>>>>>>       engine_init_run();
>>>>>>>
>>>>>>>       /* Force a full recompute if instructed to, for example,
> after a
>>>>>> NB/SB
>>>>>>> @@ -330,6 +334,12 @@ bool inc_proc_northd_run(struct ovsdb_idl_txn
>>>>>> *ovnnb_txn,
>>>>>>>       } else {
>>>>>>>           engine_set_force_recompute(false);
>>>>>>>       }
>>>>>>> +
>>>>>>> +    int64_t now = time_msec();
>>>>>>> +    /* Postpone the next run by length of current run with maximum
>>>>>> capped
>>>>>>> +     * by "northd-backoff-interval-ms" interval. */
>>>>>>> +    next_northd_run_ms = now + MIN(now - start, backoff_ms);
>>>>>>> +
>>>>>>>       return engine_has_updated();
>>>>>>>   }
>>>>>>>
>>>>>>> @@ -339,6 +349,17 @@ void inc_proc_northd_cleanup(void)
>>>>>>>       engine_set_context(NULL);
>>>>>>>   }
>>>>>>>
>>>>>>> +bool
>>>>>>> +inc_proc_northd_can_run(bool recompute)
>>>>>>> +{
>>>>>>> +    if (recompute || time_msec() >= next_northd_run_ms) {
>>>>>>> +        return true;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    poll_timer_wait_until(next_northd_run_ms);
>>>>>>> +    return false;
>>>>>>> +}
>>>>>>> +
>>>>>>>   static void
>>>>>>>   chassis_features_list(struct unixctl_conn *conn, int argc
>>> OVS_UNUSED,
>>>>>>>                         const char *argv[] OVS_UNUSED, void
>>> *features_)
>>>>>>> diff --git a/northd/inc-proc-northd.h b/northd/inc-proc-northd.h
>>>>>>> index 9b81c7ee0..af418d7d7 100644
>>>>>>> --- a/northd/inc-proc-northd.h
>>>>>>> +++ b/northd/inc-proc-northd.h
>>>>>>> @@ -10,7 +10,8 @@ void inc_proc_northd_init(struct ovsdb_idl_loop
> *nb,
>>>>>>>                             struct ovsdb_idl_loop *sb);
>>>>>>>   bool inc_proc_northd_run(struct ovsdb_idl_txn *ovnnb_txn,
>>>>>>>                            struct ovsdb_idl_txn *ovnsb_txn,
>>>>>>> -                         bool recompute);
>>>>>>> +                         bool recompute, uint32_t backoff_ms);
>>>>>>>   void inc_proc_northd_cleanup(void);
>>>>>>> +bool inc_proc_northd_can_run(bool recompute);
>>>>>>>
>>>>>>>   #endif /* INC_PROC_NORTHD */
>>>>>>> diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
>>>>>>> index 4fa1b039e..3202b50a1 100644
>>>>>>> --- a/northd/ovn-northd.c
>>>>>>> +++ b/northd/ovn-northd.c
>>>>>>> @@ -868,6 +868,7 @@ main(int argc, char *argv[])
>>>>>>>       /* Main loop. */
>>>>>>>       exiting = false;
>>>>>>>
>>>>>>> +    uint32_t northd_backoff_ms = 0;
>>>>>>>       bool recompute = false;
>>>>>>>       while (!exiting) {
>>>>>>>           update_ssl_config();
>>>>>>> @@ -932,10 +933,12 @@ main(int argc, char *argv[])
>>>>>>>
>>>>>>>               if (ovsdb_idl_has_lock(ovnsb_idl_loop.idl)) {
>>>>>>>                   bool activity = false;
>>>>>>> -                if (ovnnb_txn && ovnsb_txn) {
>>>>>>> +                if (ovnnb_txn && ovnsb_txn &&
>>>>>>> +                    inc_proc_northd_can_run(recompute)) {
>>>>>>>                       int64_t loop_start_time = time_wall_msec();
>>>>>>>                       activity = inc_proc_northd_run(ovnnb_txn,
>>>>>> ovnsb_txn,
>>>>>>> -                                                        recompute);
>>>>>>> +                                                   recompute,
>>>>>>> +
>>> northd_backoff_ms);
>>>>>>>                       recompute = false;
>>>>>>>                       check_and_add_supported_dhcp_opts_to_sb_db(
>>>>>>>                                    ovnsb_txn, ovnsb_idl_loop.idl);
>>>>>>> @@ -1019,6 +1022,8 @@ main(int argc, char *argv[])
>>>>>>>           if (nb) {
>>>>>>>               interval = smap_get_int(&nb->options,
>>>>>> "northd_probe_interval",
>>>>>>>                                       interval);
>>>>>>> +            northd_backoff_ms = smap_get_uint(&nb->options,
>>>>>>> +
>>>>>> "northd-backoff-interval-ms", 0);
>>>>>>>           }
>>>>>>>           set_idl_probe_interval(ovnnb_idl_loop.idl, ovnnb_db,
>>> interval);
>>>>>>>           set_idl_probe_interval(ovnsb_idl_loop.idl, ovnsb_db,
>>> interval);
>>>>>>> diff --git a/ovn-nb.xml b/ovn-nb.xml
>>>>>>> index 4fbf4f7e5..115dfd536 100644
>>>>>>> --- a/ovn-nb.xml
>>>>>>> +++ b/ovn-nb.xml
>>>>>>> @@ -349,6 +349,13 @@
>>>>>>>           of HWOL compatibility with GDP.
>>>>>>>         </column>
>>>>>>>
>>>>>>> +      <column name="options" key="northd-backoff-interval-ms">
>>>>>>> +        Maximum interval that the northd incremental engine is
>>> delayed
>>>>>> by
>>>>>>> +        in milliseconds. Setting the value to nonzero delays the
> next
>>>>>> northd
>>>>>>> +        engine run by the previous run time, capped by the
> specified
>>>>>> value.
>>>>>>> +        If the value is zero the engine won't be delayed at all.
>>>>>>> +      </column>
>>>>>>> +
>>>>>>>         <group title="Options for configuring interconnection route
>>>>>> advertisement">
>>>>>>>           <p>
>>>>>>>             These options control how routes are advertised between
>>> OVN
>>>>>>
>>>>>>
>>>>> Thanks,
>>>>> Ales
>>>>>
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> d...@openvswitch.org
>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>
>>
> 

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to