Re: [ovs-dev] [PATCH ovn] northd: Allow delay of northd engine runs

Han Zhou Fri, 11 Aug 2023 22:08:47 -0700

On Fri, Aug 11, 2023 at 6:07 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 8/10/23 18:38, Ilya Maximets wrote:
> > On 8/10/23 17:34, Dumitru Ceara wrote:
> >> On 8/10/23 17:20, Han Zhou wrote:
> >>> On Thu, Aug 10, 2023 at 6:36 AM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>>
> >>>> On 8/10/23 15:34, Han Zhou wrote:
> >>>>> On Thu, Aug 10, 2023 at 2:29 AM Dumitru Ceara <dce...@redhat.com>
wrote:
> >>>>>>
> >>>>>> On 8/10/23 08:12, Ales Musil wrote:
> >>>>>>> On Wed, Aug 9, 2023 at 5:13 PM Mark Michelson <mmich...@redhat.com
>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Ales,
> >>>>>>>>
> >>>>>>>> I have some high-level comments/questions about this patch.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Hi Mark,
> >>>>>>>
> >>>>>>
> >>>>>> Hi Ales, Mark,
> >>>>>>
> >>>>>>> thank you for the review. See my answers inline below.
> >>>>>>>
> >>>>>>>
> >>>>>>>> I have been privy to the conversations that led to this change.
My
> >>>>>>>> understanding is that by having ovn-northd wake up immediately,
it
> >>> can
> >>>>>>>> be more CPU-intensive than waiting a bit for changes to
accumulate
> >>> and
> >>>>>>>> handling all of those at once instead. However, nothing in
either the
> >>>>>>>> commit message or ovn-nb.xml explains what the purpose of this
new
> >>>>>>>> configuration option is. I think you should add a sentence or
two to
> >>>>>>>> explain why someone would want to enable this option.
> >>>>>>>>
> >>>>>>>>
> >>>>>>> Yeah that's my bad, I have v2 prepared with some explanation in
the
> >>>>> commit
> >>>>>>> message
> >>>>>>> together with results from scale run.
> >>>>>>>
> >>>>>>
> >>>>>> +1 we really need to explain why this change is needed.
> >>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Next, the algorithm used here strikes me as odd. We use the
previous
> >>>>> run
> >>>>>>>> time of ovn-northd to determine how long to wait before running
> >>> again.
> >>>>>>>> This delay is capped by the configured backoff time. Let's say
that
> >>>>>>>> we've configured the backoff interval to be 200 ms. If ovn-northd
> >>> has a
> >>>>>>>> super quick run and only takes 10 ms, then we will only delay the
> >>> next
> >>>>>>>> run by 10 ms. IMO, this seems like it would not mitigate the
original
> >>>>>>>> issue by much, since we are only allowing a maximum of 20 ms (10
ms
> >>> for
> >>>>>>>> the run of ovn-northd + 10 ms delay) of NB changes to accumulate.
> >>>>>>>> Conversely, if northd has a huge recompute and it takes 5000 ms
to
> >>>>>>>> complete, then we would delay the next run by 200ms. In this
case,
> >>>>>>>> delaying at all seems like it's not necessary since we
potentially
> >>> have
> >>>>>>>> 5000 ms worth of NB DB updates that have not been addressed. I
would
> >>>>>>>> have expected the opposite approach to be taken. If someone
> >>> configures
> >>>>>>>> 200ms as their backoff interval, I would expect us to always
allow a
> >>>>>>>> *minimum* of 200ms of NB changes to accumulate before running
again.
> >>> So
> >>>>>>>> for instance, if northd runs quickly and is done in 10 ms, then
we
> >>>>> would
> >>>>>>>> wait 200 - 10 = 190 ms before processing changes again. If northd
> >>> takes
> >>>>>>>> a long time to recompute and takes 5000 ms, then we would not
wait at
> >>>>>>>> all before processing changes again. Was the algorithm chosen
based
> >>> on
> >>>>>>>> experimentation? Is it a well-known method I'm just not familiar
> >>> with?
> >>>>>>
> >>>>>> I think the main assumption (that should probably be made explicit
in
> >>>>>> the commit log and/or documentation) is that on average changes
happen
> >>>>>> in a uniform way.  This might not always be accurate.
> >>>>>>
> >>>>>> However, if we're off with the estimate, in the worst case we'd be
> >>>>>> adding the configured max delay to the latency of processing
changes.
> >>>>>> So, as long as the value is not extremely high, the impact is not
that
> >>>>>> high either.
> >>>>>>
> >>>>>> Last but not least, as this value would be configured by the CMS,
we
> >>>>>> assume the CMS knows what they're doing. :)
> >>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>> I'm not sure if the algorithm is well known.
> >>>>>>>
> >>>>>>> The thing is that at scale we almost always cap at the backoff so
it
> >>> has
> >>>>>>> probably
> >>>>>>> the same effect as your suggestion with the difference that we
> >>> actually
> >>>>>>> delay even
> >>>>>>> after long runs. And that is actually desired, it's true that in
the
> >>>>> let's
> >>>>>>> say 500 ms
> >>>>>>> should be enough to accumulate more changes however that can lead
to
> >>>>> another
> >>>>>>> 500 ms run and so on. That in the end means that northd will spin
at
> >>>>> 100%
> >>>>>>> CPU
> >>>>>>> anyway which is what we want to avoid. So from another point of
view
> >>> the
> >>>>>>> accumulation
> >>>>>>> of IDL changes is a secondary effect which is still desired, but
not
> >>> the
> >>>>>>> main purpose.
> >>>>>>>
> >>>>>>> Also delaying by short time if the previous run was short is
fine, you
> >>>>> are
> >>>>>>> right that we don't
> >>>>>>> accumulate enough however during short running times there is a
high
> >>>>> chance
> >>>>>>> that the
> >>>>>>> northd would get to sleep anyway (We will help it to sleep at
least a
> >>>>> bit
> >>>>>>> nevertheless).
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Next, I notice that you've added new poll_timer_wait() calls but
> >>>>> haven't
> >>>>>>>> changed the ovsdb_idl_loop_run() or
ovsdb_idl_loop_commit_and_wait()
> >>>>>>>> calls. Is there any danger of ovn-northd getting in a busy loop
of
> >>>>>>>> sleeping and waking because of this? I don't think it should,
since
> >>>>>>>> presumably ovsdb_idl_loop_run() should clear the conditions
waited on
> >>>>> by
> >>>>>>>> ovsdb_idl_loop_commit_and_wait(), but I want to double-check.
> >>>>>>>>
> >>>>>>>
> >>>>>>> AFAIK it shouldn't cause any issues as ovsdb_idl_loop_run() will
> >>> process
> >>>>>>> anything
> >>>>>>> that it can and wait will be fine. The problem would be if we
would
> >>>>> skip the
> >>>>>>> ovsdb_idl_loop_run() for some reason.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Next, does this have any negative impact on our ability to
perform
> >>>>>>>> incremental processing in ovn-northd? My concern is that since
we are
> >>>>>>>> still running the ovsdb IDL loop that if multiple NB changes
occur
> >>>>>>>> during our delay, then we might have to fall back to a full
recompute
> >>>>>>>> instead of being able to incrementally process the changes. Are
my
> >>>>>>>> concerns valid?
> >>>>>>>>
> >>>>>>>
> >>>>>>> I suppose that can happen if there are changes that could result
in
> >>>>>>> "conflict"
> >>>>>>> and full recompute. During the test we haven't seen anything like
> >>> that.
> >>>>>>> The odds of that happening are small because as stated previously
we
> >>> are
> >>>>>>> doing
> >>>>>>> basically the same as if the engine was running for a long time
always
> >>>>> from
> >>>>>>> the IDL
> >>>>>>> point of view except that we give IDL a chance to process
whatever has
> >>>>>>> pilled up
> >>>>>>> within the sleep period.
> >>>>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Next, has scale testing shown that this change has made a
positive
> >>>>>>>> impact? If so, is there any recommendation for how to determine
what
> >>> to
> >>>>>>>> configure the value to?
> >>>>>>>>
> >>>>>>>>
> >>>>>>> It has a huge impact actually the value tested was 200 ms, the
> >>>>>>> recommendation
> >>>>>>
> >>>>>> This was chosen based on the historical data from similar tests
which
> >>>>>> showed that the I-P engine was taking ~180-200 ms to run at scale.
> >>>>>>
> >>>>>>> would be < 500 ms. After that point the latency on components
creation
> >>>>>>> would be
> >>>>>>> very noticable. I will put the recommendation into the ovn-nb.xml
with
> >>>>> the
> >>>>>>> latency
> >>>>>>> comment. Before I'll post v2 (which has the numbers in commit
message)
> >>>>> those
> >>>>>>> are the test results:
> >>>>>>>
> >>>>>>> Run without any backoff period:
> >>>>>>> northd aggregate CPU 9810% avg / 12765% max
> >>>>>>> northd was spinning at 100% CPU the entire second half of the
test.
> >>>>>>>
> >>>>>>> Run with 200 ms max backoff period:
> >>>>>>> northd aggregate CPU 6066% avg / 7689% max
> >>>>>>> northd was around 60% for the second half of the test
> >>>>>>>


Ales, could you clarify the version of OVN and the type of test for the
above result? Is it on main branch or with Numans patches for LB I-P? Does
the test include LB changes? Are there recomputes during the test? Was the
100% CPU during the second half of the test caused by recompute or by cost
of I-P?

>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> Finally, is this change expected to be a long-term necessity?
This
> >>>>>>>> option seems to be useful for cases where northd recomputes are
> >>>>>>>> required. Performing recomputes less frequently seems like it
would
> >>>>>>>> lower the CPU usage of ovn-northd while still processing the same
> >>>>> amount
> >>>>>>>> of changes. However, once northd can handle most changes
> >>> incrementally,
> >>>>>>>> is there still a benefit to delaying running? If each run of
northd
> >>>>>>>> handles all DB changes incrementally, then is there any point in
> >>>>> putting
> >>>>>>>> delays between those incremental runs?
> >>>>>>>>
> >>>>>>>>
> >>>>>>> Ideally we won't need it in the future. However, the assumption
for
> >>> not
> >>>>>>> needing
> >>>>>>> anything like this is that northd will be fast enough to process
I-P
> >>>>>>> changes and
> >>>>>>> be able to sleep between the next batch update arrives from CMS.
That
> >>>>>>> doesn't
> >>>>>>> seem to happen in very near future, one thing to keep in mind is
that
> >>>>>>> testing
> >>>>>>> happened with Numan's I-P for LBs and lflows which make a huge
> >>>>> difference,
> >>>>>>> but
> >>>>>>> still not enough to achieve the mentioned northd state. So from my
> >>>>>>> perspective
> >>>>>>> it will be relevant for a few releases. And as stated above the
point
> >>>>> is to
> >>>>>>> prevent
> >>>>>>> northd to spin at 100% CPU all the time.
> >>>>>>>
> >>>>>>
> >>>>>> +1 it's not the prettiest feature (and some might rightfully call
it a
> >>>>>> hack) but it seems to me like the cleanest alternative for now,
until
> >>>>>> northd processing is fully incremental.
> >>>>>
> >>>>> In most cases it may be fine, but it might be a problem for a worst
case
> >>>>> scenario:
> >>>>>
> >>>>> Assume all the changes coming in NB can be incrementally processed
but
> >>> at
> >>>>> a very very high rate, and ovn-northd keeps processing the changes
> >>>>> incrementally. Since the change rate is so high, ovn-northd barely
> >>> keeps up
> >>>>> with the changes with 99% CPU load. For example, I-P for each object
> >>> takes
> >>>>> 10ms, and the change rate is 99 objects/sec. According to this
> >>> algorithm,
> >>>>> ovn-northd will always sleep for the maximum 200ms between each IDL
run,
> >>>>> and then ovn-northd would never keep up with the changes any more -
the
> >>>>> backlog will become longer and longer because of the wasted idle
time.
> >>>>>
> >>>>
> >>>> IDL runs are not skipped.  Just I-P engine runs.  So I think this
won't
> >>>> be a problem, or am I missing something?
> >>>
> >>> Sorry about the typo, I meant to say between each "engine run"
instead of
> >>> "IDL run". IDL run is not skipped, but the backlog (accumulated
changes in
> >>> IDL) becomes longer and longer. E.g.:
> >>>
> >>> (assume change rate is 100 object/sec)
> >>>
> >>> run-1: handles 1 object, takes 10ms, sleep 10ms
> >>> run-2: handles 2 objects, takes 20ms, sleep 20ms
> >>> run-3: handles 4 objects, takes 40ms, sleep 40ms
> >>> run-4: handles 8 objects, takes 80ms, sleep 80ms
> >>> run-5: handles 16 objects, takes 160ms, sleep 160ms
> >>> run-6: handles 32 objects, takes 320ms, sleep 200ms
> >>> run-7: handles 52 objects (accumulated in 320 + 200 ms), takes 520ms,
sleep
> >>> 200ms
> >>> run-8: handles 72 objects, takes 720ms, sleep 200ms
> >>> run-9: handles 92 objects, takes 920ms, sleep 200ms
> >>> ...
> >>> As we can see the backlog grows indefinitely if the input keeps
changing at
> >>> the rate of 100 obj/s.
> >>>
> >>
> >> I see now, thanks for the explanation.  But isn't this possible today
> >> too?  Maybe less probable though.
> >>
> >> Also, we'll probably hit other issues (e.g., timeouts on CMS side)
> >> because of the backlog which will (likely?) throttle the CMS.
> >>
> >> What would be a good solution?  It seems hard to define what a "large
> >> backlog" is.  If we could do that, maybe triggering a full recompute
> >> when the backlog is large enough might be OK.
> >
> > Hmm.  That's an interesting problem.
> > From my understanding, the cost of incremental processing run consists
> > of two parts: some more or less constant engine run cost E and a cost
> > of processing actual changes P.  Unless CMS is changing the same value
> > over and over again, we can't optimize P by accumulating more changes
> > per run.  But we can amortize the more or less constant cost E.
> >
> > We had a similar problem with the full recompute in the past.  If we
> > assume the cost of a recompute R to be more or less constant regardless
> > of amount of changes, then we can reduce the total cost by aggregating
> > more changes per run.  That's what run_idl_loop() function is trying
> > to do.  It runs IDL for as long as there are changes (capped at 500ms).
> >
> > We can't really avoid paying the cost P and it will compound with the
> > increased number of changes.  The only solution here, AFAICT, is to
> > fall back to full recompute once P > R.  Maybe that can be done by
> > somehow calculating the number of tracked vs real rows in IDL.
> > Current ovn-northd has this problem and introduction of delays will
> > amplify it.
> >
>
> That is still hard to do unfortunately.  The ratio of IDL tracked vs
> "real" rows is still not a good enough indication.  It's possible that
> northd can quickly incrementally process row updates for table T1 but
> that it has to do intensive computations for row updates for table T2.
>
> A simpler (dumb?) approach is to just periodically recompute.  But that
> shouldn't happen too often because then we start using a lot of CPU
> again. :)
>
> >
> > The problem this patch is trying to address, IIUC, is that the cost E
> > is still high.  So, I'm wondering if the solution can be similar to
> > what we already have for a full recompute.

My assumption was that the cost E is relatively small in the I-P. I thought
the patch was trying to help when there were still many recomputes
triggered. Please correct me if I was wrong. I posted some questions
regarding Ales's test result, to get a better understanding.

Thanks,
Han

>>  Current run_idl_loop()
> > runs the IDL only while there are changes and it will not continue
> > running it if the next update is delayed even by a very short time
> > interval.  What we could potentially do here is to wait for some
> > very short time interval (maybe configurable or maybe not) and check
> > the IDL for changes again.  This will allow us to better batch updates
> > and amortize constant engine run cost E without need to sleep for
> > arbitrary 200ms. (We will sleep for arbitrary 1-10ms at a time, but
> > that seems less arbitrary. :D ).
>
> I think we still need to set a max limit for this sequence of short
> sleeps and that max limit needs to be configurable by the user because
> it may increase end-to-end latency for propagating NB changes to
> ovn-controller.
>
> In that case I think I prefer an explicit "arbitrary" max backoff set by
> the user like this patch initially proposed.  But I'm not completely
> against other options.
>
> >
> > Something like (for illustration purposes; not tested):
> >
> > diff --git a/northd/ovn-northd.c b/northd/ovn-northd.c
> > index 4fa1b039e..54e4ecb5f 100644
> > --- a/northd/ovn-northd.c
> > +++ b/northd/ovn-northd.c
> > @@ -689,22 +689,39 @@ run_idl_loop(struct ovsdb_idl_loop *idl_loop,
const char *name)
> >      unsigned long long duration, start = time_msec();
> >      unsigned int seqno = UINT_MAX;
> >      struct ovsdb_idl_txn *txn;
> > +    int n_before_sleep = -1;
> >      int n = 0;
> >
> >      /* Accumulate database changes as long as there are some,
> >       * but no longer than half a second. */
> > -    while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> > -           && time_msec() - start < 500) {
> > -        seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> > -        ovsdb_idl_run(idl_loop->idl);
> > -        n++;
> > +    for (;;) {
> > +        while (seqno != ovsdb_idl_get_seqno(idl_loop->idl)
> > +               && time_msec() - start < 500) {
> > +            seqno = ovsdb_idl_get_seqno(idl_loop->idl);
> > +            ovsdb_idl_run(idl_loop->idl);
> > +            n++;
> > +        }
> > +        /* If we're not out of time yet, try to sleep for a short 10ms
> > +         * in case we'll have more updates.  Don't sleep again if there
> > +         * were no updates after the previous short sleep. */
> > +        if (n > n_before_sleep + 1 && time_msec() - start < 500) {
> > +            n_before_sleep = n;
> > +            poll_timer_wait(10);
> > +            ovsdb_idl_wait(idl_loop->idl);
> > +            poll_block();
> > +            /* Reset seqno, so we try to run IDL at least one more
time. */
> > +            seqno = UINT_MAX;
> > +        } else {
> > +            /* Out of time or no updates since the last sleep. */
> > +            break;
> > +        }
> >      }
> >
> >      txn = ovsdb_idl_loop_run(idl_loop);
> >
> >      duration = time_msec() - start;
> > -    /* ovsdb_idl_run() is called at least 2 times.  Once directly and
> > -     * once in the ovsdb_idl_loop_run().  n > 2 means that we received
> > +    /* ovsdb_idl_run() is called at least 3 times.  Once directly and
> > +     * once in the ovsdb_idl_loop_run().  n > 3 means that we received
> >       * data on at least 2 subsequent calls. */
> >      if (n > 2 || duration > 100) {
> >          VLOG(duration > 500 ? VLL_INFO : VLL_DBG,
> > ---
> >
> > I'm not sure how something like this will impact the total CPU usage.
> > Hopefully, these short sleeps can allow accumulating more changes and
> > amortize the constant costs better while also reducing perceived CPU
> > usage.  Needs testing, for sure.  One other advantage of such a solution
> > is the code locality, i.e. all the logic being in one place.
> >
> > Thoughts?
> >
> > Best regards, Ilya Maximets.
> >
>
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH ovn] northd: Allow delay of northd engine runs

Reply via email to