On Tue, May 14, 2024 at 12:31 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> On 5/8/24 18:01, Numan Siddique wrote:
> > On Wed, May 8, 2024 at 8:42 AM Шагов Георгий via discuss <
> > ovs-discuss@openvswitch.org> wrote:
> >
> >> Hello everyone
> >>
> >> In some respects this might be considered a continuation of this thread:
> >> (link1), yet it is different.
> >>
> >> After we upgraded from OVN 22.03 to OVN 24.03, we did indeed see a 3-4x
> >> increase in performance.
> >>
> >> And yet we still observe high CPU load for the northd process; digging
> >> deeper into the logs, we found:
> >>
> >
> > Thanks for reporting this issue.
> >
> >
> >> 2024-05-07T08:36:46.505Z|18503|poll_loop|INFO|wakeup due to [POLLIN] on fd 15 (10.34.22.66:60716<->10.34.22.66:6642) at lib/stream-fd.c:157 (94% CPU usage)
> >>
> >> *2024-05-07T08:37:38.857Z|18504|inc_proc_eng|INFO|node: northd, recompute (missing handler for input SB_datapath_binding) took 52313ms*
> >>
> >> *2024-05-07T08:37:48.335Z|18505|inc_proc_eng|INFO|node: lflow, recompute (failed handler for input northd) took 7759ms*
> >>
> >> *2024-05-07T08:37:48.718Z|18506|timeval|WARN|Unreasonably long 62213ms poll interval (56201ms user, 2900ms system)*
> >>
> >> As you can see, there is a significant delay of 52 seconds.
> >>
>
> This is huge indeed!
>
> >> Please correct me if I am wrong, but in my understanding '*missing handler
> >> for*' practically means the absence of an inc-engine handler for one of
> >> the node's inputs (in this sample: *SB_datapath_binding*).
> >>
> >
> > That's correct.
> >
> >> Before plunging into development, it would be great to clarify/align with
> >> the community's position:
> >>
> >>    - Why is there no handler for this node?
> >>
> > Our approach has been to add a handler for an input change only if that
> > change is frequent or if it can be handled easily.
> > We have also skipped adding handlers where they would increase code
> > complexity. Having said that, I think we are open
> > to adding more handlers if it makes sense or if it results in scale
> > improvements.
> >
> > Right now we fall back to a full recompute of the northd engine node for
> > any change to a logical switch or logical router.
> > Does your deployment create/delete logical switches/routers frequently?
> > Is it possible to enable OVN debug logs
> > and share them? I'm curious to know what the changes to SB datapath
> > binding are.
> >
> > Feel free to share your OVN NB and SB DBs if you're OK with it. I can
> > deploy those DBs and see why the recompute is so expensive.
> >
> >>    - Is there any particular reason for this, or did the peculiarities of
> >>    our installation just highlight this issue?
> >>
> >>
> > My guess is that your installation is frequently creating, deleting or
> > modifying logical switches or routers.
> >
> >
> >>    - Do you think it makes sense to implement that handler
> >>    (*SB_datapath_binding*)?
> >>
> >>
> > I'm fine with adding a handler if it helps with scale. In our use cases,
> > we don't frequently create/delete logical switches and routers,
> > and hence it is OK to fall back to full recomputes for such changes.
> >
> >
> >> Any ideas are highly appreciated.
> >>
> >
> > You're welcome to work on it and submit patches to add a handler for
> > SB_datapath_binding.
> >
> > @Dumitru Ceara <dce...@redhat.com> @Han Zhou <hz...@ovn.org> if you have
> > any reservations about adding more handlers, please do comment here.
> >
>
> In general, especially if it fixes a scalability issue like this one,
> it's probably fine.  In practice it depends a bit on how much complexity
> this would add to the code.
>
I agree with the general statement.

> But the best way to tell is to have a way to reproduce this, e.g., the NB/SB
> databases and the NB/SB jsonrpc update that caused the recompute.
>

Yes, it would be better to understand why the recompute took so long (52s)
in this deployment. Is the scale simply too large, or is it because of some
uncommon configuration that we don't handle efficiently and that could be
optimized to improve recompute performance?

Otherwise, even if we implement I-P for datapaths, there may just be another
input change that triggers a recompute and causes the same latency. It is
just not sustainable to maintain more and more I-P in northd.
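To make the two recompute reasons in the logs above concrete, here is a toy model of how a node decides between incremental handling and a full recompute. It is a minimal sketch under stated assumptions. All names (`toy_node`, `toy_input`, `toy_node_needs_recompute`, `handled_incrementally`, `gives_up`) are hypothetical and do not mirror OVN's actual incremental processing engine code. It only encodes what the log lines imply: a changed input with no handler forces a recompute ("missing handler"), and a handler that gives up also forces one ("failed handler").

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical handler type: returns true if the input change was applied
 * incrementally, false if the handler gave up. */
typedef bool (*toy_change_handler)(void *data);

/* One input of an engine node (hypothetical model, not OVN's API). */
struct toy_input {
    const char *name;            /* e.g. "SB_datapath_binding" */
    bool changed;                /* did this input change in this run? */
    toy_change_handler handler;  /* NULL: no incremental handler exists */
};

/* An engine node such as "northd" or "lflow" (hypothetical model). */
struct toy_node {
    const char *name;
    struct toy_input *inputs;
    size_t n_inputs;
};

/* Example handler: the change can be applied incrementally. */
static bool
handled_incrementally(void *data)
{
    (void) data;
    return true;
}

/* Example handler: it cannot cope with this change. */
static bool
gives_up(void *data)
{
    (void) data;
    return false;
}

/* Returns true if 'node' must fall back to a full recompute and writes the
 * reason, worded like the inc_proc_eng log messages, into 'reason'. */
static bool
toy_node_needs_recompute(const struct toy_node *node,
                         char *reason, size_t reason_len)
{
    for (size_t i = 0; i < node->n_inputs; i++) {
        const struct toy_input *in = &node->inputs[i];
        if (!in->changed) {
            continue;               /* unchanged inputs cost nothing */
        }
        if (!in->handler) {
            snprintf(reason, reason_len, "missing handler for input %s",
                     in->name);
            return true;
        }
        if (!in->handler(NULL)) {
            snprintf(reason, reason_len, "failed handler for input %s",
                     in->name);
            return true;
        }
    }
    reason[0] = '\0';               /* every change handled incrementally */
    return false;
}
```

In this model, a changed `SB_datapath_binding` input with a NULL handler makes a `northd` node report "missing handler for input SB_datapath_binding". Adding a handler removes only that one trigger. Any other changed input without a handler, or with a handler that returns false, still forces a full recompute. That is why making the recompute itself cheaper matters more than adding handlers one by one.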

> Regards,
> Dumitru
>
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss