On Wed, Feb 19, 2020 at 2:11 AM Numan Siddique <num...@ovn.org> wrote:
>
> On Wed, Feb 19, 2020 at 1:16 PM Han Zhou <zhou...@gmail.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 11:56 AM Numan Siddique <num...@ovn.org> wrote:
> > >
> > > On Tue, Feb 18, 2020 at 11:50 PM Han Zhou <zhou...@gmail.com> wrote:
> > > >
> > > > On Tue, Feb 18, 2020 at 9:27 AM Numan Siddique <num...@ovn.org>
wrote:
> > > >
> > > > > On Tue, Feb 18, 2020 at 10:34 PM Han Zhou <zhou...@gmail.com>
wrote:
> > > > > >
> > > > > > On Tue, Feb 18, 2020 at 7:32 AM <num...@ovn.org> wrote:
> > > > > > >
> > > > > > > From: Numan Siddique <num...@ovn.org>
> > > > > > >
> > > > > > > After the patch [1], which added caching of lflow expr, the
> > > > > > lflow_resource_ref
> > > > > > > is not rebuilt properly when lflow_run() is called. If a
lflow is
> > > > > already
> > > > > > cached
> > > > > > > in lflow expr cache, then the lflow_resource_ref is not
updated.
> > > > > > > But flow_output_run() clears the lflow_resource_ref cache
before
> > > > > calling
> > > > > > lflow_run().
> > > > > > >
> > > > > > > This results in incorrect OF flows present in the switch even
if
> > the
> > > > > > > address set/port group references have changed.
> > > > > > >
> > > > > > > This patch fixes this issue by not clearing the
lflow_resource_ref
> > > > > cache.
> > > > > > > This cache gets cleaned up during lflow change handler or in
> > > > > lflow_run().
> > > > > >
> > > > > > Hi Numan,
> > > > > >
> > > > > > This approach looks dangerous to me, since the
lflow_resource_ref
> > is not
> > > > > a
> > > > > > cache but part of the engine data. Originally, the life cycle
of it
> > > > > follows
> > > > > > the same principle like other data of I-P engine, now if we
change
> > the
> > > > > > principle we need to be very careful.
> > > > > > At least one scenario would have problem. E.g. when there is a
> > pending
> > > > > > transaction, we cannot run engine in that iteration, and we will
> > trigger
> > > > > a
> > > > > > complete recompute next time, because the tracking data would be
> > lost in
> > > > > > the next iteration. So it is not possible to call
> > > > > > lflow_resource_destroy_lflow() for deleted rows in that case
because
> > > > > there
> > > > > > won't be any deleted flows tracked. It seems the commit 672508f6
> > > > > > (ovn-controller: Fix memory issues due to lflow expr caching.)
> > would have
> > > > > > similar problem, too, in that case.
> > > > > >
> > > > > > I am not sure if there would be other corner cases that would
have
> > issue
> > > > > > with this approach. Probably we can think more of it for making
the
> > data
> > > > > > required to build the lflow_resource_ref part of the expr cache.
> > > > >
> > > > > Hi Han,
> > > > >
> > > > > I'll take a closer look at your comments and come back tomorrow.
> > > > > Is it wise to disable or revert  lflow expr caching commit ? And
then
> > > > > address all these
> > > > > issues properly ? SInce we are close to 20.03 release .
> > > > >
> > > >
> > > > Yes, reverting for 20.03 sounds good to me. We can always back port
> > later
> > > > when it’s mature enough.
> > >
> > > Hi Han,
> > >
> > > I've submitted v2 by taking a different approach as suggested by you.
v2
> > will
> > > now store the addr set and port group name sets as part of lflow expr
> > cache
> > > and add those to the lflow resource ref.
> > >
> > > Kindly take a look. I think this should work now without any issues
:).
> > >
> > > Thanks
> > > Numan
> > >
> >
> > Hi Numan,
> >
> > Thanks for the revision. I acked.
> >
>
> Thanks for the review.
>
> > For the change-tracking problem I mentioned for commit 672508f6
> > (ovn-controller: Fix memory issues due to lflow expr caching.), do you
plan
> > to address that as well? It would still have memory leak in some corner
> > cases, but the impact may be minor.
>
> I wouldn't say memory leak, but it would be in the cache lingering around.
> But I see the chances are less.
>
> Lets say few logical flows are deleted. If
> lflow_handle_changed_flows() handler is called,
> it'll clear the expr cache for these lflows. If full recompute is
> triggered, then lflow_run()
> will clear the expr cache if the deleted logical flows are still in
> the tracked buffer. If however,
> for some reason (may be due to engine abort) the deleted lflows are not
tracked,
> the expr cache will not be cleared for these lflows.
>
The common case for this to happen is when there is an update to SB from
this chassis, e.g. for binding a port or mac-binding update, when the
notification for the update comes, the ovnsb_idl_txn is NULL, so the round
of engine_run() may be skipped. If in the same DB update notification there
are also logical_flow deletions involved, the change tracking is lost and
the cache is left lingering.

With commit e2ab60e3a7 ("ovn-controller: Run I-P engine even when no SB txn
is available.") the chance becomes smaller. It happens only if in the same
iteration when ovnsb_idl_txn is NULL, another updates to SB is required,
e.g. another port-binding change or mac-binding change happened on this
chassis.

> One way to address this would be to periodically clear the expr cache
> say every 'x' minutes.
> What do you think of this approach ? This would be a straightforward
> implementation.
>

Given that the issue is not triggered frequently, the periodical cleanup
sounds like a good idea. It may be undesirable to completely flush the
cache, which can result in spike of CPU and latency every 'x' minutes.
Instead, would it be better to do "garbage collection" every 'x' minutes,
i.e. compare cache with all SB lflows, and delete only the non-existed
entries?

Thanks,
Han
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to