On Thu, Apr 30, 2020 at 8:43 AM Han Zhou <hz...@ovn.org> wrote:

> On Wed, Apr 29, 2020 at 4:50 PM Han Zhou <hz...@ovn.org> wrote:
> >
> >
> >
> > On Wed, Apr 29, 2020 at 2:45 PM Dumitru Ceara <dce...@redhat.com> wrote:
> > >
> > > On 4/29/20 9:57 PM, Han Zhou wrote:
> > > >
> > > >
> > > > On Wed, Apr 29, 2020 at 12:17 PM Numan Siddique <num...@ovn.org> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On Wed, Apr 29, 2020 at 9:57 PM Dumitru Ceara <dce...@redhat.com> wrote:
> > > >>>
> > > >>> In some cases, if the NB/SB databases ovn-northd connects to are
> > > >>> inconsistent, ovn-northd might generate transactions that fail
> > > >>> continuously due to failed integrity checks on the SB database server.
> > > >>>
> > > >>> The first patch of the series addresses inconsistencies due to stale
> > > >>> Datapath_Binding records in the SB database.
> > > >>>
> > > >>> The second patch of the series addresses inconsistencies due to stale
> > > >>> tunnel_key values in various SB database table records.
> > > >>>
> > > >>> Reported-by: Dan Williams <d...@redhat.com>
> > > >>> Reported-at: https://bugzilla.redhat.com/1828637
> > > >>> Signed-off-by: Dumitru Ceara <dce...@redhat.com>
> > > >>>
> > > >>> Dumitru Ceara (2):
> > > >>>       ovn-northd: Clear SB records depending on stale datapaths.
> > > >>>       ovn-northd: Fix tunnel_key allocation for SB records.
> > > >>
> > > >>
> > > >> Hi Dumitru,
> > > >>
> > > >> I did some testing in my ovn-fake-multinode setup. These are my
> > > >> observations.
> > > >>
> > > >> I created a logical switch sw0 with 4 logical ports. So the next
> > > >> tunnel key should be 5.
> > > >> I stopped ovn-northd and created a couple of port_binding entries
> > > >> manually using "ovn-sbctl create port_binding" with tunnel keys 5 and 6.
> > > >> I also created a logical port in sw0. Then I started ovn-northd.
> > > >> ovn-northd deletes the port binding entries added by me and creates
> > > >> the port_binding entry for the logical port with tunnel_key=5 in the
> > > >> same transaction.
> > > >>
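> > > >> Roughly, the sequence looks like this (the extra port names are just
> > > >> placeholders, and the Datapath_Binding lookup assumes the sw0 datapath
> > > >> is tagged with external_ids:name=sw0):
> > > >>
> > > >> # Stop northd, then inject two Port_Binding rows by hand.
> > > >> $ ovn-ctl stop_northd
> > > >> $ dp=$(ovn-sbctl --bare --columns _uuid find datapath_binding external_ids:name=sw0)
> > > >> $ ovn-sbctl create port_binding logical_port=fake1 tunnel_key=5 datapath="$dp"
> > > >> $ ovn-sbctl create port_binding logical_port=fake2 tunnel_key=6 datapath="$dp"
> > > >> # Add one more logical port in the NB db and restart northd.
> > > >> $ ovn-nbctl lsp-add sw0 sw0-p5
> > > >> $ ovn-ctl start_northd
> > > >>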
> > > >> I think ovn-northd syncs the south db based on the contents of the
> > > >> north db.
> > > >>
> > > >> There's no harm in having your patches. But I'm not really sure if
> > > >> they resolve the issue we have observed.
> > > >>
> > > >> Just to brief everyone about the issue we are seeing, we see the
> > > >> logs below in ovn-northd.
> > > >>
> > > >> *******
> > > >> 2020-04-16T23:02:33Z|00127|ovsdb_idl|WARN|transaction error:
> > > >> {"details":"Transaction causes multiple rows in \"Port_Binding\" table
> > > >> to have identical values (23eb9016-45f9-4158-be35-77b2713b9a0f and 7)
> > > >> for index on columns \"datapath\" and \"tunnel_key\".  First row, with
> > > >> UUID e4f11a7b-09b6-454f-a125-34cc4b144ef6, had the following index
> > > >> values before the transaction: bdbb436e-f98c-4651-9b80-6e8b95044560 and
> > > >> 7.  Second row, with UUID d37cc3f1-8633-440f-b145-8222a0d4723c, existed
> > > >> in the database before this transaction and was not modified by the
> > > >> transaction.","error":"constraint violation"}
> > > >> *******
> > > >>
> > > >> And because of this constraint violation error, ovn-northd cannot
> > > >> write to the SB DB any further until it is restarted.
> > > >>
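> > > >> (In other words, the only recovery at that point is to restart the
> > > >> daemon, e.g. with "ovn-ctl restart_northd" or the equivalent service
> > > >> restart, depending on how OVN is packaged.)
> > > >>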
> > > >> In my opinion this can only happen if ovn-northd doesn't see the port
> > > >> binding row (which is actually present in the DB) in its IDL in-memory
> > > >> db.
> > > >> I suspect this could have happened when ovn-northd reconnects to the
> > > >> same master or connects to a new master and it doesn't get the proper
> > > >> updates.
> > > >>
> > > >> Maybe in this case, the IDL should request the db contents with
> > > >> txn id = 0, so that it receives a complete dump of the db.
> > > >>
> > > >> Is it possible that ovn-northd sees a port binding with a tunnel key
> > > >> 'x' and still allocates the same tunnel key 'x' to a new logical port?
> > > >> If so, then your patches definitely make sense.
> > > >>
> > > >> @Han - Have you seen this issue in your deployments? Do you have any
> > > >> comments here?
> > > >>
> > > >> Thanks
> > > >> Numan
> > > >>
> > > > I never saw such an issue before, but I am not sure if this is
> > > > possible due to bugs. Currently there is a bug fix under review:
> > > >
> > > > https://patchwork.ozlabs.org/project/openvswitch/patch/20200422183842.6303.99600.st...@dceara.remote.csb/
> > > >
> > > > However, northd doesn't conditionally monitor the rows, so I am not
> > > > sure if this is the root cause of the northd inconsistency issue
> > > > discussed here.
> > > >
> > > > I don't think we should fix northd (or ovn-controller) to handle
> > > > inconsistency in ovsdb. Consistency should be expected from ovsdb,
> > > > and we should fix ovsdb/IDL when there is such a bug. Otherwise,
> > > > there might be too many places to fix, or even re-design. My
> > > > understanding is that if the ovsdb IDL sees temporarily stale data,
> > > > the current northd/ovn-controller logic should be able to correct
> > > > itself once the data is up-to-date. Moreover, in clustered mode
> > > > northd connects to the leader only, which avoids the possibility of
> > > > it seeing stale data (unless there is a bug).
> > > >
> > > > To summarize, I think we need to find the root cause of the
> > > > inconsistency between the IDL and the server and fix it there,
> > > > instead of changing ovn-northd to accommodate the inconsistency.
> > > > (Consistency is the biggest advantage of OVSDB; it eases application
> > > > implementation.)
> > > >
> > > > Thanks,
> > > > Han
> > >
> > > Hi Han, Numan,
> > >
> > > I might have misused "inconsistency" in this context. What I meant was
> > > more along the lines of "discrepancies between the NB and SB databases".
> > >
> > > This is a very simple reproducer for the port_binding tunnel_key issue,
> > > no clustering of NB/SB dbs involved:
> > >
> > > # Create two logical switches with one port each.
> > > $ ovn-nbctl ls-add ls1
> > > $ ovn-nbctl lsp-add ls1 p1
> > > $ ovn-nbctl ls-add ls2
> > > $ ovn-nbctl lsp-add ls2 p2
> > > $ ovn-nbctl --wait=sb sync
> > >
> > > # At this point PB for p1 has tunnel_key=1
> > > # At this point PB for p2 has tunnel_key=2
> > >
> > > # Simulate the SB db going away (could be network
> > > # issues or crash or some other event).
> > > $ ovn-ctl stop_sb_ovsdb
> > >
> > > # CMS decides to move p2 from ls2 to ls1 and removes
> > > # ls2 completely.
> > > $ ovn-nbctl ls-del ls2
> > > $ ovn-nbctl lsp-add ls1 p2
> > >
> > > # Simulate SB DB coming back online.
> > > $ ovn-ctl start_sb_ovsdb
> > >
> > > At this point ovn-northd will try to set the datapath field in PB2 to
> > > point to datapath_binding corresponding to ls1 but will *not* change
> > > tunnel_key.
> > >
> > > We get:
> > > 2020-04-29T20:52:41.327Z|00016|ovsdb_idl|WARN|transaction error:
> > > {"details":"Transaction causes multiple rows in \"Port_Binding\" table
> > > to have identical values (1b1c4b39-c045-448d-a532-8edbe5544e13 and 1)
> > > for index on columns \"datapath\" and \"tunnel_key\".  First row, with
> > > UUID e20219fa-ef67-49a2-81cd-739fa80d2bd4, existed in the database
> > > before this transaction and was not modified by the transaction.
> > > Second row, with UUID 50b0e240-8a4d-4e98-8e2f-97c94811d1b1, had the
> > > following index values before the transaction:
> > > a9b5959f-2f48-44e7-b6bb-f7148c28e4b5 and 1.","error":"constraint
> > > violation"}
> > >
> > > And ovn-northd keeps retrying the same transaction at every iteration
> > > from this point on and fails continuously.
> > >
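> > > For reference, the clashing rows are easy to see directly in the SB DB
> > > with something like the following (the column selection is just an
> > > example):
> > >
> > > $ ovn-sbctl --columns=logical_port,datapath,tunnel_key list port_binding
> > >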
> > > For the stale datapath issue (patch #1 in the series) a similar
> > > reproducer is:
> > >
> > > # Create a logical router with one router port.
> > > $ ovn-nbctl lr-add lr
> > > $ ovn-nbctl lrp-add lr p 00:00:00:00:00:01 1.1.1.1/24
> > >
> > > # Simulate that a mac binding was created for the router
> > > # port.
> > > $ dp=$(ovn-sbctl --bare --columns _uuid list datapath)
> > > $ ovn-sbctl create mac_binding logical_port="p" ip="1.1.1.2" datapath="$dp"
> > > $ ovn-nbctl --wait=sb sync
> > >
> > > # Simulate the SB db going away (could be network
> > > # issues or crash or some other event).
> > > $ ovn-ctl stop_sb_ovsdb
> > >
> > > # CMS decides to delete lr.
> > > $ ovn-nbctl lr-del lr
> > >
> > > # CMS decides to readd lr and router port.
> > > $ ovn-nbctl lr-add lr
> > > $ ovn-nbctl lrp-add lr p 00:00:00:00:00:01 1.1.1.1/24
> > >
> > > # Simulate SB DB coming back online.
> > > $ ovn-ctl start_sb_ovsdb
> > >
> > > At this point ovn-northd will try to clear the old datapath record from
> > > SB DB *without* destroying the mac binding record.
> > >
> > > We get:
> > > 2020-04-29T21:41:42.145Z|00013|ovsdb_idl|WARN|transaction error:
> > > {"details":"cannot delete Datapath_Binding row
> > > de8d19d6-d67b-499b-8825-12d34ec60946 because of 1 remaining
> > > reference(s)","error":"referential integrity violation"}
> > >
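> > > Until ovn-northd handles this, the dangling row can also be removed by
> > > hand, e.g. along these lines (the find filter simply matches the
> > > mac_binding created above):
> > >
> > > $ mb=$(ovn-sbctl --bare --columns _uuid find mac_binding logical_port=p ip="1.1.1.2")
> > > $ ovn-sbctl destroy mac_binding "$mb"
> > >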
> > > I think both situations above should be addressed by ovn-northd and
> > > stale datapath/mac_binding/port_binding/etc records should be purged. I
> > > guess there might be other scenarios that would trigger constraint
> > > violations too but this is what I found so far.
> > >
> > > If you agree, I can send a v2 and add tests for the two simplified
> > > scenarios I mentioned above.
> > >
> > > What do you think?
>

Thanks, Dumitru, for the explanation. It would be great to add these tests
in v2.

Thanks
Numan


> > >
> > > Thanks,
> > > Dumitru
> > >
> >
> > Thanks Dumitru for explaining. Now I understand the problem. So it has
> > nothing to do with OVSDB consistency itself, but just with northd's
> > logic. I don't even need to stop the SB to reproduce it. Here is how I
> > reproduced it:
> > $ ovn-nbctl ls-add ls1
> > $ ovn-nbctl ls-add ls2
> > $ ovn-nbctl lsp-add ls1 lsp1
> > $ ovn-nbctl lsp-add ls2 lsp2
> > $ ovn-nbctl lsp-del ls2 -- lsp-add ls1 lsp2
>
> Sorry for the typo. The last command was:
> $ ovn-nbctl lsp-del lsp2 -- lsp-add ls1 lsp2
>
> >
> > 2020-04-29T23:46:17.675Z|00007|ovsdb_idl|WARN|transaction error:
> > {"details":"Transaction causes multiple rows in \"Port_Binding\" table to
> > have identical values (be595a3b-3904-4229-9ba2-884b27a86b75 and 1) for
> > index on columns \"datapath\" and \"tunnel_key\".  First row, with UUID
> > d4cc6ec5-4817-47c9-aa83-9985d3b7b452, existed in the database before this
> > transaction and was not modified by the transaction.  Second row, with UUID
> > b874ab93-d97a-4583-8ac3-c353a40b180d, had the following index values before
> > the transaction: 6940ad91-83c5-4fe9-bab5-4fbec6714b0d and
> > 1.","error":"constraint violation"}
> >
> > I will take a closer look at the fix.
> >
> > Thanks,
> > Han
_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
