On Tue, 27 Apr 2021 at 23:08, Numan Siddique <num...@ovn.org> wrote:
>
> On Tue, Apr 27, 2021 at 4:58 PM Francois <rigault.franc...@gmail.com> wrote:
> >
> > On Tue, 27 Apr 2021 at 22:20, Numan Siddique <num...@ovn.org> wrote:
> > >
> > > On Tue, Apr 27, 2021 at 9:11 AM Francois <rigault.franc...@gmail.com>
> > > wrote:
> > >
> > > The ovn-controller running on chassis-1 will not detect the BFD failover.
> >
> > Thanks for your answer! Ok for chassis-1.
> >
> > What I don't understand is why chassis-2, who is aware that chassis-1
> > is down, is not able to act as a gateway for its own ports.
>
> I see what's going on. So ovn-controller on chassis-2 detects the failover
> and claims the cr-<gateway_port>. But ovn-controller on chassis-1 which has
> higher priority claims it back because according to it, BFD is fine.
>
> You can probably monitor the ovn-controller logs on both chassis, and you
> might notice claim/release logs.
>
> Or you can do "tail -f ovnsb_db.db" and see that there are constant updates
> to the cr-<gateway_port>.
>
> Having 3 chassis will not result in this split brain scenario which you have
> probably observed.
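This is roughly how that churn can be watched, as a sketch; the SB socket
path, log path and the cr-/tunnel interface names below are placeholders
that depend on the deployment:

  # Follow the chassisredirect port binding in the southbound DB
  # (run wherever the SB DB is reachable; the socket path varies by packaging):
  ovsdb-client monitor unix:/var/run/ovn/ovnsb_db.sock \
      OVN_Southbound Port_Binding logical_port,chassis

  # Claim/release messages in the ovn-controller log on both chassis:
  grep -iE 'claim|releas' /var/log/ovn/ovn-controller.log

  # BFD state that chassis-2 sees on its tunnel towards chassis-1
  # ("ovn-chassis1-0" is a placeholder; list tunnel ports with "ovs-vsctl show"):
  ovs-vsctl get Interface ovn-chassis1-0 bfd_status:state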
I am going to do a bit more research and see what happens on a real OpenStack
installation; maybe I messed up somewhere. There is nothing logged by
ovn-controller, and nothing flooding the DB (apart from one line saying the
port_binding is down).

My understanding was that the gateway move (as it happens for chassis-3)
happens without involving the control plane; in other words, if the first
gateway fails, the flows needed to move traffic to the second gateway are
already installed and can be used straight away.

I am puzzled because if I trace the packet from chassis-2, before and after
chassis-1 dies, it always ends up in flow 37:

  reg15=0x3,metadata=0x4, priority 100, cookie 0x7a15360f
  set_field:0x4/0xffffff->tun_id
  set_field:0x3->tun_metadata0
  move:NXM_NX_REG14[0..14]->NXM_NX_TUN_METADATA0[16..30]
   -> NXM_NX_TUN_METADATA0[16..30] is now 0x1
  bundle(eth_src,0,active_backup,ofport,members:7)

The only difference is that, when chassis-1 is up, there is an additional

  -> output to kernel tunnel

It seems that there is no backup flow for packets that do not go through a
tunnel but go straight to the external network.

Before tackling the tricky cases, I would like to make it work when it fails
"as documented" :), with just one chassis dying and traffic being quickly
dispatched somewhere else.

Thanks
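PS: to be concrete about what I am comparing before and after chassis-1 dies,
here is a sketch of the checks; br-int, table 37, the lrp-external and
cr-lrp-external names, the in_port and the addresses are placeholders from my
lab to adapt (and dl_dst should be the logical router port MAC):

  # Gateway chassis and priorities configured for the router port:
  ovn-nbctl lrp-get-gateway-chassis lrp-external

  # Which chassis the southbound DB currently binds the chassisredirect port to:
  ovn-sbctl --columns=logical_port,chassis find Port_Binding \
      logical_port=cr-lrp-external

  # The flows feeding the bundle action quoted above
  # (table 37 matches my trace; the table number can differ across OVN versions):
  ovs-ofctl dump-flows br-int table=37 | grep bundle

  # Re-run the trace from a local VIF towards an external destination:
  ovs-appctl ofproto/trace br-int \
      'in_port=vm1,dl_src=50:54:00:00:00:01,dl_dst=00:00:5e:00:01:01,ip,nw_src=10.0.0.5,nw_dst=192.0.2.10,nw_ttl=64'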