[ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
Hi folks,

While working with an OpenStack environment running OVN and ovsdb-server in an A/P configuration with Pacemaker, we hit an issue that has probably been around for a long time. The bug itself seems to be related to ovsdb-server not updating the read-only flag properly.

With a 3-node cluster running ovsdb-server in active/passive mode, when we restart the master node, Pacemaker promotes another node to master and moves the associated IPAddr2 resource to it. At this point, ovn-controller instances across the cloud reconnect to the new node, but there's a window where ovsdb-server is still running as backup.

For those ovn-controller instances that reconnect within that window, every attempt to write to the OVSDB will fail with "operation not allowed when database server is in read only mode". This state persists indefinitely unless a reconnection is forced. Restarting ovn-controller or killing the connection (for example with tcpkill) makes things work again.

A workaround in the OVN OCF script could be for the ovsdb_server_promote function to wait until we get 'running/active' on that instance.

Another open question is what clients (in this case, ovn-controller) should do in such a situation. Should they log an error and attempt a (rate-limited) reconnection?

Thoughts?

Thanks a lot,
Daniel
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
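The workaround suggested above (have the promote step wait until the instance really reports itself as active) could look roughly like the following Python sketch. It is only an illustration under assumptions: it polls the `ovsdb-server/sync-status` command that appears later in this thread and looks for a `state: active` line, whereas the real OCF script is a shell script and may check a different status string. The function names, the timeout, and the polling interval are all made up for the example.

```python
import subprocess
import time


def parse_sync_state(output):
    """Return the value of the 'state:' line in sync-status output, or None."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "state":
            return value.strip()
    return None


def query_sync_status(ctl_target):
    """Run the sync-status command shown in this thread and return its stdout.

    'ctl_target' is whatever is passed to 'ovs-appctl -t' (assumed here to be
    the control socket of the ovsdb-server instance being promoted).
    """
    result = subprocess.run(
        ["ovs-appctl", "-t", ctl_target, "ovsdb-server/sync-status"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def wait_until_active(get_output, timeout=30.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_output() until it reports 'state: active'.

    Returns True once the server is active, False if 'timeout' seconds pass.
    """
    deadline = clock() + timeout
    while True:
        if parse_sync_state(get_output()) == "active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A promote step could then call `wait_until_active(lambda: query_sync_status(target))` before reporting success. Note that this only narrows the window described above; clients that connected while the server was still a backup would keep their read-only sessions.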
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
On Mon, Jul 8, 2019 at 3:52 PM Daniel Alvarez Sanchez wrote:
> [...]

Thanks for reporting this issue, Daniel.

I can easily reproduce the issue with the commands below.

$ export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach)
$ ovn-nbctl ls-add sw0
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/sync-status
state: active
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/set-active-ovsdb-server tcp:192.0.2.2:6641
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/connect-active-ovsdb-server
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/sync-status
state: backup
connecting: tcp:192.0.2.2:6641
$ ovn-nbctl ls-add sw1
--> This should have failed. Since OVN_NB_DAEMON is set, ovn-nbctl talks to the ovn-nbctl daemon, and it is able to create a logical switch even though the db is in backup mode.
$ unset OVN_NB_DAEMON
$ ovn-nbctl ls-add sw2
ovn-nbctl: transaction error: {"details":"insert operation not allowed when database server is in read only mode","error":"not allowed"}

I looked into the ovsdb-server code: when the user changes the state of the ovsdb-server, the read_only param of active ovsdb_server_sessions is not updated.

Thanks
Numan
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
Hi,

Thanks for reporting, Daniel.

On Mon, Jul 8, 2019 at 11:22 AM Daniel Alvarez Sanchez wrote:
> [...]
> Another open question is what should clients (in this case,
> ovn-controller) do in such situation? Shall they log an error and
> attempt a reconnection (rate limited)?

I would say so; ovn-controller _requires_ a read-write session to function properly. It can either retry reconnecting forever as you suggested, assert and exit if the connection is read-only, or combine the two (retry first and then exit).

Also, we need to improve the logs for such errors. While debugging the problem it wasn't "easy" to find out why ovn-controller wasn't updating the database (we were looking at the nb_cfg column of the Chassis table in the Southbound OVSDB). We checked the state of the connection (it was stable), the process was healthy, etc. It was only when we enabled the DBG log level for ovn-controller that we started seeing messages such as:

2019-07-04T15:11:19.522Z|00148|jsonrpc|DBG|tcp:172.17.1.27:6642: received notification, method="update2", params=[["monid","OVN_Southbound"],{"Chassis":{"cb669c72-0f84-412c-a3bf-482119649d85":{"modify":{"nb_cfg":3300]

2019-07-04T15:11:19.522Z|00149|jsonrpc|DBG|tcp:172.17.1.27:6642: received reply, result=[{"details":"update operation not allowed when database server is in read only mode","error":"not allowed"}], id=8062

So perhaps logging it as ERROR would be better, because without the DBG level all we could see in the logs was two INFO messages saying that it reconnected to the Southbound OVSDB.

Cheers,
Lucas
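The client-side behaviour discussed above (log at ERROR, retry the connection with rate limiting, then give up) can be sketched in Python as follows. This is not ovn-controller code, which is written in C; it is a hedged illustration that only assumes the shape of the transaction reply shown in the DBG log above. The class name, the action strings, and the default limits are all hypothetical.

```python
import json
import logging

log = logging.getLogger("ovsdb-client-sketch")

# Substring taken from the "details" field of the error seen in this thread.
READ_ONLY_MARKER = "database server is in read only mode"


def is_read_only_rejection(result):
    """True if any per-operation result is the read-only 'not allowed' error."""
    for op in result:
        if (isinstance(op, dict)
                and op.get("error") == "not allowed"
                and READ_ONLY_MARKER in op.get("details", "")):
            return True
    return False


class ReconnectPolicy:
    """Allow at most max_attempts reconnects, no closer than min_interval."""

    def __init__(self, max_attempts=5, min_interval=1.0):
        self.max_attempts = max_attempts
        self.min_interval = min_interval
        self.attempts = 0
        self.last_attempt = None

    def decide(self, now):
        """Return 'reconnect', 'wait', or 'give_up' for a rejection at 'now'."""
        if self.attempts >= self.max_attempts:
            return "give_up"
        if (self.last_attempt is not None
                and now - self.last_attempt < self.min_interval):
            return "wait"
        self.attempts += 1
        self.last_attempt = now
        return "reconnect"


def handle_reply(result, policy, now):
    """Log read-only rejections at ERROR level and return the chosen action."""
    if not is_read_only_rejection(result):
        return "ok"
    action = policy.decide(now)
    log.error("write rejected: server is read-only (action: %s)", action)
    return action


if __name__ == "__main__":
    # The reply body from the second log line quoted above.
    reply = json.loads(
        '[{"details":"update operation not allowed when database server '
        'is in read only mode","error":"not allowed"}]'
    )
    print(handle_reply(reply, ReconnectPolicy(), now=0.0))
```

With a policy like this, the first rejection would trigger a forced reconnect and an ERROR line in the log, instead of the failure being visible only at DBG level.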
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
I *think* that it may not be a bug in ovsdb-server but a problem with ovn-controller, as it doesn't seem to be a DB change aware client.

When the role changes from master to backup or vice versa, connections are expected to be reestablished for all clients except those that are not aware of db changes [0] (note the 'false' argument). This flag is explained here [1], and it looks like, since ovn-controller is not monitoring the Database table in the _Server database, the connection with it is not re-established. This is just a blind guess but I can give it a shot :)

[0] https://github.com/openvswitch/ovs/blob/403a6a0cb003f1d48b0a3cbf11a2806c45e9d076/ovsdb/jsonrpc-server.c#L368
[1] https://github.com/openvswitch/ovs/blob/403a6a0cb003f1d48b0a3cbf11a2806c45e9d076/ovsdb/jsonrpc-server.c#L450-L456

On Mon, Jul 8, 2019 at 12:45 PM Numan Siddique wrote:
> [...]
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
I tried a simple patch and it fixes the issue (see below). The question now is, do we want to do this? I think it makes sense to drop *all* the connections when the role changes, but I'm curious to see what other people think:

diff --git a/ovsdb/jsonrpc-server.c b/ovsdb/jsonrpc-server.c
index 4dda63a..ddbbc2e 100644
--- a/ovsdb/jsonrpc-server.c
+++ b/ovsdb/jsonrpc-server.c
@@ -365,7 +365,7 @@ ovsdb_jsonrpc_server_set_read_only(struct ovsdb_jsonrpc_server *svr,
 {
     if (svr->read_only != read_only) {
         svr->read_only = read_only;
-        ovsdb_jsonrpc_server_reconnect(svr, false,
+        ovsdb_jsonrpc_server_reconnect(svr, true,
                                        xstrdup(read_only
                                                ? "making server read-only"
                                                : "making server read/write"));

$ export OVN_NB_DAEMON=$(ovn-nbctl --pidfile --detach)
$ ovn-nbctl ls-add sw0
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/sync-status
state: active
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/set-active-ovsdb-server tcp:192.0.2.2:6641
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/connect-active-ovsdb-server
$ ovs-appctl -t $PWD/sandbox/nb1 ovsdb-server/sync-status
state: backup
connecting: tcp:192.0.2.2:6641
$ ovn-nbctl ls-add sw1
ovn-nbctl: transaction error: {"details":"insert operation not allowed when database server is in read only mode","error":"not allowed"}

On Mon, Jul 8, 2019 at 1:25 PM Daniel Alvarez Sanchez wrote:
> [...]
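To make the effect of flipping that one argument easier to see, here is a small toy model in Python. It is not the ovsdb-server implementation. It assumes, based on the patch and on the follow-up in this thread stating that ovn-controller is change-aware, that each session keeps the read-only value it was created with and that a non-forced reconnect leaves change-aware sessions connected. All class and function names are invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    name: str
    db_change_aware: bool
    read_only: bool          # snapshot of the server flag at creation time
    connected: bool = True


@dataclass
class Server:
    read_only: bool
    sessions: list = field(default_factory=list)

    def accept(self, name, db_change_aware):
        """Create a session that copies the server's current read_only flag."""
        session = Session(name, db_change_aware, self.read_only)
        self.sessions.append(session)
        return session

    def reconnect(self, force):
        """Drop every session if 'force', else only non-change-aware ones."""
        for s in self.sessions:
            if force or not s.db_change_aware:
                s.connected = False
        self.sessions = [s for s in self.sessions if s.connected]

    def set_read_only(self, read_only, force):
        """Change the server role and reconnect sessions on a real change."""
        if self.read_only != read_only:
            self.read_only = read_only
            self.reconnect(force)


def stale_sessions(server):
    """Names of sessions whose snapshot disagrees with the server's flag."""
    return [s.name for s in server.sessions if s.read_only != server.read_only]


if __name__ == "__main__":
    for force in (False, True):
        srv = Server(read_only=True)               # still a backup
        srv.accept("ovn-controller", db_change_aware=True)
        srv.set_read_only(False, force=force)      # promoted to active
        print(force, stale_sessions(srv))
```

In this model, `force=False` leaves the change-aware session connected with a stale read-only snapshot, which matches the symptom reported at the start of the thread, while `force=True` drops it so that the client has to reconnect and pick up the new state.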
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
ovn-controller is in fact change-aware, but the _Server database doesn't report whether a particular database is read-only or read/write. I guess that was an oversight when I designed that schema. That means that there's no way for clients to monitor whether a particular database changes between read-only and read/write.

I guess there are two ways to fix it:

1. Add a read/write column to the _Server schema and implement it in ovsdb-server and ovn-controller.

2. Make ovsdb-server kill connections when read/write status changes.

#2 is probably what we should do right away. #1 can wait.

On Mon, Jul 08, 2019 at 01:25:09PM +0200, Daniel Alvarez Sanchez wrote:
> [...]
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
Would you mind formally submitting this? It seems like the best immediate solution.

On Mon, Jul 08, 2019 at 02:27:31PM +0200, Daniel Alvarez Sanchez wrote:
> [...]
Re: [ovs-discuss] Issue with failover running ovsdb-server in A/P mode with Pacemaker
On Mon, Jul 8, 2019 at 5:43 PM Ben Pfaff wrote:
>
> Would you mind formally submitting this? It seems like the best
> immediate solution.

Will do, thanks a lot Ben!