Re: [ovs-discuss] bug in vport.c file
Hello everyone, could anyone please help with this? Thanks in advance.

I am trying to test the OVS switch with Mininet by tracing the call stack in the datapath module (by printing some messages), and I found that ovs_vport_send(), which internally calls vport->ops->send(skb), which references rpl_dev_queue_xmit() (defined in gso.c under the datapath/linux directory), is not getting invoked.

Which function in the datapath module is responsible for transmitting packets to a host on the network?

Thanks
vikash

On Thu, Aug 9, 2018 at 6:01 PM, Vikas Kumar wrote:
> hi Team,
> please help me on this issue; if I am wrong, please correct me and provide some information.
>
> I am trying to test the OVS switch with Mininet by tracing the call stack in the datapath module by printing some messages, and I found that ovs_vport_send(), which internally calls vport->ops->send(skb), which references rpl_dev_queue_xmit() (defined in gso.c under the datapath/linux directory), is not getting invoked.
>
> Which function in the datapath module is responsible for transmitting packets to a host on the network?
>
> Thanks
> vikash

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss
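For what it's worth, the indirection being traced above can be modeled as below. This is a sketch with stand-in types, not the real datapath structures: ovs_vport_send() does not transmit anything itself, it dispatches through the per-vport ops table, so the function that finally hands the skb to the NIC depends on the vport type. In the out-of-tree module, a netdev vport's ->send typically ends up in dev_queue_xmit() (wrapped as rpl_dev_queue_xmit() by the compat layer in datapath/linux/gso.c on some kernels).

```c
#include <stddef.h>

/* Minimal model of the vport send dispatch (illustrative only). */
struct vport;

struct vport_ops {
    int (*send)(struct vport *vport, const char *skb);
};

struct vport {
    const struct vport_ops *ops;
};

/* Stand-in for dev_queue_xmit(): in the real module this is where the
 * packet would actually leave the host. */
static int
netdev_send(struct vport *vport, const char *skb)
{
    (void) vport;
    (void) skb;
    return 0;
}

static const struct vport_ops netdev_vport_ops = { .send = netdev_send };

/* Mirrors the indirect call the thread is tracing: the real work is
 * behind the function pointer, so a printk in the wrong compat wrapper
 * may simply never be on the path for this vport type. */
static int
ovs_vport_send(struct vport *vport, const char *skb)
{
    return vport->ops->send(vport, skb);
}
```

So one way to trace the actual transmit path is to print which function `vport->ops->send` points at for the vport in question, rather than instrumenting a single candidate implementation.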
Re: [ovs-discuss] OvS and Opendaylight
I tried out ODL (Nitrogen SR3) with the latest OVS (2.10.90) with the following configuration:

sudo ovs-vsctl show
c4f25e49-0e9c-4861-bd2d-e82f20a80864
    Bridge "br20"
        Controller "tcp:10.xx.xx.xx:6653"   <-- ODL
            is_connected: true
        Port "br20"
            Interface "br20"
                type: internal
    Bridge "br1"
        Port "br1"
            Interface "br1"
                type: internal
        Port "veth_t2"
            Interface "veth_t2"
        Port "veth_t0"
            Interface "veth_t0"
    ovs_version: "2.10.90"

On the wire, I could see OVS sending out HELLO:

OpenFlow 1.4
    Version: 1.4 (0x05)
    Type: OFPT_HELLO (0)
    Length: 8
    Transaction ID: 2

with no "element" field. ODL sends back the HELLO:

OpenFlow 1.3
    Version: 1.3 (0x04)
    Type: OFPT_HELLO (0)
    Length: 16
    Transaction ID: 21
    Element
        Type: OFPHET_VERSIONBITMAP (1)
        Length: 8
        Bitmap: 0012

No further handshake messages are sent, except the version 1.3 (0x4) ECHO generated by OVS and its response by ODL. The TCP connection is not torn down, and it gets into a stalemate condition with nothing else happening.

--

On the other instance, with the same ODL and OVS but the following configuration:

sudo ovs-vsctl show
c4f25e49-0e9c-4861-bd2d-e82f20a80864
    Bridge "br20"
        Port "br20"
            Interface "br20"
                type: internal
    Bridge "br1"
        Controller "tcp:10.xx.xx.xx:6653"   <-- ODL
            is_connected: true
        Port "br1"
            Interface "br1"
                type: internal
        Port "veth_t2"
            Interface "veth_t2"
        Port "veth_t0"
            Interface "veth_t0"
    ovs_version: "2.10.90"

OVS generates HELLO with the "element" field:

OpenFlow 1.5
    Version: 1.5 (0x06)
    Type: OFPT_HELLO (0)
    Length: 16
    Transaction ID: 3
    Element
        Type: OFPHET_VERSIONBITMAP (1)
        Length: 8
        Bitmap: 0072

In response, ODL generates a FEATURES REQUEST message, to which OVS responds, and the handshake and other messages seem to be exchanged normally. ODL does not generate its own HELLO message (but this is not a problem).

On Wed, Aug 8, 2018 at 11:10 AM, Ben Pfaff wrote:
> On Wed, Aug 08, 2018 at 02:40:37PM +0300, Eray Guven wrote:
> > Hello
> >
> > Somehow, I cannot use the last version of OvS (2.9.2) with the controller (OpenDaylight Nitrogen).
> > I can work with older versions of OvS easily, so I don't think the source of the problem is ODL here. Yet, I need to work with the last version of OvS at this moment. I couldn't find any bug or release note related to that. Do you have an idea what the problem could be?
> >
> > https://docs.opendaylight.org/en/stable-carbon/submodules/netvirt/docs/user-guide/support.html
> > clearly states that OvS 2.9 is supported with ODL.
> >
> > No errors and no crashes. The controller just doesn't work with the network.
>
> Can you identify the earliest version of OVS that has problems?
> Ideally, this would be a commit ID via "git bisect".
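The version bitmaps in the HELLO traces above decode per the OpenFlow spec: bit N of the OFPHET_VERSIONBITMAP payload means wire version N is supported, so 0x0012 (bits 1 and 4) advertises 0x01 (OF1.0) and 0x04 (OF1.3), while 0x0072 (bits 1, 4, 5, 6) advertises OF1.0, OF1.3, OF1.4, and OF1.5. The negotiated version is the highest bit set in both peers' bitmaps. A sketch (not the actual OVS code) of that rule:

```c
#include <stdint.h>

/* Negotiate an OpenFlow version from two OFPHET_VERSIONBITMAP payloads:
 * the result is the highest wire version present in both bitmaps, or -1
 * if the peers share no version. */
static int
negotiate_version(uint32_t a, uint32_t b)
{
    uint32_t common = a & b;

    for (int v = 31; v >= 1; v--) {
        if (common & (UINT32_C(1) << v)) {
            return v;           /* highest common wire version */
        }
    }
    return -1;                  /* no common version */
}
```

With the bitmaps from the second trace, negotiate_version(0x72, 0x12) yields 0x04, i.e. OpenFlow 1.3, which matches the version seen on the subsequent ECHO messages.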
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
On Thu, Aug 09, 2018 at 09:32:21AM -0700, Han Zhou wrote:
> On Thu, Aug 9, 2018 at 1:57 AM, aginwala wrote:
> >
> > To add on, we are using the LB VIP IP and no constraint with 3 nodes, as Han mentioned earlier, where the active node syncs from an invalid IP and the other two nodes sync from the LB VIP IP. Also, I was able to get some logs from one node that triggered:
> > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
> >
> > 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
> > 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
> > 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
> >
> > I am not sure if sync_from on the active node, too, via some invalid IP is causing some flaw when all are down during the race condition in this corner case.
> >
> > On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique wrote:
> >>
> >> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
> >>>
> >>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
> >>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
> >>> > >
> >>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
> >>> > > > Hi,
> >>> > > >
> >>> > > > We found an issue in our testing (thanks aginwala) with active-backup mode in OVN setup. In the 3-node setup with pacemaker, after stopping pacemaker on all three nodes (simulating a complete shutdown) and then starting all of them simultaneously, there is a good chance that the whole DB content gets lost.
> >>> > > >
> >>> > > > After studying the replication code, it seems there is a phase where the backup node deletes all its data and waits for data to be synced from the active node:
> >>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
> >>> > > >
> >>> > > > At this state, if the node was set to active, then all data is gone for the whole cluster. This can happen in different situations. In the test scenario mentioned above it is very likely to happen, since pacemaker just randomly selects one as master, not knowing the internal sync state of each node. It could also happen when failover happens right after a new backup is started, although that is less likely in a real environment, so starting up the nodes one by one may largely reduce the probability.
> >>> > > >
> >>> > > > Does this analysis make sense? We will do more tests to verify the conclusion, but would like to share it with the community for discussion and suggestions. Once this happens it is very critical - even more serious than just no HA. Without HA it is just a control plane outage, but this would be a data plane outage, because OVS flows will be removed accordingly, since the data is considered deleted from ovn-controller's point of view.
> >>> > > >
> >>> > > > We understand that active-standby is not the ideal HA mechanism and clustering is the future, and we are also testing the clustering with the latest patch. But it would be good if this problem could be addressed with some quick fix, such as keeping a copy of the old data somewhere until the first sync finishes?
> >>> > >
> >>> > > This does seem like a plausible bug, and at first glance I believe that you're correct about the race here. I guess that the correct behavior must be to keep the original data until a new copy of the data has been received, and only then atomically replace the original by the new.
> >>> > >
> >>> > > Is this something you have time and ability to fix?
> >>> >
> >>> > Thanks Ben for the quick response. I guess I will not have time until I send out the next series for incremental processing :)
> >>> > It would be good if someone could help; please reply to this email if he/she starts working on it, so that we will not end up with overlapping work.
> >>
> >> I will give it a shot at fixing this issue.
> >>
> >> In the case of TripleO we haven't hit this issue. I haven't tested this scenario; I will test it out. One difference compared to your setup is that TripleO uses an IPAddr2 resource and a colocation constraint set.
> >>
> >> Thanks
> >> Numan
>
> Thanks Numan for helping on this. I think IPAddr2 should have the same problem, if my previous analysis was right, unless using IPAddr2 would result in pacemaker always electing the node that is configured with the master IP as the master when starting pacemaker on all nodes again.
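The fix direction discussed in this thread - keep the original data until a new copy has been fully received, then atomically replace it - can be sketched as below. This is a hypothetical illustration with made-up names, not the actual ovsdb-server API: the replicated data is staged in a shadow copy while the initial sync runs, so a failover before the sync completes still sees the original database.

```c
#include <stdlib.h>

/* Stand-in for replicated table contents. */
struct db_snapshot {
    int n_rows;
};

struct replica {
    struct db_snapshot *live;    /* served to clients; never cleared */
    struct db_snapshot *staging; /* filled while the initial sync runs */
};

/* Called only once the initial snapshot from the active server has
 * fully arrived: swap the staged copy in and discard the old data.
 * Until this point the original data survives any failover. */
static void
initial_sync_done(struct replica *r)
{
    struct db_snapshot *old = r->live;

    r->live = r->staging;
    r->staging = NULL;
    free(old);
}
```

The key property is that there is no window in which `live` points at an empty or partially synced database, which is exactly the window the reported race exploits.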
Re: [ovs-discuss] equal generation_id
On Thu, Aug 09, 2018 at 02:09:22PM -0400, Logan Blyth wrote:
> Hello,
> I am testing a master/slave controller configuration and have run into a question. The spec says that the function used to determine whether a generation_id is stale or not is
>
>     distance(GEN_ID_X, cached_generation_id) < 0
>
> What is supposed to happen if distance = 0?
>
> In my testing, Wireshark showed a loop of role_request/role_reply messages to each controller. 'ovs-vsctl list controller' showed one controller as slave and one as master, but I didn't see the corresponding message sent across the wire, so both of my controllers thought they became master.
>
> I apologize if this issue has been raised before on the list and I missed it. I did search the FAQ for OFPT_ROLE_REQUEST and MASTER but didn't find anything that covered this.

It looks like OVS implements exactly the algorithm from the spec.
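The spec's check, written out as a sketch (not the actual OVS source), makes the distance = 0 case explicit: the comparison is strictly "< 0" on the signed 64-bit circular distance, so a role request whose generation_id equals the cached value is NOT stale and is accepted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Staleness check for a role request's generation_id, per the OpenFlow
 * spec: stale iff the signed 64-bit distance from the cached value is
 * negative.  Equal IDs give distance 0, which is not stale. */
static bool
generation_id_is_stale(uint64_t gen_id, uint64_t cached)
{
    return (int64_t) (gen_id - cached) < 0;
}
```

A consequence is that two controllers repeatedly requesting master with the same generation_id both keep passing this check, which is consistent with the role_request/role_reply loop observed in Wireshark.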
[ovs-discuss] equal generation_id
Hello,

I am testing a master/slave controller configuration and have run into a question. The spec says that the function used to determine whether a generation_id is stale or not is

    distance(GEN_ID_X, cached_generation_id) < 0

What is supposed to happen if distance = 0?

In my testing, Wireshark showed a loop of role_request/role_reply messages to each controller. 'ovs-vsctl list controller' showed one controller as slave and one as master, but I didn't see the corresponding message sent across the wire, so both of my controllers thought they became master.

I apologize if this issue has been raised before on the list and I missed it. I did search the FAQ for OFPT_ROLE_REQUEST and MASTER but didn't find anything that covered this.

Logan

--
Logan Blyth
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
On Thu, Aug 9, 2018 at 1:57 AM, aginwala wrote:
>
> To add on, we are using the LB VIP IP and no constraint with 3 nodes, as Han mentioned earlier, where the active node syncs from an invalid IP and the other two nodes sync from the LB VIP IP. Also, I was able to get some logs from one node that triggered:
> https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460
>
> 2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
> 2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
> 2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)
>
> I am not sure if sync_from on the active node, too, via some invalid IP is causing some flaw when all are down during the race condition in this corner case.
>
> On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique wrote:
>>
>> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
>>>
>>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
>>> > >
>>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
>>> > > > Hi,
>>> > > >
>>> > > > We found an issue in our testing (thanks aginwala) with active-backup mode in OVN setup. In the 3-node setup with pacemaker, after stopping pacemaker on all three nodes (simulating a complete shutdown) and then starting all of them simultaneously, there is a good chance that the whole DB content gets lost.
>>> > > >
>>> > > > After studying the replication code, it seems there is a phase where the backup node deletes all its data and waits for data to be synced from the active node:
>>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>>> > > >
>>> > > > At this state, if the node was set to active, then all data is gone for the whole cluster. This can happen in different situations. In the test scenario mentioned above it is very likely to happen, since pacemaker just randomly selects one as master, not knowing the internal sync state of each node. It could also happen when failover happens right after a new backup is started, although that is less likely in a real environment, so starting up the nodes one by one may largely reduce the probability.
>>> > > >
>>> > > > Does this analysis make sense? We will do more tests to verify the conclusion, but would like to share it with the community for discussion and suggestions. Once this happens it is very critical - even more serious than just no HA. Without HA it is just a control plane outage, but this would be a data plane outage, because OVS flows will be removed accordingly, since the data is considered deleted from ovn-controller's point of view.
>>> > > >
>>> > > > We understand that active-standby is not the ideal HA mechanism and clustering is the future, and we are also testing the clustering with the latest patch. But it would be good if this problem could be addressed with some quick fix, such as keeping a copy of the old data somewhere until the first sync finishes?
>>> > >
>>> > > This does seem like a plausible bug, and at first glance I believe that you're correct about the race here. I guess that the correct behavior must be to keep the original data until a new copy of the data has been received, and only then atomically replace the original by the new.
>>> > >
>>> > > Is this something you have time and ability to fix?
>>> >
>>> > Thanks Ben for the quick response. I guess I will not have time until I send out the next series for incremental processing :)
>>> > It would be good if someone could help; please reply to this email if he/she starts working on it, so that we will not end up with overlapping work.
>>
>> I will give it a shot at fixing this issue.
>>
>> In the case of TripleO we haven't hit this issue. I haven't tested this scenario; I will test it out. One difference compared to your setup is that TripleO uses an IPAddr2 resource and a colocation constraint set.
>>
>> Thanks
>> Numan

Thanks Numan for helping on this. I think IPAddr2 should have the same problem, if my previous analysis was right, unless using IPAddr2 would result in pacemaker always electing the node that is configured with the master IP as the master when starting pacemaker on all nodes again.

Ali, thanks for the information. Just to clarify: the log "removing xxx database due to server termination" is not related to this issue. It might be misleading, but it doesn't mean deleting the content of the database. It is just doing clean-up of internal data structures before exiting. The cod
Re: [ovs-discuss] The kernel module does not support meters
Thank you very much, Ben Pfaff. It helped me a lot.

Thanks
Vikash

On Tue, Aug 7, 2018 at 10:12 PM, Ben Pfaff wrote:
> Did you read the FAQ?
>
> Q: I get an error like this when I configure Open vSwitch:
>
>        configure: error: Linux kernel in is version , but
>        version newer than is not supported (please refer to the
>        FAQ for advice)
>
>    What should I do?
>
> A: You have the following options:
>
>    - Use the Linux kernel module supplied with the kernel that you are using. (See also the following FAQ.)
>
>    - If there is a newer released version of Open vSwitch, consider building that one, because it may support the kernel that you are building against. (To find out, consult the table in the previous FAQ.)
>
>    - The Open vSwitch "master" branch may support the kernel that you are using, so consider building the kernel module from "master".
>
>    All versions of Open vSwitch userspace are compatible with all versions of the Open vSwitch kernel module, so you do not have to use the kernel module from one source along with the userspace programs from the same source.
>
> On Tue, Aug 07, 2018 at 03:06:44PM +0530, Vikas Kumar wrote:
> > Thanks for your reply, Justin.
> > I had Linux kernel 4.15 earlier, but when I was trying to configure the OVS master version, in that case I was getting some other error. Please see my previous conversation below.
> >
> > [Vikas Kumar, Aug 2:]
> > hi Team,
> > I am using Ubuntu 16.04. I am trying to configure the OVS source code, but I am getting the below error message. Please help me with this:
> >
> >     configure: error: Linux kernel in /lib/modules/4.15.0-29-generic/build is version 4.15.18, but version newer than 4.14.x is not supported (please refer to the FAQ for advice)
> >
> > Regards
> > vikash
> >
> > [Darrell Ball, Aug 2:]
> > Means your kernel has been upgraded to 4.15 in your Xenial environment (check uname -r).
> > The latest OVS release supports up to 4.14:
> > http://docs.openvswitch.org/en/latest/faq/releases/
> >
> > Actually I want to dump the OVS flows for my investigation.
> >
> > Thanks
> > Vikash
> >
> > On Tue, Aug 7, 2018 at 2:17 PM, Justin Pettit wrote:
> > >
> > > > On Aug 6, 2018, at 8:49 PM, Vikas Kumar wrote:
> > > >
> > > > hi Team,
> > > > kindly help me with this: when I type the "sudo ovs-dpctl dump-flows" command, I get the below error:
> > > > |1|dpif_netlink|INFO|The kernel module does not support meters.
> > > >
> > > > I am using the below Ubuntu kernel version:
> > > > 4.14.13-041413-generic
> > >
> > > Meters were introduced to the Linux kernel in 4.15, so earlier versions don't support them. Unfortunately, all the released upstream kernels have a bug in them that prevents meters from being used properly. A patch was recently accepted upstream, which means that new releases (including maintained older kernels) should receive the fix.
> > >
> > > The OVS 2.10 out-of-tree kernel module will contain meters on all supported kernels.
> > >
> > > All of that said, unless you need meters, you can just ignore that message; it's just informational.
> > >
> > > --Justin
[ovs-discuss] bug in vport.c file
hi Team,

Please help me with this issue; if I am wrong, please correct me and provide some information.

I am trying to test the OVS switch with Mininet by tracing the call stack in the datapath module (by printing some messages), and I found that ovs_vport_send(), which internally calls vport->ops->send(skb), which references rpl_dev_queue_xmit() (defined in gso.c under the datapath/linux directory), is not getting invoked.

Which function in the datapath module is responsible for transmitting packets to a host on the network?

Thanks
vikash
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
To add on, we are using the LB VIP IP and no constraint with 3 nodes, as Han mentioned earlier, where the active node syncs from an invalid IP and the other two nodes sync from the LB VIP IP. Also, I was able to get some logs from one node that triggered:
https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L460

2018-08-04T01:43:39.914Z|03230|reconnect|DBG|tcp:10.189.208.16:50686: entering RECONNECT
2018-08-04T01:43:39.914Z|03231|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.16:50686: disconnecting (removing OVN_Northbound database due to server termination)
2018-08-04T01:43:39.932Z|03232|ovsdb_jsonrpc_server|INFO|tcp:10.189.208.21:56160: disconnecting (removing _Server database due to server termination)

I am not sure if sync_from on the active node, too, via some invalid IP is causing some flaw when all are down during the race condition in this corner case.

On Thu, Aug 9, 2018 at 1:35 AM Numan Siddique wrote:
>
> On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
>>
>> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
>> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
>> > >
>> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
>> > > > Hi,
>> > > >
>> > > > We found an issue in our testing (thanks aginwala) with active-backup mode in OVN setup. In the 3-node setup with pacemaker, after stopping pacemaker on all three nodes (simulating a complete shutdown) and then starting all of them simultaneously, there is a good chance that the whole DB content gets lost.
>> > > >
>> > > > After studying the replication code, it seems there is a phase where the backup node deletes all its data and waits for data to be synced from the active node:
>> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
>> > > >
>> > > > At this state, if the node was set to active, then all data is gone for the whole cluster. This can happen in different situations. In the test scenario mentioned above it is very likely to happen, since pacemaker just randomly selects one as master, not knowing the internal sync state of each node. It could also happen when failover happens right after a new backup is started, although that is less likely in a real environment, so starting up the nodes one by one may largely reduce the probability.
>> > > >
>> > > > Does this analysis make sense? We will do more tests to verify the conclusion, but would like to share it with the community for discussion and suggestions. Once this happens it is very critical - even more serious than just no HA. Without HA it is just a control plane outage, but this would be a data plane outage, because OVS flows will be removed accordingly, since the data is considered deleted from ovn-controller's point of view.
>> > > >
>> > > > We understand that active-standby is not the ideal HA mechanism and clustering is the future, and we are also testing the clustering with the latest patch. But it would be good if this problem could be addressed with some quick fix, such as keeping a copy of the old data somewhere until the first sync finishes?
>> > >
>> > > This does seem like a plausible bug, and at first glance I believe that you're correct about the race here. I guess that the correct behavior must be to keep the original data until a new copy of the data has been received, and only then atomically replace the original by the new.
>> > >
>> > > Is this something you have time and ability to fix?
>> >
>> > Thanks Ben for the quick response. I guess I will not have time until I send out the next series for incremental processing :)
>> > It would be good if someone could help; please reply to this email if he/she starts working on it, so that we will not end up with overlapping work.
>
> I will give it a shot at fixing this issue.
>
> In the case of TripleO we haven't hit this issue. I haven't tested this scenario; I will test it out. One difference compared to your setup is that TripleO uses an IPAddr2 resource and a colocation constraint set.
>
> Thanks
> Numan
>
>> > One more thing that confuses me in the code is:
>> > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L213-L216
>> > Does this code just change the ovsdb-server instance from backup to active when the connection to the active is lost? This isn't the behavior described by the manpage for active-standby mode. Isn't the mode supposed to be changed only by management software/humans?
>>
>> replication_is_alive() is based on jsonrpc_session_is_alive(), which returns true if the session is not permanently dead, that is, if it's currently connected or trying to reconnect. It really only returns false if a maximum number of retries was configured (by default, there is none) and that number has been exceeded.
Re: [ovs-discuss] Possible data loss of OVSDB active-backup mode
On Thu, Aug 9, 2018 at 1:07 AM Ben Pfaff wrote:
> On Wed, Aug 08, 2018 at 12:18:10PM -0700, Han Zhou wrote:
> > On Wed, Aug 8, 2018 at 11:24 AM, Ben Pfaff wrote:
> > >
> > > On Wed, Aug 08, 2018 at 12:37:04AM -0700, Han Zhou wrote:
> > > > Hi,
> > > >
> > > > We found an issue in our testing (thanks aginwala) with active-backup mode in OVN setup. In the 3-node setup with pacemaker, after stopping pacemaker on all three nodes (simulating a complete shutdown) and then starting all of them simultaneously, there is a good chance that the whole DB content gets lost.
> > > >
> > > > After studying the replication code, it seems there is a phase where the backup node deletes all its data and waits for data to be synced from the active node:
> > > > https://github.com/openvswitch/ovs/blob/master/ovsdb/replication.c#L306
> > > >
> > > > At this state, if the node was set to active, then all data is gone for the whole cluster. This can happen in different situations. In the test scenario mentioned above it is very likely to happen, since pacemaker just randomly selects one as master, not knowing the internal sync state of each node. It could also happen when failover happens right after a new backup is started, although that is less likely in a real environment, so starting up the nodes one by one may largely reduce the probability.
> > > >
> > > > Does this analysis make sense? We will do more tests to verify the conclusion, but would like to share it with the community for discussion and suggestions. Once this happens it is very critical - even more serious than just no HA. Without HA it is just a control plane outage, but this would be a data plane outage, because OVS flows will be removed accordingly, since the data is considered deleted from ovn-controller's point of view.
> > > >
> > > > We understand that active-standby is not the ideal HA mechanism and clustering is the future, and we are also testing the clustering with the latest patch. But it would be good if this problem could be addressed with some quick fix, such as keeping a copy of the old data somewhere until the first sync finishes?
> > >
> > > This does seem like a plausible bug, and at first glance I believe that you're correct about the race here. I guess that the correct behavior must be to keep the original data until a new copy of the data has been received, and only then atomically replace the original by the new.
> > >
> > > Is this something you have time and ability to fix?
> >
> > Thanks Ben for the quick response. I guess I will not have time until I send out the next series for incremental processing :)
> > It would be good if someone could help; please reply to this email if he/she starts working on it, so that we will not end up with overlapping work.

I will give it a shot at fixing this issue.

In the case of TripleO we haven't hit this issue. I haven't tested this scenario; I will test it out. One difference compared to your setup is that TripleO uses an IPAddr2 resource and a colocation constraint set.

Thanks
Numan

> > One more thing that confuses me in the code is:
> > https://github.com/openvswitch/ovs/blob/master/ovsdb/ovsdb-server.c#L213-L216
> > Does this code just change the ovsdb-server instance from backup to active when the connection to the active is lost? This isn't the behavior described by the manpage for active-standby mode. Isn't the mode supposed to be changed only by management software/humans?
>
> replication_is_alive() is based on jsonrpc_session_is_alive(), which returns true if the session is not permanently dead, that is, if it's currently connected or trying to reconnect. It really only returns false if a maximum number of retries was configured (by default, there is none) and that number has been exceeded.
>
> I'm actually not sure whether there's a way that this particular jsonrpc_session_is_alive() could ever return false. Maybe only if the session was started with a remote that fails parsing? Maybe it should not be checked at all.
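The liveness rule Ben describes can be sketched as below, with illustrative field names rather than the real jsonrpc-session structure: the session counts as alive while it is connected or still willing to reconnect, and only dies once a maximum retry count was both configured and exhausted. With the default of unlimited retries, it can therefore never report dead from backoff alone.

```c
#include <stdbool.h>

/* Illustrative model of a reconnecting session's liveness check. */
struct session_state {
    bool connected;
    unsigned int retries_done;  /* reconnect attempts so far */
    unsigned int max_retries;   /* 0 means "retry forever" (the default) */
};

/* Alive while connected, or while reconnect attempts remain.  Only a
 * configured, exhausted retry limit makes the session permanently dead. */
static bool
session_is_alive(const struct session_state *s)
{
    if (s->connected) {
        return true;
    }
    return s->max_retries == 0 || s->retries_done < s->max_retries;
}
```

This is why the backup-to-active transition guarded by this check is so hard to trigger through connection loss alone: under default settings the check essentially never fires.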