:) The only thing is that while using pacemaker, if the node that pacemaker is pointing to goes down, all the active/standby northd nodes have to be updated to point to a new node from the cluster. But I will dig in more to see what else I can find.
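Roughly, the pacemaker setup being discussed would look something like this (a minimal sketch, assuming pacemaker/corosync are already configured across the nodes and that each node has an ovn-northd systemd unit installed; untested):

# run ovn-northd under pacemaker as a single-active resource
pcs resource create ovn-northd systemd:ovn-northd op monitor interval=10s
# a plain primitive runs on exactly one node at a time; pacemaker restarts it
# on another node if the current one goes down
pcs status resources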
@Ben: Any further suggestions? Regards, On Wed, Mar 21, 2018 at 10:22 AM, Han Zhou <zhou...@gmail.com> wrote: > > > On Wed, Mar 21, 2018 at 9:49 AM, aginwala <aginw...@asu.edu> wrote: > >> Thanks Numan: >> >> Yup, agree with the locking part. For now, yes, I am running northd on one >> node. I might write a script to monitor northd in the cluster so that if the >> node where it's running goes down, the script can spin up northd on one of the other >> active nodes as a dirty hack. >> >> The "dirty hack" is pacemaker :) > > >> Sure, I will await the inputs from Ben too on this and see how complex >> it would be to roll out this feature. >> >> >> Regards, >> >> >> On Wed, Mar 21, 2018 at 5:43 AM, Numan Siddique <nusid...@redhat.com> >> wrote: >> >>> Hi Aliasgar, >>> >>> ovsdb-server maintains locks per connection and not across the db. >>> A workaround for you now would be to configure all the ovn-northd instances >>> to connect to one ovsdb-server if you want to have active/standby. >>> >>> Probably Ben can answer if there is a plan to support ovsdb locks across >>> the db. We also need this support in networking-ovn as it also uses ovsdb >>> locks. >>> >>> Thanks >>> Numan >>> >>> >>> On Wed, Mar 21, 2018 at 1:40 PM, aginwala <aginw...@asu.edu> wrote: >>> >>>> Hi Numan: >>>> >>>> Just figured out that ovn-northd is running as active on all 3 nodes >>>> instead of one active instance as I continued to test further, which results >>>> in db errors as per the logs. >>>> >>>> >>>> # on node 3, I run ovn-nbctl ls-add ls2 ; it populates the below logs in >>>> ovn-northd >>>> 2018-03-21T06:01:59.442Z|00007|ovsdb_idl|WARN|transaction error: >>>> {"details":"Transaction causes multiple rows in \"Datapath_Binding\" table >>>> to have identical values (1) for index on column \"tunnel_key\". First >>>> row, with UUID 8c5d9342-2b90-4229-8ea1-001a733a915c, was inserted by >>>> this transaction. Second row, with UUID >>>> 8e06f919-4cc7-4ffc-9a79-20ce6663b683, >>>> existed in the database before this transaction and was not modified by the >>>> transaction.","error":"constraint violation"} >>>> >>>> In the southbound datapath list, 2 duplicate records get created for the same >>>> switch. >>>> >>>> # ovn-sbctl list Datapath >>>> _uuid : b270ae30-3458-445f-95d2-b14e8ebddd01 >>>> external_ids : >>>> {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d", >>>> name="ls2"} >>>> tunnel_key : 2 >>>> >>>> _uuid : 8e06f919-4cc7-4ffc-9a79-20ce6663b683 >>>> external_ids : >>>> {logical-switch="4d6674e3-ff9f-4f38-b050-0fa9bec9e34d", >>>> name="ls2"} >>>> tunnel_key : 1 >>>> >>>> >>>> >>>> # on nodes 1 and 2 where northd is running, it gives the below error: >>>> 2018-03-21T06:01:59.437Z|00008|ovsdb_idl|WARN|transaction error: >>>> {"details":"cannot delete Datapath_Binding row >>>> 8e06f919-4cc7-4ffc-9a79-20ce6663b683 because of 17 remaining >>>> reference(s)","error":"referential integrity violation"} >>>> >>>> As per the commit message, for northd I re-tried setting --ovnnb-db="tcp:10.169.125.152:6641,tcp:10.169.125.131:6641,tcp:10.148.181.162:6641" >>>> and --ovnsb-db="tcp:10.169.125.152:6642,tcp:10.169.125.131:6642,tcp:10.148.181.162:6642", >>>> and it did not help either. >>>> >>>> There is no issue if I keep running only one instance of northd on any >>>> of these 3 nodes. Hence, I wanted to know whether there is something else >>>> missing here to make only one northd instance active and the rest as >>>> standby? 
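A quick way to spot such duplicate bindings is something like the following (just an illustrative pipeline over the ovn-sbctl output shown above):

# count Datapath_Binding rows per logical switch; a count greater than 1
# means duplicates were created by the concurrent ovn-northd instances
ovn-sbctl list Datapath_Binding | grep -o 'logical-switch="[^"]*"' | sort | uniq -c | sort -rn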
>>>> >>>> >>>> Regards, >>>> >>>> On Thu, Mar 15, 2018 at 3:09 AM, Numan Siddique <nusid...@redhat.com> >>>> wrote: >>>> >>>>> That's great >>>>> >>>>> Numan >>>>> >>>>> >>>>> On Thu, Mar 15, 2018 at 2:57 AM, aginwala <aginw...@asu.edu> wrote: >>>>> >>>>>> Hi Numan: >>>>>> >>>>>> I tried on new nodes (kernel : 4.4.0-104-generic , Ubuntu 16.04)with >>>>>> fresh installation and it worked super fine for both sb and nb dbs. Seems >>>>>> like some kernel issue on the previous nodes when I re-installed raft >>>>>> patch >>>>>> as I was running different ovs version on those nodes before. >>>>>> >>>>>> >>>>>> For 2 HVs, I now set ovn-remote="tcp:10.169.125.152:6642, tcp: >>>>>> 10.169.125.131:6642, tcp:10.148.181.162:6642" and started >>>>>> controller and it works super fine. >>>>>> >>>>>> >>>>>> Did some failover testing by rebooting/killing the leader ( >>>>>> 10.169.125.152) and bringing it back up and it works as expected. >>>>>> Nothing weird noted so far. >>>>>> >>>>>> # check-cluster gives below data one of the node(10.148.181.162) post >>>>>> leader failure >>>>>> >>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db >>>>>> ovsdb-tool: leader /etc/openvswitch/ovnsb_db.db for term 2 has log >>>>>> entries only up to index 18446744073709551615, but index 9 was committed >>>>>> in >>>>>> a previous term (e.g. by /etc/openvswitch/ovnsb_db.db) >>>>>> >>>>>> >>>>>> For check-cluster, are we planning to add more output showing which >>>>>> node is active(leader), etc in upcoming versions ? >>>>>> >>>>>> >>>>>> Thanks a ton for helping sort this out. I think the patch looks good >>>>>> to be merged post addressing of the comments by Justin along with the man >>>>>> page details for ovsdb-tool. >>>>>> >>>>>> >>>>>> I will do some more crash testing for the cluster along with the >>>>>> scale test and keep you posted if something unexpected is noted. >>>>>> >>>>>> >>>>>> >>>>>> Regards, >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Mar 13, 2018 at 11:07 PM, Numan Siddique <nusid...@redhat.com >>>>>> > wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Mar 14, 2018 at 7:51 AM, aginwala <aginw...@asu.edu> wrote: >>>>>>> >>>>>>>> Sure. >>>>>>>> >>>>>>>> To add on , I also ran for nb db too using different port and >>>>>>>> Node2 crashes with same error : >>>>>>>> # Node 2 >>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl --db-nb-addr=10.99.152.138 >>>>>>>> --db-nb-port=6641 --db-nb-cluster-remote-addr="tcp: >>>>>>>> 10.99.152.148:6645" --db-nb-cluster-local-addr="tcp: >>>>>>>> 10.99.152.138:6645" start_nb_ovsdb >>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnnb_db.db: cannot >>>>>>>> identify file type >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Hi Aliasgar, >>>>>>> >>>>>>> It worked for me. Can you delete the old db files in >>>>>>> /etc/openvswitch/ and try running the commands again ? >>>>>>> >>>>>>> Below are the commands I ran in my setup. 
>>>>>>> >>>>>>> Node 1 >>>>>>> ------- >>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>> --db-sb-addr=192.168.121.91 --db-sb-port=6642 >>>>>>> --db-sb-create-insecure-remote=yes >>>>>>> --db-sb-cluster-local-addr=tcp:192.168.121.91:6644 start_sb_ovsdb >>>>>>> >>>>>>> Node 2 >>>>>>> --------- >>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>> --db-sb-addr=192.168.121.87 --db-sb-port=6642 >>>>>>> --db-sb-create-insecure-remote=yes >>>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.87:6644" >>>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644" >>>>>>> start_sb_ovsdb >>>>>>> >>>>>>> Node 3 >>>>>>> --------- >>>>>>> sudo /usr/share/openvswitch/scripts/ovn-ctl >>>>>>> --db-sb-addr=192.168.121.78 --db-sb-port=6642 >>>>>>> --db-sb-create-insecure-remote=yes >>>>>>> --db-sb-cluster-local-addr="tcp:192.168.121.78:6644" >>>>>>> --db-sb-cluster-remote-addr="tcp:192.168.121.91:6644" >>>>>>> start_sb_ovsdb >>>>>>> >>>>>>> >>>>>>> >>>>>>> Thanks >>>>>>> Numan >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>>> >>>>>>>> On Tue, Mar 13, 2018 at 9:40 AM, Numan Siddique < >>>>>>>> nusid...@redhat.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Mar 13, 2018 at 9:46 PM, aginwala <aginw...@asu.edu> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Thanks Numan for the response. >>>>>>>>>> >>>>>>>>>> There is no command start_cluster_sb_ovsdb in the source code >>>>>>>>>> too. Is that in a separate commit somewhere? Hence, I used >>>>>>>>>> start_sb_ovsdb >>>>>>>>>> which I think would not be a right choice? >>>>>>>>>> >>>>>>>>> >>>>>>>>> Sorry, I meant start_sb_ovsdb. Strange that it didn't work for >>>>>>>>> you. Let me try it out again and update this thread. >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Numan >>>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>> # Node1 came up as expected. >>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642 >>>>>>>>>> --db-sb-create-insecure-remote=yes --db-sb-cluster-local-addr="tc >>>>>>>>>> p:10.99.152.148:6644" start_sb_ovsdb. >>>>>>>>>> >>>>>>>>>> # verifying its a clustered db with ovsdb-tool db-local-address >>>>>>>>>> /etc/openvswitch/ovnsb_db.db >>>>>>>>>> tcp:10.99.152.148:6644 >>>>>>>>>> # ovn-sbctl show works fine and chassis are being populated >>>>>>>>>> correctly. 
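(As an aside, if the clustering unixctl commands are part of this patch revision, the membership and current leader can presumably also be checked at runtime with something along these lines; the command name is taken from the ovsdb-server clustering support and may differ here:)

# ask the running southbound ovsdb-server for its view of the RAFT cluster
ovs-appctl -t /var/run/openvswitch/ovnsb_db.ctl cluster/status OVN_Southbound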
>>>>>>>>>> >>>>>>>>>> #Node 2 fails with error: >>>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>>>> --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>> start_sb_ovsdb >>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot >>>>>>>>>> identify file type >>>>>>>>>> >>>>>>>>>> # So i did start the sb db the usual way using start_ovsdb to >>>>>>>>>> just get the db file created and killed the sb pid and re-ran the >>>>>>>>>> command >>>>>>>>>> which gave actual error where it complains for join-cluster command >>>>>>>>>> that is >>>>>>>>>> being called internally >>>>>>>>>> /usr/share/openvswitch/scripts/ovn-ctl >>>>>>>>>> --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>> start_sb_ovsdb >>>>>>>>>> ovsdb-tool: /etc/openvswitch/ovnsb_db.db: not a clustered database >>>>>>>>>> * Backing up database to /etc/openvswitch/ovnsb_db.db.b >>>>>>>>>> ackup1.15.0-70426956 >>>>>>>>>> ovsdb-tool: 'join-cluster' command requires at least 4 arguments >>>>>>>>>> * Creating cluster database /etc/openvswitch/ovnsb_db.db from >>>>>>>>>> existing one >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> # based on above error I killed the sb db pid again and try to >>>>>>>>>> create a local cluster on node then re-ran the join operation as >>>>>>>>>> per the >>>>>>>>>> source code function. >>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6644 tcp:10.99.152.148:6644 >>>>>>>>>> which still complains >>>>>>>>>> ovsdb-tool: I/O error: /etc/openvswitch/ovnsb_db.db: create >>>>>>>>>> failed (File exists) >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> # Node 3: I did not try as I am assuming the same failure as node >>>>>>>>>> 2 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Let me know may know further. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Mar 13, 2018 at 3:08 AM, Numan Siddique < >>>>>>>>>> nusid...@redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Aliasgar, >>>>>>>>>>> >>>>>>>>>>> On Tue, Mar 13, 2018 at 7:11 AM, aginwala <aginw...@asu.edu> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Ben/Noman: >>>>>>>>>>>> >>>>>>>>>>>> I am trying to setup 3 node southbound db cluster using raft10 >>>>>>>>>>>> <https://patchwork.ozlabs.org/patch/854298/> in review. >>>>>>>>>>>> >>>>>>>>>>>> # Node 1 create-cluster >>>>>>>>>>>> ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>> /root/ovs-reviews/ovn/ovn-sb.ovsschema tcp:10.99.152.148:6642 >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> A different port is used for RAFT. So you have to choose another >>>>>>>>>>> port like 6644 for example. 
>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> # Node 2 >>>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>> OVN_Southbound tcp:10.99.152.138:6642 tcp:10.99.152.148:6642 --cid >>>>>>>>>>>> 5dfcb678-bb1d-4377-b02d-a380edec2982 >>>>>>>>>>>> >>>>>>>>>>>> #Node 3 >>>>>>>>>>>> ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>> OVN_Southbound tcp:10.99.152.101:6642 tcp:10.99.152.138:6642 >>>>>>>>>>>> tcp:10.99.152.148:6642 --cid 5dfcb678-bb1d-4377-b02d-a380ed >>>>>>>>>>>> ec2982 >>>>>>>>>>>> >>>>>>>>>>>> # ovn remote is set to all 3 nodes >>>>>>>>>>>> external_ids:ovn-remote="tcp:10.99.152.148:6642, tcp: >>>>>>>>>>>> 10.99.152.138:6642, tcp:10.99.152.101:6642" >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> # Starting sb db on node 1 using below command on node 1: >>>>>>>>>>>> >>>>>>>>>>>> ovsdb-server --detach --monitor -vconsole:off -vraft -vjsonrpc >>>>>>>>>>>> --log-file=/var/log/openvswitch/ovsdb-server-sb.log >>>>>>>>>>>> --pidfile=/var/run/openvswitch/ovnsb_db.pid >>>>>>>>>>>> --remote=db:OVN_Southbound,SB_Global,connections >>>>>>>>>>>> --unixctl=ovnsb_db.ctl >>>>>>>>>>>> --private-key=db:OVN_Southbound,SSL,private_key >>>>>>>>>>>> --certificate=db:OVN_Southbound,SSL,certificate >>>>>>>>>>>> --ca-cert=db:OVN_Southbound,SSL,ca_cert >>>>>>>>>>>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols >>>>>>>>>>>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers >>>>>>>>>>>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock >>>>>>>>>>>> /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>> >>>>>>>>>>>> # check-cluster is returning nothing >>>>>>>>>>>> ovsdb-tool check-cluster /etc/openvswitch/ovnsb_db.db >>>>>>>>>>>> >>>>>>>>>>>> # ovsdb-server-sb.log below shows the leader is elected with >>>>>>>>>>>> only one server and there are rbac related debug logs with rpc >>>>>>>>>>>> replies and >>>>>>>>>>>> empty params with no errors >>>>>>>>>>>> >>>>>>>>>>>> 2018-03-13T01:12:02Z|00002|raft|DBG|server 63d1 added to >>>>>>>>>>>> configuration >>>>>>>>>>>> 2018-03-13T01:12:02Z|00003|raft|INFO|term 6: starting election >>>>>>>>>>>> 2018-03-13T01:12:02Z|00004|raft|INFO|term 6: elected leader by >>>>>>>>>>>> 1+ of 1 servers >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Now Starting the ovsdb-server on the other clusters fails >>>>>>>>>>>> saying >>>>>>>>>>>> ovsdb-server: ovsdb error: /etc/openvswitch/ovnsb_db.db: cannot >>>>>>>>>>>> identify file type >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Also noticed that man ovsdb-tool is missing cluster details. >>>>>>>>>>>> Might want to address it in the same patch or different. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Please advise to what is missing here for running ovn-sbctl >>>>>>>>>>>> show as this command hangs. 
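(For comparison, doing the same setup by hand with ovsdb-tool would presumably look like the commands below. The key difference from the commands quoted above is a dedicated RAFT port, 6644, separate from the 6642 client port; the schema path depends on the installation, and this is untested against this exact patch revision.)

# node 1: create the cluster, listening for RAFT traffic on a dedicated port
ovsdb-tool create-cluster /etc/openvswitch/ovnsb_db.db \
    /usr/share/openvswitch/ovn-sb.ovsschema tcp:10.99.152.148:6644

# nodes 2 and 3: join the cluster, giving the local RAFT address first and
# then the address of an existing member
ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound \
    tcp:10.99.152.138:6644 tcp:10.99.152.148:6644
ovsdb-tool join-cluster /etc/openvswitch/ovnsb_db.db OVN_Southbound \
    tcp:10.99.152.101:6644 tcp:10.99.152.148:6644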
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I think you can use the ovn-ctl command "start_cluster_sb_ovsdb" >>>>>>>>>>> for your testing (atleast for now) >>>>>>>>>>> >>>>>>>>>>> For your setup, I think you can start the cluster as >>>>>>>>>>> >>>>>>>>>>> # Node 1 >>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.148 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.148:6644" >>>>>>>>>>> start_cluster_sb_ovsdb >>>>>>>>>>> >>>>>>>>>>> # Node 2 >>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.138 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.138:6644" >>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" >>>>>>>>>>> start_cluster_sb_ovsdb >>>>>>>>>>> >>>>>>>>>>> # Node 3 >>>>>>>>>>> ovn-ctl --db-sb-addr=10.99.152.101 --db-sb-port=6642 >>>>>>>>>>> --db-sb-create-insecure-remote=yes >>>>>>>>>>> --db-sb-cluster-local-addr="tcp:10.99.152.101:6644" >>>>>>>>>>> --db-sb-cluster-remote-addr="tcp:10.99.152.148:6644" start_c >>>>>>>>>>> luster_sb_ovsdb >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Let me know how it goes. >>>>>>>>>>> >>>>>>>>>>> Thanks >>>>>>>>>>> Numan >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> discuss mailing list >>>>>>>>>>>> disc...@openvswitch.org >>>>>>>>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >> _______________________________________________ >> discuss mailing list >> disc...@openvswitch.org >> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss >> >> >