On Fri, May 18, 2018 at 11:53 PM, aginwala <aginw...@asu.edu> wrote: > > > On Thu, May 17, 2018 at 11:23 PM, Numan Siddique <nusid...@redhat.com> > wrote: > >> >> >> On Fri, May 18, 2018 at 4:24 AM, aginwala <aginw...@asu.edu> wrote: >> >>> Hi: >>> >>> I tried it, and it didn't help: the IP resource always shows as stopped. >>> My private VIP is 192.168.220.108. >>> # kernel panic on active node >>> root@test7:~# echo c > /proc/sysrq-trigger >>> >>> >>> root@test6:~# crm stat >>> Last updated: Thu May 17 22:46:38 2018 Last change: Thu May 17 22:45:03 >>> 2018 by root via cibadmin on test6 >>> Stack: corosync >>> Current DC: test7 (version 1.1.14-70404b0) - partition with quorum >>> 2 nodes and 3 resources configured >>> >>> Online: [ test6 test7 ] >>> >>> Full list of resources: >>> >>> VirtualIP (ocf::heartbeat:IPaddr2): Started test7 >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Masters: [ test7 ] >>> Slaves: [ test6 ] >>> >>> root@test6:~# crm stat >>> Last updated: Thu May 17 22:46:38 2018 Last change: Thu May 17 22:45:03 >>> 2018 by root via cibadmin on test6 >>> Stack: corosync >>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum >>> 2 nodes and 3 resources configured >>> >>> Online: [ test6 ] >>> OFFLINE: [ test7 ] >>> >>> Full list of resources: >>> >>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Slaves: [ test6 ] >>> Stopped: [ test7 ] >>> >>> root@test6:~# crm stat >>> Last updated: Thu May 17 22:49:26 2018 Last change: Thu May 17 22:45:03 >>> 2018 by root via cibadmin on test6 >>> Stack: corosync >>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum >>> 2 nodes and 3 resources configured >>> >>> Online: [ test6 ] >>> OFFLINE: [ test7 ] >>> >>> Full list of resources: >>> >>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Stopped: [ test6 test7 ] >>> >>> I think this
change is not needed, or something else is wrong when using the >>> virtual IP resource. >>> >> >> Hi Aliasgar, I think you haven't created the resource properly, or >> haven't set the colocation constraints properly. Which pcs/crm commands did you >> use to create the OVN db resources? >> Can you share the output of "pcs resource show ovndb_servers" and "pcs >> constraint"? >> In the case of tripleo, we create the resource like this - >> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L80 >> > > >>>>> # I am using the same commands suggested upstream in the ovs > document to create the resources: > I am skipping the manage-northd option, with the default inactivity probe interval > http://docs.openvswitch.org/en/latest/topics/integration/#ha-for-ovn-db-servers-using-pacemaker > # cat pcs_with_ipaddr2.sh > pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \ > params ip="192.168.220.108" op monitor interval="30s" > pcs resource create ovndb_servers ocf:ovn:ovndb-servers \ > master_ip="192.168.220.108" \ > op monitor interval="10s" \ > op monitor role=Master interval="15s" --debug > pcs resource master ovndb_servers-master ovndb_servers \ > meta notify="true" > pcs constraint order promote ovndb_servers-master then VirtualIP >
I think the ordering should be reversed. We want pacemaker to start the IPAddr2 resource first and then start the ovndb_servers resource. Maybe we need to update the document. Can you please try with the command "pcs constraint order VirtualIP then ovndb_servers-master"? I think that's why, in your setup, the IPAddr2 resource is not started. Thanks, Numan > pcs constraint colocation add VirtualIP with master ovndb_servers-master \ > score=INFINITY > > # pcs resource show ovndb_servers > Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers) > Attributes: master_ip=192.168.220.108 > Operations: start interval=0s timeout=30s (ovndb_servers-start-interval-0s) > stop interval=0s timeout=20s (ovndb_servers-stop-interval-0s) > promote interval=0s timeout=50s > (ovndb_servers-promote-interval-0s) > demote interval=0s timeout=50s (ovndb_servers-demote-interval-0s) > monitor interval=10s (ovndb_servers-monitor-interval-10s) > monitor interval=15s role=Master > (ovndb_servers-monitor-interval-15s) > # pcs constraint > Location Constraints: > Ordering Constraints: > promote ovndb_servers-master then start VirtualIP (kind:Mandatory) > Colocation Constraints: > VirtualIP with ovndb_servers-master (score:INFINITY) (rsc-role:Started) > (with-rsc-role:Master) > >> >> >>> >>> Maybe you need a promotion logic similar to what we have for LB with >>> pacemaker in the discussion (will submit a formal patch soon). I did test >>> a kernel panic with the LB code change, and it works fine: node2 gets >>> promoted. The below works fine for LB, even with a kernel panic, without >>> this change: >>> >> >> This issue is not seen all the time. I have another setup where I don't >> see this issue at all. The issue is seen when the IPAddr2 resource is moved >> to another slave node and the ovsdb-servers start reporting as master as soon >> as the IP address is configured.
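Numan's suggested reordering above can be sketched as a small set of pcs commands (an illustration, not taken verbatim from the thread; the constraint-removal step is an assumption, and its exact syntax varies across pcs releases):

```shell
# Drop the original ordering constraint (promote-then-start-VIP).
# NOTE: removal syntax differs between pcs versions; if this form is not
# accepted, find the constraint id with `pcs constraint --full` and use
# `pcs constraint remove <id>` instead.
pcs constraint order remove ovndb_servers-master VirtualIP

# Corrected order, as suggested in the thread: start the VIP first,
# then promote the OVN DB servers.
pcs constraint order VirtualIP then ovndb_servers-master

# The colocation constraint from the original script stays as-is:
pcs constraint colocation add VirtualIP with master ovndb_servers-master \
    score=INFINITY
```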
>> >> When the issue is seen, we hit the code here - >> https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L412. Ideally, when the promote action is called, the ovsdb >> servers will be running as slaves/standby, and the promote action promotes >> them to master. But when the issue is seen, the ovsdb servers report their >> status as active, because of which we don't complete the full promote >> action and return at L412. And later, when the notify action is called, we >> demote the servers because of this - https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176 >> >> >>> Yes, I agree! As you said, settings work fine in one cluster, and if you > use another cluster with the same settings, you may see surprises. > > >> For a use case like yours (where a load balancer VIP is used), you may >> not see this issue at all, since you will not be using the IPaddr2 resource >> as the master IP. >> > >>> Correct; I just wanted to report both settings to show you > pacemaker's behavior with IPaddr2 vs. the LB VIP IP.
> >> >> >>> root@test-pace1-2365293:~# echo c > /proc/sysrq-trigger >>> root@test-pace2-2365308:~# crm stat >>> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52 >>> 2018 by root via cibadmin on test-pace2-2365308 >>> Stack: corosync >>> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with >>> quorum >>> 2 nodes and 2 resources configured >>> >>> Online: [ test-pace1-2365293 test-pace2-2365308 ] >>> >>> Full list of resources: >>> >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Masters: [ test-pace1-2365293 ] >>> Slaves: [ test-pace2-2365308 ] >>> >>> root@test-pace2-2365308:~# crm stat >>> Last updated: Thu May 17 15:15:45 2018 Last change: Wed May 16 23:10:52 >>> 2018 by root via cibadmin on test-pace2-2365308 >>> Stack: corosync >>> Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition >>> WITHOUT quorum >>> 2 nodes and 2 resources configured >>> >>> Online: [ test-pace2-2365308 ] >>> OFFLINE: [ test-pace1-2365293 ] >>> >>> Full list of resources: >>> >>> Master/Slave Set: ovndb_servers-master [ovndb_servers] >>> Slaves: [ test-pace2-2365308 ] >>> Stopped: [ test-pace1-2365293 ] >>> >>> root@test-pace2-2365308:~# ps aux | grep ovs >>> root 15175 0.0 0.0 18048 372 ? Ss 15:15 0:00 >>> ovsdb-server: monitoring pid 15176 (healthy) >>> root 15176 0.0 0.0 18312 4096 ? 
S 15:15 0:00 >>> ovsdb-server -vconsole:off -vfile:info >>> --log-file=/var/log/openvswitch/ovsdb-server-nb.log >>> --remote=punix:/var/run/openvswitch/ovnnb_db.sock >>> --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl >>> --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections >>> --private-key=db:OVN_Northbound,SSL,private_key >>> --certificate=db:OVN_Northbound,SSL,certificate >>> --ca-cert=db:OVN_Northbound,SSL,ca_cert >>> --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols >>> --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers >>> --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641 >>> /etc/openvswitch/ovnnb_db.db >>> root 15184 0.0 0.0 18048 376 ? Ss 15:15 0:00 >>> ovsdb-server: monitoring pid 15185 (healthy) >>> root 15185 0.0 0.0 18300 4480 ? S 15:15 0:00 >>> ovsdb-server -vconsole:off -vfile:info >>> --log-file=/var/log/openvswitch/ovsdb-server-sb.log >>> --remote=punix:/var/run/openvswitch/ovnsb_db.sock >>> --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl >>> --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections >>> --private-key=db:OVN_Southbound,SSL,private_key >>> --certificate=db:OVN_Southbound,SSL,certificate >>> --ca-cert=db:OVN_Southbound,SSL,ca_cert >>> --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols >>> --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers >>> --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642 >>> /etc/openvswitch/ovnsb_db.db >>> root 15398 0.0 0.0 12940 972 pts/0 S+ 15:15 0:00 grep >>> --color=auto ovs >>> >>> >>>I just want to point out that I am also seeing below errors when >>> setting target with master IP using ipaddr2 resource too! 
>>> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641: >>> 192.168.220.108: listen failed: Cannot assign requested address >>> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: >>> bind: Cannot assign requested address >>> That needs to be handled too, since the existing code does throw this error! >>> The error goes away only if I skip setting the target. >>> >> >> In the case of tripleo, we handle this error by setting the sysctl >> value net.ipv4.ip_nonlocal_bind to 1 - https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67 >> >>> Sweet, I can try setting this to get rid of the socket error. >> >> >>> >>> >>> >>> Regards, >>> Aliasgar >>> >>> >>> On Thu, May 17, 2018 at 3:04 AM, <nusid...@redhat.com> wrote: >>> >>>> From: Numan Siddique <nusid...@redhat.com> >>>> >>>> When a node 'A' in the pacemaker cluster running OVN db servers in master is >>>> brought down ungracefully ('echo b > /proc/sysrq-trigger' for example), pacemaker >>>> is not able to promote any other node to master in the cluster. When pacemaker selects >>>> a node 'B', for instance, to promote, it moves the IPAddr2 resource (i.e. the master IP) >>>> to node 'B'. As soon as the node is configured with the IP address, when the issue is >>>> seen, the OVN db servers which were running as standby earlier transition to active. >>>> Ideally this should not have happened. The ovsdb-servers are expected to remain in >>>> standby until they are promoted. (This needs separate investigation.) When pacemaker >>>> calls the OVN OCF script's promote action, the ovsdb_server_promote function returns >>>> almost immediately without recording the present master. And later, in the notify action, >>>> it demotes the OVN db servers back, since the last known master doesn't match >>>> node 'B's hostname. This results in pacemaker promoting/demoting in a loop.
>>>> >>>> This patch fixes the issue by not returning immediately when the promote action is >>>> called if the OVN db servers are running as active. Now it continues with >>>> the ovsdb_server_promote function and records the new master by setting the proper >>>> master score ($CRM_MASTER -N $host_name -v ${master_score}). >>>> >>>> This issue is not seen when a node is brought down gracefully, as pacemaker, before >>>> promoting a node, calls the stop, start, and then promote actions. Not sure why pacemaker >>>> doesn't call the stop, start, and promote actions when a node is reset >>>> ungracefully. >>>> >>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025 >>>> Signed-off-by: Numan Siddique <nusid...@redhat.com> >>>> --- >>>> ovn/utilities/ovndb-servers.ocf | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/ovn/utilities/ovndb-servers.ocf >>>> b/ovn/utilities/ovndb-servers.ocf >>>> index 164b6bce6..23dc70056 100755 >>>> --- a/ovn/utilities/ovndb-servers.ocf >>>> +++ b/ovn/utilities/ovndb-servers.ocf >>>> @@ -409,7 +409,7 @@ ovsdb_server_promote() { >>>> rc=$? >>>> case $rc in >>>> ${OCF_SUCCESS}) ;; >>>> - ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};; >>>> + ${OCF_RUNNING_MASTER}) ;; >>>> *) >>>> ovsdb_server_master_update $OCF_RUNNING_MASTER >>>> return ${rc} >>>> -- >>>> 2.17.0 >>>> >>>> _______________________________________________ >>>> dev mailing list >>>> d...@openvswitch.org >>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev >>>> >>> >>> >> >
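As a footnote to the ip_nonlocal_bind workaround discussed earlier in the thread, the setting can be applied like this (a sketch; the sysctl.d file name is an arbitrary choice of ours, not from the thread or from tripleo):

```shell
# Let ovsdb-server bind the master VIP even while IPaddr2 has not yet
# moved the address to this node (avoids "Cannot assign requested address").
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Persist across reboots (the file name is a convention; adjust per distro).
echo 'net.ipv4.ip_nonlocal_bind = 1' > /etc/sysctl.d/99-ovn-nonlocal-bind.conf
sysctl -p /etc/sysctl.d/99-ovn-nonlocal-bind.conf
```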