On Thu, May 17, 2018 at 11:23 PM, Numan Siddique <nusid...@redhat.com> wrote:
> On Fri, May 18, 2018 at 4:24 AM, aginwala <aginw...@asu.edu> wrote:
>
>> Hi:
>>
>> I tried it and it didn't help: the IP resource always shows as stopped.
>> My private VIP is 192.168.220.108.
>>
>> # kernel panic on active node
>> root@test7:~# echo c > /proc/sysrq-trigger
>>
>> root@test6:~# crm stat
>> Last updated: Thu May 17 22:46:38 2018
>> Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
>> Stack: corosync
>> Current DC: test7 (version 1.1.14-70404b0) - partition with quorum
>> 2 nodes and 3 resources configured
>>
>> Online: [ test6 test7 ]
>>
>> Full list of resources:
>>
>> VirtualIP (ocf::heartbeat:IPaddr2): Started test7
>> Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>     Masters: [ test7 ]
>>     Slaves: [ test6 ]
>>
>> root@test6:~# crm stat
>> Last updated: Thu May 17 22:46:38 2018
>> Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
>> Stack: corosync
>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
>> 2 nodes and 3 resources configured
>>
>> Online: [ test6 ]
>> OFFLINE: [ test7 ]
>>
>> Full list of resources:
>>
>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>> Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>     Slaves: [ test6 ]
>>     Stopped: [ test7 ]
>>
>> root@test6:~# crm stat
>> Last updated: Thu May 17 22:49:26 2018
>> Last change: Thu May 17 22:45:03 2018 by root via cibadmin on test6
>> Stack: corosync
>> Current DC: test6 (version 1.1.14-70404b0) - partition WITHOUT quorum
>> 2 nodes and 3 resources configured
>>
>> Online: [ test6 ]
>> OFFLINE: [ test7 ]
>>
>> Full list of resources:
>>
>> VirtualIP (ocf::heartbeat:IPaddr2): Stopped
>> Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>     Stopped: [ test6 test7 ]
>>
>> I think this change is not needed, or something else is wrong when using
>> the virtual IP resource.
>
> Hi Aliasgar, I think you haven't created the resource properly, or haven't
> set the colocation constraints properly.
> What pcs/crm commands did you use to create the OVN db resources?
> Can you share the output of "pcs resource show ovndb_servers" and
> "pcs constraint"?
> In case of tripleo we create the resource like this -
> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L80

>>> I am using the same commands suggested upstream in the OVS document to
create the resources. I am skipping the manage-northd option and using the
default inactivity probe interval:
http://docs.openvswitch.org/en/latest/topics/integration/#ha-for-ovn-db-servers-using-pacemaker

# cat pcs_with_ipaddr2.sh
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.220.108" op monitor interval="30s"
pcs resource create ovndb_servers ocf:ovn:ovndb-servers \
    master_ip="192.168.220.108" \
    op monitor interval="10s" \
    op monitor role=Master interval="15s" --debug
pcs resource master ovndb_servers-master ovndb_servers \
    meta notify="true"
pcs constraint order promote ovndb_servers-master then VirtualIP
pcs constraint colocation add VirtualIP with master ovndb_servers-master \
    score=INFINITY

# pcs resource show ovndb_servers
 Resource: ovndb_servers (class=ocf provider=ovn type=ovndb-servers)
  Attributes: master_ip=192.168.220.108
  Operations: start interval=0s timeout=30s (ovndb_servers-start-interval-0s)
              stop interval=0s timeout=20s (ovndb_servers-stop-interval-0s)
              promote interval=0s timeout=50s (ovndb_servers-promote-interval-0s)
              demote interval=0s timeout=50s (ovndb_servers-demote-interval-0s)
              monitor interval=10s (ovndb_servers-monitor-interval-10s)
              monitor interval=15s role=Master (ovndb_servers-monitor-interval-15s)

# pcs constraint
Location Constraints:
Ordering Constraints:
  promote ovndb_servers-master then start VirtualIP (kind:Mandatory)
Colocation Constraints:
  VirtualIP with ovndb_servers-master (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)

>> Maybe you need a similar promotion logic to the one we have for the LB
>> VIP with pacemaker in the discussion
>> (will submit a formal patch soon). I did test with a kernel panic with
>> the LB code change and it works fine: node2 gets promoted. The setup
>> below works fine for LB even with a kernel panic and without this change:
>
> This issue is not seen all the time. I have another setup where I don't
> see this issue at all. The issue is seen when the IPaddr2 resource is
> moved to another slave node and the ovsdb-servers start reporting as
> master as soon as the IP address is configured.
>
> When the issue is seen we hit the code here -
> https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L412.
> Ideally, when the promote action is called, the ovsdb servers will be
> running as slaves/standby and the promote action promotes them to master.
> But when the issue is seen, the ovsdb servers report their status as
> active. Because of that we don't complete the full promote action and
> return at L412. And later, when the notify action is called, we demote
> the servers because of this -
> https://github.com/openvswitch/ovs/blob/master/ovn/utilities/ovndb-servers.ocf#L176

>>> Yes, I agree! As you said, the settings work fine in one cluster, and
if you use another cluster with the same settings you may see surprises.

> For a use case like yours (where a load balancer VIP is used), you may
> not see this issue at all, since you will not be using the IPaddr2
> resource as the master ip.

>>> Correct, I just wanted to test both settings to let you know
pacemaker's behavior with the IPaddr2 VIP vs the LB VIP.
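>>> To make the promote/demote loop concrete, here is a condensed,
hypothetical sketch in plain shell (not the real OCF script; the function
names and the stand-in rc values are illustrative, following the OCF
convention of 0 = success, 8 = already running as master) of the early
return at L412 versus the patched fall-through:

```shell
# Condensed, hypothetical sketch of ovsdb_server_promote's case statement.
# The rc argument stands in for the result of the real status probe.
OCF_SUCCESS=0
OCF_RUNNING_MASTER=8

promote_before_patch() {
    rc=$1
    case $rc in
        ${OCF_SUCCESS}) ;;
        # Early return: the new master score is never recorded, so the
        # later notify action sees a stale master and demotes this node.
        ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
        *) return ${rc};;
    esac
    echo "master score recorded"
}

promote_after_patch() {
    rc=$1
    case $rc in
        ${OCF_SUCCESS}) ;;
        # Fall through: an already-active server still gets its master
        # score recorded, so notify no longer demotes it.
        ${OCF_RUNNING_MASTER}) ;;
        *) return ${rc};;
    esac
    echo "master score recorded"
}

promote_before_patch ${OCF_RUNNING_MASTER}   # prints nothing
promote_after_patch ${OCF_RUNNING_MASTER}    # prints: master score recorded
```

With the early return, the master score is never set for the new node, so
the subsequent notify action demotes it again; with the fall-through,
promotion completes even when the server already reports itself as active.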
>> root@test-pace1-2365293:~# echo c > /proc/sysrq-trigger
>>
>> root@test-pace2-2365308:~# crm stat
>> Last updated: Thu May 17 15:15:45 2018
>> Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
>> Stack: corosync
>> Current DC: test-pace1-2365293 (version 1.1.14-70404b0) - partition with quorum
>> 2 nodes and 2 resources configured
>>
>> Online: [ test-pace1-2365293 test-pace2-2365308 ]
>>
>> Full list of resources:
>>
>> Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>     Masters: [ test-pace1-2365293 ]
>>     Slaves: [ test-pace2-2365308 ]
>>
>> root@test-pace2-2365308:~# crm stat
>> Last updated: Thu May 17 15:15:45 2018
>> Last change: Wed May 16 23:10:52 2018 by root via cibadmin on test-pace2-2365308
>> Stack: corosync
>> Current DC: test-pace2-2365308 (version 1.1.14-70404b0) - partition WITHOUT quorum
>> 2 nodes and 2 resources configured
>>
>> Online: [ test-pace2-2365308 ]
>> OFFLINE: [ test-pace1-2365293 ]
>>
>> Full list of resources:
>>
>> Master/Slave Set: ovndb_servers-master [ovndb_servers]
>>     Slaves: [ test-pace2-2365308 ]
>>     Stopped: [ test-pace1-2365293 ]
>>
>> root@test-pace2-2365308:~# ps aux | grep ovs
>> root  15175  0.0  0.0  18048  372  ?  Ss  15:15  0:00
>>   ovsdb-server: monitoring pid 15176 (healthy)
>> root  15176  0.0  0.0  18312  4096  ?  S  15:15  0:00
>>   ovsdb-server -vconsole:off -vfile:info
>>   --log-file=/var/log/openvswitch/ovsdb-server-nb.log
>>   --remote=punix:/var/run/openvswitch/ovnnb_db.sock
>>   --pidfile=/var/run/openvswitch/ovnnb_db.pid --unixctl=ovnnb_db.ctl
>>   --detach --monitor --remote=db:OVN_Northbound,NB_Global,connections
>>   --private-key=db:OVN_Northbound,SSL,private_key
>>   --certificate=db:OVN_Northbound,SSL,certificate
>>   --ca-cert=db:OVN_Northbound,SSL,ca_cert
>>   --ssl-protocols=db:OVN_Northbound,SSL,ssl_protocols
>>   --ssl-ciphers=db:OVN_Northbound,SSL,ssl_ciphers
>>   --remote=ptcp:6641:0.0.0.0 --sync-from=tcp:192.0.2.254:6641
>>   /etc/openvswitch/ovnnb_db.db
>> root  15184  0.0  0.0  18048  376  ?
>>   Ss  15:15  0:00
>>   ovsdb-server: monitoring pid 15185 (healthy)
>> root  15185  0.0  0.0  18300  4480  ?  S  15:15  0:00
>>   ovsdb-server -vconsole:off -vfile:info
>>   --log-file=/var/log/openvswitch/ovsdb-server-sb.log
>>   --remote=punix:/var/run/openvswitch/ovnsb_db.sock
>>   --pidfile=/var/run/openvswitch/ovnsb_db.pid --unixctl=ovnsb_db.ctl
>>   --detach --monitor --remote=db:OVN_Southbound,SB_Global,connections
>>   --private-key=db:OVN_Southbound,SSL,private_key
>>   --certificate=db:OVN_Southbound,SSL,certificate
>>   --ca-cert=db:OVN_Southbound,SSL,ca_cert
>>   --ssl-protocols=db:OVN_Southbound,SSL,ssl_protocols
>>   --ssl-ciphers=db:OVN_Southbound,SSL,ssl_ciphers
>>   --remote=ptcp:6642:0.0.0.0 --sync-from=tcp:192.0.2.254:6642
>>   /etc/openvswitch/ovnsb_db.db
>> root  15398  0.0  0.0  12940  972  pts/0  S+  15:15  0:00
>>   grep --color=auto ovs
>>
>> >>> I just want to point out that I am also seeing the errors below when
>> setting the target with the master IP using the IPaddr2 resource too!
>> 2018-05-17T21:58:51.889Z|00011|ovsdb_jsonrpc_server|ERR|ptcp:6641:192.168.220.108: listen failed: Cannot assign requested address
>> 2018-05-17T21:58:51.889Z|00012|socket_util|ERR|6641:192.168.220.108: bind: Cannot assign requested address
>> That needs to be handled too, since the existing code throws this error!
>> The error goes away only if I skip setting the target.
>
> In the case of tripleo, we handle this error by setting the sysctl
> value net.ipv4.ip_nonlocal_bind to 1 -
> https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/pacemaker/ovn_northd.pp#L67

>>> Sweet, I can try setting this to get rid of the socket error.
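>>> For reference, the tripleo-style workaround from the link above would
look something like this on each node (the sysctl.d file name below is my
own choice, not taken from the manifest):

```shell
# Allow binding to an IP address that is not (yet) configured locally, so
# ovsdb-server can listen on the master VIP before IPaddr2 moves it here.
sysctl -w net.ipv4.ip_nonlocal_bind=1

# Persist the setting across reboots (file name is illustrative).
echo 'net.ipv4.ip_nonlocal_bind = 1' > /etc/sysctl.d/99-ovn-nonlocal-bind.conf
```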
>>
>> Regards,
>> Aliasgar
>>
>> On Thu, May 17, 2018 at 3:04 AM, <nusid...@redhat.com> wrote:
>>
>>> From: Numan Siddique <nusid...@redhat.com>
>>>
>>> When a node 'A' in the pacemaker cluster running the OVN db servers as
>>> master is brought down ungracefully ('echo b > /proc/sysrq-trigger' for
>>> example), pacemaker is not able to promote any other node in the
>>> cluster to master. When pacemaker selects a node 'B' to promote, it
>>> moves the IPaddr2 resource (i.e. the master ip) to node 'B'. As soon as
>>> the node is configured with the IP address, when the issue is seen, the
>>> OVN db servers which were running as standby earlier transition to
>>> active. Ideally this should not have happened: the ovsdb-servers are
>>> expected to remain in standby until they are promoted. (This needs
>>> separate investigation.) When pacemaker calls the OVN OCF script's
>>> promote action, the ovsdb_server_promote function returns almost
>>> immediately without recording the present master. Later, in the notify
>>> action, it demotes the OVN db servers again, since the last known
>>> master doesn't match node 'B's hostname. This results in pacemaker
>>> promoting/demoting in a loop.
>>>
>>> This patch fixes the issue by not returning immediately when the
>>> promote action is called while the OVN db servers are running as
>>> active. Now the ovsdb_server_promote function continues and records the
>>> new master by setting the proper master score
>>> ($CRM_MASTER -N $host_name -v ${master_score}).
>>>
>>> This issue is not seen when a node is brought down gracefully, since
>>> pacemaker then calls the stop, start and promote actions before
>>> promoting a node. Not sure why pacemaker doesn't call the stop, start
>>> and promote actions when a node is reset ungracefully.
>>>
>>> Reported-at: https://bugzilla.redhat.com/show_bug.cgi?id=1579025
>>> Signed-off-by: Numan Siddique <nusid...@redhat.com>
>>> ---
>>>  ovn/utilities/ovndb-servers.ocf | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/ovn/utilities/ovndb-servers.ocf b/ovn/utilities/ovndb-servers.ocf
>>> index 164b6bce6..23dc70056 100755
>>> --- a/ovn/utilities/ovndb-servers.ocf
>>> +++ b/ovn/utilities/ovndb-servers.ocf
>>> @@ -409,7 +409,7 @@ ovsdb_server_promote() {
>>>      rc=$?
>>>      case $rc in
>>>         ${OCF_SUCCESS}) ;;
>>> -       ${OCF_RUNNING_MASTER}) return ${OCF_SUCCESS};;
>>> +       ${OCF_RUNNING_MASTER}) ;;
>>>         *)
>>>             ovsdb_server_master_update $OCF_RUNNING_MASTER
>>>             return ${rc}
>>> --
>>> 2.17.0
>>>
>>> _______________________________________________
>>> dev mailing list
>>> d...@openvswitch.org
>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev