> Classic fence loop.  Try this doc:
> https://access.redhat.com/site/solutions/272913
I don't have a Red Hat subscription (apologies if that is expected to
participate in this list).

My understanding was that this scenario should only happen when
networking between the two nodes does not work properly. Would you
mind explaining why it happens in my case, where the nodes can (and do)
communicate with each other and the post_join_delay is very high?
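
To make the timing concrete, here is a minimal Python sketch that lines up the timestamps from the fence_tool dump quoted below. It is not part of any cluster tooling; the helper names (parse, first_time) are made up for illustration, and the sample lines are copied from the dump. It only measures intervals and does not explain why fenced decides on a stateful merge.

```python
import re

# Sample lines copied from the fence_tool dump quoted in this thread.
DUMP = """\
1378890850 daemon node 2 join 1378890849 left 1378859110 local quorum 1378855487
1378890850 daemon node 2 stateful merge
1378890850 daemon node 2 kill due to stateful merge
1378890862 delay post_join_delay 360 quorate_from_last_update 0
1378891222 fencing node rmg-de-2
1378891236 fence rmg-de-2 success
"""

def parse(dump):
    """Return (epoch_seconds, message) pairs for each well-formed line."""
    events = []
    for line in dump.splitlines():
        m = re.match(r"(\d+)\s+(.*)", line)
        if m:
            events.append((int(m.group(1)), m.group(2)))
    return events

def first_time(events, needle):
    """Timestamp of the first event whose message contains needle, else None."""
    for ts, msg in events:
        if needle in msg:
            return ts
    return None

events = parse(DUMP)

# The join time is embedded in the message of the first sample line.
join = int(re.search(r"join (\d+)", events[0][1]).group(1))
kill = first_time(events, "kill due to stateful merge")
delay_start = first_time(events, "delay post_join_delay")
fenced = first_time(events, "fencing node")

print("join -> kill:", kill - join, "s")
print("delay start -> fencing:", fenced - delay_start, "s")
```

On these lines, node 2 is killed one second after it joins, and the fencing starts exactly 360 seconds after the delay begins. So the post_join_delay only postpones the fencing; the removal from the cluster happens before the delay comes into play.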

On Wed, Sep 11, 2013 at 01:03:07PM +0200, Pascal Ehlert wrote:

>> Hi,
>> I have recently set up an HA cluster with two nodes, IPMI-based fencing
>> and no quorum disk. Things worked nicely during the first tests, but to my
>> great annoyance it blew up last night when I did another test of shutting
>> down the network interface on my secondary node (node 2).
>> The node was fenced as expected and came back online. This however
>> resulted in an immediate fencing of the other node.
>> Fencing went back and forth until I manually powered off node 2 and gave
>> node 1 a few minutes to settle down.
>> Now when I switch node 2 back on, it looks like it joins the cluster and
>> is kicked out immediately again, which again results in fencing of node
>> 2. I have purposely set the post_join_delay to a high value, but it
>> didn't help.
>> Below are my cluster.conf and log files. My own guess would be that the
>> problem is associated with the fact that the node tries to do a stateful
>> merge when it really should be joining without state after a clean
>> reboot (see fence_tool dump, line 9).
>> --------------
>> root@rmg-de-1:~# cat /etc/pve/cluster.conf
>> <?xml version="1.0"?>
>> <cluster config_version="14" name="rmg-de-cl1">
>>   <cman expected_votes="1" keyfile="/var/lib/pve-cluster/corosync.authkey" two_node="1"/>
>>   <fencedevices>
>>     <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.11" login="FENCING" name="fenceNode1" passwd="abc"/>
>>     <fencedevice agent="fence_ipmilan" ipaddr="10.xx.xx.12" login="FENCING" name="fenceNode2" passwd="abc"/>
>>   </fencedevices>
>>   <clusternodes>
>>     <clusternode name="rmg-de-1" nodeid="1" votes="1">
>>       <fence>
>>         <method name="1">
>>           <device action="reboot" name="fenceNode1"/>
>>         </method>
>>       </fence>
>>     </clusternode>
>>     <clusternode name="rmg-de-2" nodeid="2" votes="1">
>>       <fence>
>>         <method name="1">
>>           <device action="reboot" name="fenceNode2"/>
>>         </method>
>>       </fence>
>>     </clusternode>
>>   </clusternodes>
>>   <fence_daemon post_join_delay="360" />
>>   <rm>
>>     <pvevm autostart="1" vmid="101"/>
>>     <pvevm autostart="1" vmid="100"/>
>>     <pvevm autostart="1" vmid="104"/>
>>     <pvevm autostart="1" vmid="103"/>
>>     <pvevm autostart="1" vmid="102"/>
>>   </rm>
>> </cluster>
>> --------------
>> --------------
>> root@rmg-de-1:~# fence_tool dump | tail -n 40
>> 1378890849 daemon node 1 max run
>> 1378890849 daemon node 1 join 1378855487 left 0 local quorum 1378855487
>> 1378890849 receive_start 1:12 len 152
>> 1378890849 match_change 1:12 matches cg 12
>> 1378890849 wait_messages cg 12 need 1 of 2
>> 1378890850 receive_protocol from 2 max run
>> 1378890850 daemon node 2 max run
>> 1378890850 daemon node 2 join 1378890849 left 1378859110 local quorum 1378855487
>> 1378890850 daemon node 2 stateful merge
>> 1378890850 daemon node 2 kill due to stateful merge
>> 1378890850 telling cman to remove nodeid 2 from cluster
>> 1378890862 cluster node 2 removed seq 832
>> 1378890862 fenced:daemon conf 1 0 1 memb 1 join left 2
>> 1378890862 fenced:daemon ring 1:832 1 memb 1
>> 1378890862 fenced:default conf 1 0 1 memb 1 join left 2
>> 1378890862 add_change cg 13 remove nodeid 2 reason 3
>> 1378890862 add_change cg 13 m 1 j 0 r 1 f 1
>> 1378890862 add_victims node 2
>> 1378890862 check_ringid cluster 832 cpg 1:828
>> 1378890862 fenced:default ring 1:832 1 memb 1
>> 1378890862 check_ringid done cluster 832 cpg 1:832
>> 1378890862 check_quorum done
>> 1378890862 send_start 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
>> 1378890862 cpg_mcast_joined retried 1 start
>> 1378890862 receive_start 1:13 len 152
>> 1378890862 match_change 1:13 skip cg 12 already start
>> 1378890862 match_change 1:13 matches cg 13
>> 1378890862 wait_messages cg 13 got all 1
>> 1378890862 set_master from 1 to complete node 1
>> 1378890862 delay post_join_delay 360 quorate_from_last_update 0
>> 1378891222 delay of 360s leaves 1 victims
>> 1378891222 rmg-de-2 not a cluster member after 360 sec post_join_delay
>> 1378891222 fencing node rmg-de-2
>> 1378891236 fence rmg-de-2 dev 0.0 agent fence_ipmilan result: success
>> 1378891236 fence rmg-de-2 success
>> 1378891236 send_victim_done cg 13 flags 2 victim nodeid 2
>> 1378891236 send_complete 1:13 flags 2 started 6 m 1 j 0 r 1 f 1
>> 1378891236 receive_victim_done 1:13 flags 2 len 80
>> 1378891236 receive_victim_done 1:13 remove victim 2 time 1378891236 how 1
>> 1378891236 receive_complete 1:13 len 152:
>> --------------
>> --------------
>> root@rmg-de-1:~# tail -n 100 /var/log/cluster/corosync.log
>> Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Sep 11 11:14:09 corosync [CLM   ] New Configuration:
>> Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
>> Sep 11 11:14:09 corosync [CLM   ] Members Left:
>> Sep 11 11:14:09 corosync [CLM   ] Members Joined:
>> Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Sep 11 11:14:09 corosync [CLM   ] New Configuration:
>> Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
>> Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
>> Sep 11 11:14:09 corosync [CLM   ] Members Left:
>> Sep 11 11:14:09 corosync [CLM   ] Members Joined:
>> Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
>> Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
>> Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
>> Sep 11 11:14:09 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:1 left:0)
>> Sep 11 11:14:09 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>> Sep 11 11:14:20 corosync [TOTEM ] A processor failed, forming new configuration.
>> Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Sep 11 11:14:22 corosync [CLM   ] New Configuration:
>> Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
>> Sep 11 11:14:22 corosync [CLM   ] Members Left:
>> Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
>> Sep 11 11:14:22 corosync [CLM   ] Members Joined:
>> Sep 11 11:14:22 corosync [QUORUM] Members[1]: 1
>> Sep 11 11:14:22 corosync [CLM   ] CLM CONFIGURATION CHANGE
>> Sep 11 11:14:22 corosync [CLM   ] New Configuration:
>> Sep 11 11:14:22 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
>> Sep 11 11:14:22 corosync [CLM   ] Members Left:
>> Sep 11 11:14:22 corosync [CLM   ] Members Joined:
>> Sep 11 11:14:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Sep 11 11:14:22 corosync [CPG   ] chosen downlist: sender r(0) ip(10.xx.xx.1) ; members(old:2 left:1)
>> Sep 11 11:14:22 corosync [MAIN  ] Completed service synchronization, ready to provide service.
>> --------------
>> --------------
>> root@rmg-de-1:~# dlm_tool ls
>> dlm lockspaces
>> name          rgmanager
>> id            0x5231f3eb
>> flags         0x00000000
>> change        member 1 joined 0 remove 1 failed 1 seq 12,13
>> members       1
>> --------------
>> Unfortunately I only have the output of the currently operational node,
>> as the other one is fenced very quickly and the logs are hard to
>> retrieve. If someone has an idea however, I'll do my best to provide
>> these as well.
>> Thanks,
>> Pascal
>> -- 
>> Linux-cluster mailing list
>> Linux-cluster@redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
