[Pacemaker] Fenced node is killed again and again even after the connection is recovered!

2010-05-14 Thread Javen Wu
Hi Folks,

I set up a three-node cluster with SBD STONITH configured.
After I manually isolated one node by running "ifconfig eth1 down" on it, the
node was fenced as expected.
But after the reboot, even though the network has recovered, the node is killed
again as soon as I start openais and pacemaker.
In `crm_mon -n` I saw the node's state go from OFFLINE to ONLINE before it was
killed, and I saw its SBD slot go from reset -> clear -> reset.

I attached the syslog and the corosync log.
My CIB configuration is very simple.

Could you help me check what the problem is? In my mind, this is not expected
behaviour.

===%
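
For reference, the per-node slot state described above (reset -> clear ->
reset) can be inspected directly on the shared SBD device. A minimal sketch,
assuming the device is /dev/sdc as in the poster's CIB and that the slot name
matches the cluster node name:

    # show the slot allocated to each node and any pending message
    sbd -d /dev/sdc list

    # dump the on-disk header (timeouts, number of slots) of the device
    sbd -d /dev/sdc dump

    # manually clear a pending message for a node (<nodename> is a placeholder)
    sbd -d /dev/sdc message <nodename> clear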

Re: [Pacemaker] Fenced node is killed again and again even after the connection is recovered!

2010-05-14 Thread Javen Wu
I forgot to mention the version I used.
I used SLES11-SP1-HAE Beta5
Pacemaker 1.0.7
Corosync 1.2.0
Cluster Glue 1.0.3


2010/5/14 Javen Wu 

> Hi Folks,
>
> I set up a three-node cluster with SBD STONITH configured.
> After I manually isolated one node by running "ifconfig eth1 down" on it, the
> node was fenced as expected.
> But after the reboot, even though the network has recovered, the node is
> killed again as soon as I start openais and pacemaker.
> In `crm_mon -n` I saw the node's state go from OFFLINE to ONLINE before it
> was killed, and I saw its SBD slot go from reset -> clear -> reset.
>
> I attached the syslog and the corosync log.
> My CIB configuration is very simple.
>
> Could you help me check what the problem is? In my mind, this is not expected
> behaviour.
>
> ===%
> <cib have-quorum="1" admin_epoch="0" epoch="349" num_updates="99"
>      cib-last-written="Fri May 14 14:50:21 2010" dc-uuid="vm209">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version"
>                 value="1.1.1-530add2a3721a0ecccb24660a97dbfdaa3e68f51"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure"
>                 name="cluster-infrastructure" value="openais"/>
>         <nvpair id="cib-bootstrap-options-expected-quorum-votes"
>                 name="expected-quorum-votes" value="3"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <!-- three node entries; the node names were lost in the archive -->
>     </nodes>
>     <resources>
>       <primitive id="sbd-fencing" class="stonith" type="external/sbd">
>         <instance_attributes id="sbd-fencing-instance_attributes">
>           <nvpair id="sbd-fencing-instance_attributes-sbd_device"
>                   name="sbd_device" value="/dev/sdc"/>
>         </instance_attributes>
>         <operations>
>           <op name="monitor"/> <!-- op id and interval lost in the archive -->
>         </operations>
>       </primitive>
>     </resources>
>     <constraints/>
>   </configuration>
>   <status>
>     <!-- garbled in the archive; it contained three node_state entries:
>          one node online (in_ccm="true", crmd="online", join="member",
>          expected="member", probe_complete="true") with an lrm history
>          for sbd-fencing of a probe (rc-code="7"), a start (rc-code="0")
>          and a recurring monitor (rc-code="0"), all op-status="0";
>          the fenced node offline (in_ccm="false", crmd="offline",
>          join="down", expected="down",
>          crm-debug-origin="send_stonith_update");
>          and a third node online with an equivalent lrm history. -->
>   </status>
> </cib>
>
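
For readers more used to the crm shell than to raw CIB XML, the stonith
resource above corresponds roughly to the following sketch (the monitor
interval and timeout values here are illustrative, not taken from the
original CIB):

    # rough crm shell equivalent of the sbd-fencing primitive above
    crm configure primitive sbd-fencing stonith:external/sbd \
        params sbd_device="/dev/sdc" \
        op monitor interval="15s" timeout="20s"
    # STONITH must also be enabled cluster-wide (not shown in the CIB above)
    crm configure property stonith-enabled="true"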


-- 
Javen Wu
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Re: [Pacemaker] Fenced node is killed again and again even after the connection is recovered!

2010-05-14 Thread Steven Dake
ifconfig eth0 down is not a valid test case; that will likely lead to
bad things happening.

I recommend using iptables to test the software.
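
Blocking the cluster traffic with a firewall keeps the interface and its IP
address up, which is closer to a real network failure than taking the
interface down. A minimal sketch, assuming corosync uses the default UDP
port 5405 (adjust to the mcastport in your totem configuration):

    # on the node under test: drop corosync traffic in both directions
    iptables -A INPUT  -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5405 -j DROP

    # restore connectivity afterwards
    iptables -D INPUT  -p udp --dport 5405 -j DROP
    iptables -D OUTPUT -p udp --dport 5405 -j DROP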

Also, Corosync 1.2.2 is out, which fixes bugs present in Corosync 1.2.0.

Regards
-steve

On Fri, 2010-05-14 at 18:02 +0800, Javen Wu wrote:
> I forgot to mention the version I used.
> I used SLES11-SP1-HAE Beta5
> Pacemaker 1.0.7
> Corosync 1.2.0
> Cluster Glue 1.0.3
> 
> 
> 2010/5/14 Javen Wu 
> Hi Folks,
> 
> I set up a three-node cluster with SBD STONITH configured.
> After I manually isolated one node by running "ifconfig eth1
> down" on it, the node was fenced as expected.
> But after the reboot, even though the network has recovered,
> the node is killed again as soon as I start openais and
> pacemaker.
> In `crm_mon -n` I saw the node's state go from OFFLINE to
> ONLINE before it was killed, and I saw its SBD slot go from
> reset -> clear -> reset.
> 
> I attached the syslog and the corosync log.
> My CIB configuration is very simple.
> 
> Could you help me check what the problem is? In my mind, this
> is not expected behaviour.
> 
> ===%
> [CIB configuration snipped -- same as quoted above]

Re: [Pacemaker] Fenced node is killed again and again even after the connection is recovered!

2010-05-14 Thread Lars Marowsky-Bree
On 2010-05-14T18:02:15, Javen Wu  wrote:

> I forget mention the version I used.
> I used SLES11-SP1-HAE Beta5

Beta5 is quite outdated. If you are a participant in the beta program,
please update.

Also, please use the beta program mailing list for discussing this.


Regards,
Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

