Hi, I am evaluating the SLES HA Extension 11 SP3 product. The cluster consists of two nodes (active/passive) and uses an SBD stonith resource on a shared SAN disk, configured according to http://www.linux-ha.org/wiki/SBD_Fencing
The SBD daemon is running on both nodes, and the stonith resource (defined as a primitive) is running on one node only. There is also a monitor operation for the stonith resource (interval=36000, timeout=20).

I am having problems getting failover/fencing to work as expected in the following scenario:

- Node 1 is running the resources that I created (except stonith)
- Node 2 is running the stonith resource
- I disconnect Node 2 from the network by bringing its interface down
- Node 2's status changes to UNCLEAN (offline), but the stonith resource does not fail over to Node 1, and Node 2 does not reboot as I would expect.

Checking the logs on Node 1, I notice the following:

Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
...
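For reference, the stonith primitive and its monitor operation are configured roughly like this in the crm shell (the device path is a placeholder for my actual SAN disk, not the real path):

```shell
# Sketch of the stonith configuration; /dev/disk/by-id/<shared-SAN-disk>
# stands in for the real shared device.
crm configure primitive stonith_sbd stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/<shared-SAN-disk>" \
    op monitor interval="36000" timeout="20"
crm configure property stonith-enabled="true"
```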
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: te_fence_node: Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: handle_request: Client crmd.8916.3144546f wants to fence (reboot) 'slesha1n2i-u' with device '(any)'
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: error: remote_op_done: Operation reboot of slesha1n2i-u by slesha1n1i-u for crmd.8916@slesha1n1i-u.8c00ff7b: No route to host
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No route to host): aborting transition.
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_notify: Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b) by client crmd.8916
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: run_graph: Transition 3 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
...
Aug 1 12:00:02 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (11), giving up

Then I bring Node 2 back online and start the cluster service. Checking the logs:

Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 376: memb=1, new=0, lost=0
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: memb: slesha1n1i-u 168824371
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: ais_dispatch_message: Membership 376: quorum acquired
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: ais_dispatch_message: Membership 376: quorum acquired
Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 376: memb=2, new=1, lost=0
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node 168824372/slesha1n2i-u is now: member
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: NEW: slesha1n2i-u 168824372
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n1i-u 168824371
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: peer_update_callback: Node return implies stonith of slesha1n2i-u (action 24) completed
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n2i-u 168824372
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: 0x69f2f0 Node 168824372 (slesha1n2i-u) born on: 376
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_quorum: Updating quorum status to true (call=119)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CPG ] chosen downlist: sender r(0) ip(x.x.x.x) ; members(old:1 left:0)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:14 slesha1n1i-u mgmtd: [8917]: info: CIB query: cib
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000151302 (1381122)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000141302 (1315586)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000101302 (1053442)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE origin=route_message ]
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up

The cluster status changes to Online for both nodes, but the stonith resource will not start on either node. I try to start the resource manually, without success. I then try to restart the corosync process on Node 1 (rcopenais restart), but it hangs forever. Checking the logs:

Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [SERV ] Unloading all Corosync service engines.
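The manual attempts were roughly along these lines (from memory, so treat the exact sequence as approximate):

```shell
# Clear any failed actions recorded for the stonith resource, then try to start it.
crm resource cleanup stonith_sbd
crm resource start stonith_sbd

# One-shot status check: stonith_sbd still does not come up on either node.
crm_mon -1
```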
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Shuting down Pacemaker
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to mgmtd: [8917]
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: mgmtd is shutting down
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=1 cib_name=live
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=2 cib_name=live
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: debug: [mgmtd] stopped
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: mgmtd confirmed stopped
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to crmd: [8916]
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (1375353728)
Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_perform_update: Sent update 22: shutdown=1375353728
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: warning: do_log: FSA: Input I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
Aug 1 12:42:38 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
Aug 1 12:43:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
...

Finally I kill the corosync process on Node 1 (killall -9 corosync), after which corosync restarts. Checking the status again: all resources are up and running on Node 1, and the stonith resource is running on Node 2 again.

I have tested the same scenario several times. Sometimes the fencing mechanism works as expected, but other times the stonith resource is not transferred to Node 1, as described here.
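In case it helps to rule out the shared disk itself, this is how I inspect the SBD device from either node, using the standard sbd commands from the wiki page above (the device path is a placeholder for my real SAN disk):

```shell
SBD_DEV=/dev/disk/by-id/<shared-SAN-disk>   # placeholder for the actual device

sbd -d "$SBD_DEV" dump    # print the on-disk header (timeouts such as watchdog and msgwait)
sbd -d "$SBD_DEV" list    # show each node's slot and any pending fence message
sbd -d "$SBD_DEV" message slesha1n2i-u test   # write a harmless test message to Node 2's slot
```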
Any assistance in overcoming this problem would be much appreciated.

--
Best regards
Jan
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org