Hi Jan,
first of all, I don't know the SBD fencing infrastructure myself (I just read the article you linked). But as far as I understand it, the "normal" fencing (initiated on behalf of Pacemaker) works like this: the SBD stonith resource (agent) writes a fencing request into one or more SBD partitions, where the SBD daemon on the node to be fenced is listening and, hopefully, reacts to it. So I'm fairly sure (without knowing the details) that you have to configure the stonith agent in such a way that Pacemaker knows how to talk to it in order to kill a particular cluster node.

The problem in your scenario: the agent that should be contacted to stonith node 2 is (or was) running on node 2 itself and can no longer be reached. For that reason, the stonith configuration in a two-node cluster is usually done as follows: a stonith agent runs on every node, and each agent is configured to stonith the OTHER node. You have to make sure this is technically always possible. This can be achieved with resource clones or, which is IMHO simpler in a two-node environment, with two stonith resources and a negative colocation constraint (a rough crm sketch follows at the very bottom of this mail, below your quoted message). As far as I know there is also a self-stonith safety belt implemented, in the sense that a stonith agent running on the node to be shot is never contacted. (Do I remember that correctly?)

I'm sure this may solve your problem.

Best regards
Andreas Mock

From: Jan Christian Kaldestad [mailto:janc...@gmail.com]
Sent: Thursday, 1 August 2013 15:46
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] Problems with SBD fencing

Hi,

I am evaluating the SLES HA Extension 11 SP3 product. The cluster consists of 2 nodes (active/passive), using an SBD stonith resource on a shared SAN disk, configured according to http://www.linux-ha.org/wiki/SBD_Fencing. The SBD daemon is running on both nodes, and the stonith resource (defined as a primitive) is running on one node only. There is also a monitor operation for the stonith resource (interval=36000, timeout=20).

I am having some problems getting failover/fencing to work as expected in the following scenario:
- Node 1 is running the resources that I created (except stonith)
- Node 2 is running the stonith resource
- Disconnect Node 2 from the network by bringing the interface down
- Node 2 status changes to UNCLEAN (offline), but the stonith resource does not switch over to Node 1 and Node 2 does not reboot as I would expect.
- Checking the logs on Node 1, I notice the following:

Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
...
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: te_fence_node: Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: handle_request: Client crmd.8916.3144546f wants to fence (reboot) 'slesha1n2i-u' with device '(any)'
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)
Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: error: remote_op_done: Operation reboot of slesha1n2i-u by slesha1n1i-u for crmd.8916@slesha1n1i-u.8c00ff7b: No route to host
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No route to host): aborting transition.
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_notify: Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b) by client crmd.8916
Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: run_graph: Transition 3 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: unpack_config: On loss of CCM Quorum: Ignore
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
...
Aug 1 12:00:02 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (11), giving up

- Then I bring Node 2 online again and start the cluster service, checking the logs:

Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 376: memb=1, new=0, lost=0
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: memb: slesha1n1i-u 168824371
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: ais_dispatch_message: Membership 376: quorum acquired
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: ais_dispatch_message: Membership 376: quorum acquired
Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 376: memb=2, new=1, lost=0
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node 168824372/slesha1n2i-u is now: member
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: NEW: slesha1n2i-u 168824372
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n1i-u 168824371
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: peer_update_callback: Node return implies stonith of slesha1n2i-u (action 24) completed
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n2i-u 168824372
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: 0x69f2f0 Node 168824372 (slesha1n2i-u) born on: 376
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_quorum: Updating quorum status to true (call=119)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CPG ] chosen downlist: sender r(0) ip(x.x.x.x) ; members(old:1 left:0)
Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:14 slesha1n1i-u mgmtd: [8917]: info: CIB query: cib
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000151302 (1381122)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000141302 (1315586)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000101302 (1053442)
Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE origin=route_message ]
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up

- Cluster status changes to Online for both nodes, but the stonith resource won't start on any of the nodes.
- Trying to start the resource manually, but no success.
- Trying to restart the corosync process on Node 1 (rcopenais restart), but it hangs forever. Checking the logs:

Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [SERV ] Unloading all Corosync service engines.
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Shuting down Pacemaker
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to mgmtd: [8917]
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: mgmtd is shutting down
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=1 cib_name=live
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=2 cib_name=live
Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: debug: [mgmtd] stopped
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: mgmtd confirmed stopped
Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to crmd: [8916]
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (1375353728)
Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_perform_update: Sent update 22: shutdown=1375353728
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
Aug 1 12:42:08 slesha1n1i-u crmd[8916]: warning: do_log: FSA: Input I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
Aug 1 12:42:38 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
Aug 1 12:43:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
...
- Finally I kill the corosync process on Node 1 (killall -9 corosync), and then corosync restarts.
- Checking the status: all resources are up and running on Node 1, and the stonith resource is running on Node 2 again.

I have tested the same scenario several times. Sometimes the fencing mechanism works as expected, but other times the stonith resource is not transferred to Node 1, as described here. So I need some assistance to overcome this problem.

--
Best regards
Jan
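
P.S. Here is a rough, untested sketch (crm shell syntax) of what I mean by two stonith resources plus constraints, using your node names slesha1n1i-u and slesha1n2i-u. The resource IDs and the monitor interval are just placeholders, and I am assuming the external/sbd agent shipped with the SLES HA Extension; please check the parameter names against your setup before loading anything:

# sketch only: one SBD stonith primitive per node, each responsible for fencing exactly one node
primitive fence-n1 stonith:external/sbd \
        params pcmk_host_list="slesha1n1i-u" \
        op monitor interval="3600s"
primitive fence-n2 stonith:external/sbd \
        params pcmk_host_list="slesha1n2i-u" \
        op monitor interval="3600s"
# keep the two stonith resources on different nodes
colocation col-fence-apart -inf: fence-n1 fence-n2
# and never run an agent on the very node it is supposed to shoot
location loc-fence-n1 fence-n1 -inf: slesha1n1i-u
location loc-fence-n2 fence-n2 -inf: slesha1n2i-u

The location constraints may be redundant given the colocation, but they make the intent explicit. To verify that the poison-pill path works at all, independent of Pacemaker, you can also exercise the SBD device manually from Node 1 (the device path below is just an example):

# list the message slots, then send a harmless test message to node 2
sbd -d /dev/mapper/sbd_device list
sbd -d /dev/mapper/sbd_device message slesha1n2i-u test

If the sbd daemon on slesha1n2i-u is really watching the partition, the test message should show up in its syslog.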