Hi Dejan,

can you explain how the SBD agent works when this resource is running on exactly the node which has to be stonithed?
Thank you in advance.

Best regards
Andreas Mock


-----Original Message-----
From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
Sent: Tuesday, 6 August 2013 11:15
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Problems with SBD fencing

Hi,

On Thu, Aug 01, 2013 at 07:58:55PM +0200, Jan Christian Kaldestad wrote:
> Thanks for the explanation. But I'm quite confused about the SBD stonith
> resource configuration, as the SBD fencing wiki clearly states:
> "The sbd agent does not need to and should not be cloned. If all of your
> nodes run SBD, as is most likely, not even a monitor action provides a real
> benefit, since the daemon would suicide the node if there was a problem."
>
> and also this thread
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-March/013507.html
> mentions that there should be only one SBD resource configured.
>
> Can someone please clarify? Should I configure 2 separate SBD resources,
> one for each cluster node?

No. One sbd resource is sufficient.

Thanks,

Dejan

> --
> Best regards
> Jan
>
> On Thu, Aug 1, 2013 at 6:47 PM, Andreas Mock <andreas.m...@web.de> wrote:
>
> > Hi Jan,
> >
> > first of all, I don't know the SBD fencing infrastructure (I just read the
> > article linked by you). But as far as I understand it, the "normal" fencing
> > (initiated on behalf of Pacemaker) is done in the following way:
> >
> > The SBD fencing resource (agent) writes a request for self-stonithing into
> > one or more SBD partitions, where the SBD daemon is listening and hopefully
> > reacting on it.
> > So I'm pretty sure (without knowing) that you have to configure the
> > stonith agent in a way that Pacemaker knows how to talk to the stonith
> > agent to kill a certain cluster node.
> > What is the problem in your scenario: the agent which should be contacted
> > to stonith node 2 is/was running on node 2 and can't be connected anymore.
> >
> > Because of that, stonith agent configuration is most of the time done the
> > following way in a two-node cluster:
> > On every node runs a stonith agent. The stonith agent is configured to
> > stonith the OTHER node. You have to be sure that this is technically
> > always possible.
> > This can be achieved with resource clones or - which is IMHO simpler - in
> > a two-node environment with two stonith resources and a negative
> > colocation constraint.
> >
> > As far as I know, there is also a self-stonith safety belt implemented,
> > in a way that a stonith agent on the node to be shot is never contacted.
> > (Do I remember that correctly?)
> >
> > I'm sure this may solve your problem.
> >
> > Best regards
> > Andreas Mock
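For illustration, a minimal crm shell sketch of the two-resource layout Andreas describes; it is not taken from the thread. The device path is a placeholder, the node names come from the logs further down, and using pcmk_host_list plus -inf location constraints (rather than a colocation constraint) is one assumed way to keep each agent away from the node it is meant to shoot:

    # one stonith resource per node, each only allowed to fence the other node
    primitive stonith_sbd_n1 stonith:external/sbd \
            params sbd_device="/dev/<shared-SAN-partition>" pcmk_host_list="slesha1n1i-u"
    primitive stonith_sbd_n2 stonith:external/sbd \
            params sbd_device="/dev/<shared-SAN-partition>" pcmk_host_list="slesha1n2i-u"
    # never run a fencing agent on its own target node
    location loc_stonith_n1 stonith_sbd_n1 -inf: slesha1n1i-u
    location loc_stonith_n2 stonith_sbd_n2 -inf: slesha1n2i-u

As Dejan notes above, this split is not needed for sbd: a single external/sbd resource can fence either node over the shared disk.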
> >
> > *From:* Jan Christian Kaldestad [mailto:janc...@gmail.com]
> > *Sent:* Thursday, 1 August 2013 15:46
> > *To:* pacemaker@oss.clusterlabs.org
> > *Subject:* [Pacemaker] Problems with SBD fencing
> >
> > Hi,
> >
> > I am evaluating the SLES HA Extension 11 SP3 product. The cluster
> > consists of 2 nodes (active/passive), using an SBD stonith resource on a
> > shared SAN disk. Configuration according to
> > http://www.linux-ha.org/wiki/SBD_Fencing
> > The SBD daemon is running on both nodes, and the stonith resource (defined
> > as a primitive) is running on one node only.
> > There is also a monitor operation for the stonith resource
> > (interval=36000, timeout=20).
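The thread does not show the actual CIB; judging from this description and the resource name stonith_sbd appearing in the logs below, the configuration is presumably roughly the following crm shell snippet (the device path is a placeholder):

    primitive stonith_sbd stonith:external/sbd \
            params sbd_device="/dev/<shared-SAN-partition>" \
            op monitor interval="36000" timeout="20"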
> > I am having some problems getting failover/fencing to work as expected in
> > the following scenario:
> > - Node 1 is running the resources that I created (except stonith)
> > - Node 2 is running the stonith resource
> > - Disconnect Node 2 from the network by bringing the interface down
> > - Node 2 status changes to UNCLEAN (offline), but the stonith resource
> >   does not switch over to Node 1 and Node 2 does not reboot as I would expect.
> > - Checking the logs on Node 1, I notice the following:
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
> > ...
> > Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: te_fence_node: Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)
> > Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: handle_request: Client crmd.8916.3144546f wants to fence (reboot) 'slesha1n2i-u' with device '(any)'
> > Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)
> > Aug 1 12:00:01 slesha1n1i-u stonith-ng[8912]: error: remote_op_done: Operation reboot of slesha1n2i-u by slesha1n1i-u for crmd.8916@slesha1n1i-u.8c00ff7b: No route to host
> > Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)
> > Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No route to host): aborting transition.
> > Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: tengine_stonith_notify: Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b) by client crmd.8916
> > Aug 1 12:00:01 slesha1n1i-u crmd[8916]: notice: run_graph: Transition 3 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: unpack_config: On loss of CCM Quorum: Ignore
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: determine_online_status: Node slesha1n2i-u is unclean
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: warning: stage6: Scheduling Node slesha1n2i-u for STONITH
> > Aug 1 12:00:01 slesha1n1i-u pengine[8915]: notice: LogActions: Move stonith_sbd (Started slesha1n2i-u -> slesha1n1i-u)
> > ...
> > Aug 1 12:00:02 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (11), giving up
> >
> > - Then I bring Node 2 online again and start the cluster service, checking logs:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 376: memb=1, new=0, lost=0
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: memb: slesha1n1i-u 168824371
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] CLM CONFIGURATION CHANGE
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] New Configuration:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(x.x.x.x)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Left:
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] Members Joined:
> > Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: ais_dispatch_message: Membership 376: quorum acquired
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CLM ] r(0) ip(y.y.y.y)
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: ais_dispatch_message: Membership 376: quorum acquired
> > Aug 1 12:31:13 slesha1n1i-u cib[8911]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 376: memb=2, new=1, lost=0
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node 168824372/slesha1n2i-u is now: member
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: NEW: slesha1n2i-u 168824372
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_peer_state: crm_update_ais_node: Node slesha1n2i-u[168824372] - state is now member (was lost)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n1i-u 168824371
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: peer_update_callback: Node return implies stonith of slesha1n2i-u (action 24) completed
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: pcmk_peer_update: MEMB: slesha1n2i-u 168824372
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: 0x69f2f0 Node 168824372 (slesha1n2i-u) born on: 376
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: crm_update_quorum: Updating quorum status to true (call=119)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [CPG ] chosen downlist: sender r(0) ip(x.x.x.x) ; members(old:1 left:0)
> > Aug 1 12:31:13 slesha1n1i-u corosync[8905]: [MAIN ] Completed service synchronization, ready to provide service.
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
> > Aug 1 12:31:13 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
> > Aug 1 12:31:14 slesha1n1i-u mgmtd: [8917]: info: CIB query: cib
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000151302 (1381122)
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000141302 (1315586)
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: update_member: Node slesha1n2i-u now has process list: 00000000000000000000000000101302 (1053442)
> > Aug 1 12:31:14 slesha1n1i-u corosync[8905]: [pcmk ] info: send_member_notification: Sending membership update 376 to 2 children
> > Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_HA_MESSAGE origin=route_message ]
> > Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
> > Aug 1 12:31:15 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
> >
> > - Cluster status changes to Online for both nodes, but the stonith resource won't start on any of the nodes.
> > - Trying to start the resource manually, but no success.
> > - Trying to restart the corosync process on Node 1 (rcopenais restart), but it hangs forever. Checking logs:
> > Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [SERV ] Unloading all Corosync service engines.
> > Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Shuting down Pacemaker
> > Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to mgmtd: [8917]
> > Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: mgmtd is shutting down
> > Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=1 cib_name=live
> > Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: info: final_crm: client_id=2 cib_name=live
> > Aug 1 12:42:08 slesha1n1i-u mgmtd: [8917]: debug: [mgmtd] stopped
> > Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: mgmtd confirmed stopped
> > Aug 1 12:42:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: stop_child: Sent -15 to crmd: [8916]
> > Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
> > Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_trigger_update: Sending flush op to all hosts for: shutdown (1375353728)
> > Aug 1 12:42:08 slesha1n1i-u attrd[8914]: notice: attrd_perform_update: Sent update 22: shutdown=1375353728
> > Aug 1 12:42:08 slesha1n1i-u crmd[8916]: notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (13), giving up
> > Aug 1 12:42:08 slesha1n1i-u crmd[8916]: warning: do_log: FSA: Input I_TE_SUCCESS from abort_transition_graph() received in state S_POLICY_ENGINE
> > Aug 1 12:42:38 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
> > Aug 1 12:43:08 slesha1n1i-u corosync[8905]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=8916, seq=6) to terminate...
> > ...
> > - Finally I kill the corosync process on Node 1 (killall -9 corosync), then corosync restarts.
> > - Checking status. All resources are up and running on Node 1, and the stonith resource is running on Node 2 again.
> >
> > I have tested the same scenario several times. Sometimes the fencing
> > mechanism works as expected, but other times the stonith resource is not
> > transferred to Node 1, as described here. So I need some assistance to
> > overcome this problem....
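A possible way to check, independently of Pacemaker, whether the fencing path over the shared disk works at all is to exchange messages with the sbd tool described on the SBD_Fencing page linked above; the device path below is a placeholder:

    # show the message slots and any pending messages on the shared device
    sbd -d /dev/<shared-SAN-partition> list
    # ask the sbd daemon on node 2 to reset that node; it should reboot promptly
    sbd -d /dev/<shared-SAN-partition> message slesha1n2i-u reset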
> >
> > --
> > Best regards
> > Jan
>
> --
> Best regards
> Jan Christian

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org