Hello again. We have not received any response so I thought I would send this once more. Can anyone provide any guidance for us? Is there a way to get more output from the STONITH daemon to help us determine why it might not issue the reset command to our plugin script?
Thanks much,
Joe

________________________________
From: Gruher, Joseph R
Sent: Monday, February 02, 2009 3:26 PM
To: General Linux-HA mailing list
Cc: Gruher, Joseph R; Liu, Zheng-yang
Subject: Help with STONITH Plugin

Hello-

We are developing our own STONITH plugin for our blade server and are having an issue we hope this list can help us with. Our STONITH plugin script works some of the time (the bad node is reset and the resources fail over) but fails some of the time (resources never fail over). We have been looking in the messages log (see examples below), and in the non-working case it appears that the reset call to our plugin never occurs, even though the getconfignames, status and gethosts calls that normally lead up to it are all made. Are there any common problems that could cause this behavior? Any suggestions for how we can continue to debug it? Other logs we should be looking in? Would it be useful to send our plugin or any other files from the system? Thanks very much for any and all input. We are testing with SLES10.2 x64.
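For reference, our plugin follows the usual dispatch shape of a Heartbeat external STONITH script, roughly like the sketch below (the hostnames and the reset body here are placeholders, not our real code, which talks to the chassis management module):

```shell
#!/bin/sh
# Rough sketch of an external STONITH plugin dispatch. stonithd invokes
# the script once per operation, passing the operation name as $1.
# plugin_cmd stands in for the script's top-level `case "$1" in ...`.
plugin_cmd() {
    cmd=$1
    host=$2
    hostlist="25node1 25node2 25node3"   # placeholder; really parsed from config

    case "$cmd" in
        getconfignames)
            echo "hostlist"              # config parameters the plugin expects
            return 0 ;;
        gethosts)
            echo "$hostlist"             # nodes this device can fence
            return 0 ;;
        status)
            return 0 ;;                  # 0 = fencing device reachable
        reset|off|on)
            # Real device call (chassis power command) would go here.
            echo "power op $cmd on $host"
            return 0 ;;
        *)
            return 1 ;;                  # unknown operation
    esac
}
```

As we understand it, stonithd exercises getconfignames, status and gethosts when the resource starts, and only issues reset when a fencing operation is actually scheduled — which is the step that never appears in our failing log below.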
-Joe

When it works:

Jan 28 11:33:09 25node2 crmd: [6325]: info: do_lrm_rsc_op: Performing op=STONITH_v1.1_start_0 key=16:47:676b4bb5-523b-49c5-a5f3-a228f2af8149)
Jan 28 11:33:09 25node2 lrmd: [6322]: info: rsc:STONITH_v1.1: start
Jan 28 11:33:09 25node2 lrmd: [14675]: info: Try to start STONITH resource <rsc_id=STONITH_v1.1> : Device=external/MFSYS_STONITH_PLUGIN_v1.1
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: getconfignames (node2slot=; slot=)
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: exiting script with an rc=0
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: status (node2slot=25node1=1,25node2=2,25node3=3; slot=)
Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority
Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, total_quorum_votes=300
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: no mbr_track info
Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, member_quorum_votes=200
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: instance=16, nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, total_quorum_votes=300
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: no mbr_track info
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: LOST: 25node1
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node2
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: instance=16, nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node3
Jan 28 11:33:09 25node2 crmd: [6325]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=16)
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: NEW MEMBERSHIP: trans=16, nodes=2, new=0, lost=1 n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: CURRENT: 25node2 [nodeid=1, born=2]
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: CURRENT: 25node3 [nodeid=2, born=3]
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: LOST: 25node1 [nodeid=0, born=15]
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping WebServer_monitor_0 (rc=1) on 25node2 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping WebServer_monitor_0 (rc=1) on 25node3 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping WebServer_monitor_0 (rc=1) on 25node2 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping WebServer_monitor_0 (rc=1) on 25node3 to an ERROR
Jan 28 11:33:10 25node2 haclient: on_event: from message queue: evt:cib_changed
Jan 28 11:33:11 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: exiting script with an rc=0
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: gethosts (node2slot=25node1=1,25node2=2,25node3=3; slot=)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: exiting script with an rc=0
Jan 28 11:33:12 25node2 crmd: [6325]: info: process_lrm_event: LRM operation STONITH_v1.1_start_0 (call=24, rc=0) complete
Jan 28 11:33:12 25node2 tengine: [6860]: info: match_graph_event: Action STONITH_v1.1_start_0 (16) confirmed on 25node2 (rc=0)
Jan 28 11:33:12 25node2 tengine: [6860]: info: te_pseudo_action: Pseudo action 17 fired and confirmed
Jan 28 11:33:12 25node2 tengine: [6860]: info: te_fence_node: Executing reboot fencing operation (18) on 25node1 (timeout=50000)
Jan 28 11:33:12 25node2 haclient: on_event:evt:cib_changed
Jan 28 11:33:12 25node2 stonithd: [6323]: info: client tengine [pid: 6860] want a STONITH operation RESET to node 25node1.
Jan 28 11:33:12 25node2 stonithd: [6323]: info: stonith_operate_locally::2368: sending fencing op (RESET) for 25node1 to device external (rsc_id=STONITH_v1.1, pid=14713)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: reset 25node1 (node2slot=25node1=1,25node2=2,25node3=3; slot=1)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: retry=0 bladeState=-1 powerstate=-32
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: exiting script with an rc=0

When it fails:

Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing op=CBSTONITH_monitor_0 key=4:8:469db639-f549-4582-9735-2b5e89d147c2)
Jan 30 15:54:27 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: monitor
Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM operation CBSTONITH_monitor_0 (call=20, rc=7) complete
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: write_cib_contents: Wrote version 0.1175.3 of the CIB to disk (digest: 73c66f86b10cd0b1bc58a3beab870faa)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:28 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing op=CBSTONITH_start_0 key=19:8:469db639-f549-4582-9735-2b5e89d147c2)
Jan 30 15:54:28 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: start
Jan 30 15:54:28 vs-cb03-5cl lrmd: [30768]: info: Try to start STONITH resource <rsc_id=CBSTONITH> : Device=external/MFSYS_STONITH_PLUGIN_v1.1
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: write_cib_contents: Wrote version 0.1175.5 of the CIB to disk (digest: f2c35efbf0c7066043202d754cf607bb)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: getconfignames (node2slot=; slot=)
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: exiting script with an rc=0
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: status (node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: write_cib_contents: Wrote version 0.1175.7 of the CIB to disk (digest: 551519e2dfb361ccf0f4303167997435)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: /var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: exiting script with an rc=0
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: gethosts (node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=)
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: JOE02: hostlist = vs-cb03-2cl vs-cb03-5cl vs-cb03-6cl
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: exiting script with an rc=0
Jan 30 15:54:31 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM operation CBSTONITH_start_0 (call=21, rc=0) complete

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems