Hello again.  We have not received any responses, so I thought I would send this 
once more.  Can anyone offer us any guidance?  Is there a way to get more 
output from the STONITH daemon to help us determine why it might not issue the 
reset command to our plugin script?
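
In case it helps frame the question: is raising heartbeat's debug level the 
right knob here, or is there something stonithd-specific we should turn on?  
What we were thinking of trying (just a guess on our part, on our SLES10.2 
heartbeat 2.x systems):

    # in /etc/ha.d/ha.cf
    debug 1
    debugfile /var/log/ha-debug

and also exercising the plugin by hand with the stonith command-line tool, 
along the lines of the following (we are not sure of the exact syntax for 
passing the node2slot/slot parameters to an external plugin on the command 
line):

    stonith -t external/MFSYS_STONITH_PLUGIN_v1.1 -T reset 25node1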

Thanks much,
Joe

________________________________
From: Gruher, Joseph R
Sent: Monday, February 02, 2009 3:26 PM
To: General Linux-HA mailing list
Cc: Gruher, Joseph R; Liu, Zheng-yang
Subject: Help with STONITH Plugin

Hello-

We are developing our own STONITH plugin for our blade server and are running 
into an issue we hope this list can help us with.  Our STONITH plugin script 
works some of the time (the bad node is reset and the resources fail over) but 
fails some of the time (the resources never fail over).  We have been looking 
in the messages log (see examples below), and in the non-working case it 
appears that the reset call to our plugin never occurs, even though the 
getconfignames, status, and gethosts calls that normally lead up to it are 
made.  Are there any common problems that could cause this behavior?  Any 
suggestions on how we can continue to debug it?  Are there other logs we 
should be looking in?  Would it be useful to send our plugin or any other 
files from the system?
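
For context, the script follows the usual external/ STONITH plugin calling 
convention as we understand it: heartbeat invokes it with the operation name 
as the first argument and passes the configured parameters (node2slot, slot) 
as environment variables.  A stripped-down sketch of that shape, not our 
actual script:

    #!/bin/sh
    # Simplified sketch of an external STONITH plugin, for illustration only.
    # node2slot looks like "host1=1,host2=2,..."; for reset/on/off the target
    # node is the second argument.
    case "$1" in
        getconfignames)
            echo "node2slot slot"
            exit 0 ;;
        gethosts)
            # list the hosts this device can fence, derived from node2slot
            echo "$node2slot" | tr ',' '\n' | cut -d= -f1
            exit 0 ;;
        status)
            # rc=0 means the fencing device itself is reachable
            exit 0 ;;
        reset|off|on)
            # the real script maps "$2" to a blade slot and asks the chassis
            # management module to power-cycle it here
            exit 0 ;;
        getinfo-devid|getinfo-devname|getinfo-devdescr|getinfo-devurl)
            echo "MFSYS blade STONITH plugin"
            exit 0 ;;
        *)
            exit 1 ;;
    esac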

Thanks very much for any and all input.  We are testing with SLES10.2 x64.

-Joe


When it works:

Jan 28 11:33:09 25node2 crmd: [6325]: info: do_lrm_rsc_op: Performing 
op=STONITH_v1.1_start_0 key=16:47:676b4bb5-523b-49c5-a5f3-a228f2af8149)
Jan 28 11:33:09 25node2 lrmd: [6322]: info: rsc:STONITH_v1.1: start
Jan 28 11:33:09 25node2 lrmd: [14675]: info: Try to start STONITH resource 
<rsc_id=STONITH_v1.1> : Device=external/MFSYS_STONITH_PLUGIN_v1.1
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: getconfignames 
(node2slot=; slot=)
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14676]: exiting script with 
an rc=0
Jan 28 11:33:09 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: status 
(node2slot=25node1=1,25node2=2,25node3=3; slot=)
Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority
Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, 
member_quorum_votes=200
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, 
total_quorum_votes=300
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: no mbr_track info
Jan 28 11:33:09 25node2 ccm: [6320]: debug: quorum plugin: majority
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan 28 11:33:09 25node2 ccm: [6320]: debug: cluster:linux-ha, member_count=2, 
member_quorum_votes=200
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event 
OC_EV_MS_INVALID from ccm
Jan 28 11:33:09 25node2 cib: [6321]: info: mem_handle_event: instance=16, 
nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 ccm: [6320]: debug: total_node_count=3, 
total_quorum_votes=300
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: no mbr_track info
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: LOST: 25node1
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: Got an event 
OC_EV_MS_NEW_MEMBERSHIP from ccm
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node2
Jan 28 11:33:09 25node2 crmd: [6325]: info: mem_handle_event: instance=16, 
nodes=2, new=0, lost=1, n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 cib: [6321]: info: cib_ccm_msg_callback: PEER: 25node3
Jan 28 11:33:09 25node2 crmd: [6325]: info: crmd_ccm_msg_callback: Quorum 
(re)attained after event=NEW MEMBERSHIP (id=16)
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail: NEW MEMBERSHIP: 
trans=16, nodes=2, new=0, lost=1 n_idx=0, new_idx=2, old_idx=5
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail:     CURRENT: 
25node2 [nodeid=1, born=2]
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail:     CURRENT: 
25node3 [nodeid=2, born=3]
Jan 28 11:33:09 25node2 crmd: [6325]: info: ccm_event_detail:     LOST:    
25node1 [nodeid=0, born=15]
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping 
WebServer_monitor_0 (rc=1) on 25node2 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping 
WebServer_monitor_0 (rc=1) on 25node3 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping 
WebServer_monitor_0 (rc=1) on 25node2 to an ERROR
Jan 28 11:33:09 25node2 mgmtd: [6326]: ERROR: unpack_rsc_op: Remapping 
WebServer_monitor_0 (rc=1) on 25node3 to an ERROR
Jan 28 11:33:10 25node2 haclient: on_event: from message queue: evt:cib_changed
Jan 28 11:33:11 25node2 MFSYS_STONITH_PLUGIN_v1.1[14687]: exiting script with 
an rc=0
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: gethosts 
(node2slot=25node1=1,25node2=2,25node3=3; slot=)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14703]: exiting script with 
an rc=0
Jan 28 11:33:12 25node2 crmd: [6325]: info: process_lrm_event: LRM operation 
STONITH_v1.1_start_0 (call=24, rc=0) complete
Jan 28 11:33:12 25node2 tengine: [6860]: info: match_graph_event: Action 
STONITH_v1.1_start_0 (16) confirmed on 25node2 (rc=0)
Jan 28 11:33:12 25node2 tengine: [6860]: info: te_pseudo_action: Pseudo action 
17 fired and confirmed
Jan 28 11:33:12 25node2 tengine: [6860]: info: te_fence_node: Executing reboot 
fencing operation (18) on 25node1 (timeout=50000)
Jan 28 11:33:12 25node2 haclient: on_event:evt:cib_changed
Jan 28 11:33:12 25node2 stonithd: [6323]: info: client tengine [pid: 6860] want 
a STONITH operation RESET to node 25node1.
Jan 28 11:33:12 25node2 stonithd: [6323]: info: stonith_operate_locally::2368: 
sending fencing op (RESET) for 25node1 to device external (rsc_id=STONITH_v1.1, 
pid=14713)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: reset 25node1 
(node2slot=25node1=1,25node2=2,25node3=3; slot=1)
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: retry=0 bladeState=-1 
powerstate=-32
Jan 28 11:33:12 25node2 MFSYS_STONITH_PLUGIN_v1.1[14714]: exiting script with 
an rc=0


When it fails:

Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing 
op=CBSTONITH_monitor_0 key=4:8:469db639-f549-4582-9735-2b5e89d147c2)
Jan 30 15:54:27 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: monitor
Jan 30 15:54:27 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM 
operation CBSTONITH_monitor_0 (call=20, rc=7) complete
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: write_cib_contents: Wrote 
version 0.1175.3 of the CIB to disk (digest: 73c66f86b10cd0b1bc58a3beab870faa)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:27 vs-cb03-5cl cib: [30766]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:28 vs-cb03-5cl crmd: [5116]: info: do_lrm_rsc_op: Performing 
op=CBSTONITH_start_0 key=19:8:469db639-f549-4582-9735-2b5e89d147c2)
Jan 30 15:54:28 vs-cb03-5cl lrmd: [5113]: info: rsc:CBSTONITH: start
Jan 30 15:54:28 vs-cb03-5cl lrmd: [30768]: info: Try to start STONITH resource 
<rsc_id=CBSTONITH> : Device=external/MFSYS_STONITH_PLUGIN_v1.1
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: write_cib_contents: Wrote 
version 0.1175.5 of the CIB to disk (digest: f2c35efbf0c7066043202d754cf607bb)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:28 vs-cb03-5cl cib: [30767]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: getconfignames 
(node2slot=; slot=)
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30769]: exiting script 
with an rc=0
Jan 30 15:54:29 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: status 
(node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: write_cib_contents: Wrote 
version 0.1175.7 of the CIB to disk (digest: 551519e2dfb361ccf0f4303167997435)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml (digest: 
/var/lib/heartbeat/crm/cib.xml.sig)
Jan 30 15:54:29 vs-cb03-5cl cib: [30795]: info: retrieveCib: Reading cluster 
configuration from: /var/lib/heartbeat/crm/cib.xml.last (digest: 
/var/lib/heartbeat/crm/cib.xml.sig.last)
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30780]: exiting script 
with an rc=0
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: gethosts 
(node2slot=vs-cb03-2cl=2,vs-cb03-5cl=5,vs-cb03-6cl=6; slot=)
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: JOE02: hostlist = 
vs-cb03-2cl vs-cb03-5cl vs-cb03-6cl
Jan 30 15:54:31 vs-cb03-5cl MFSYS_STONITH_PLUGIN_v1.1[30797]: exiting script 
with an rc=0
Jan 30 15:54:31 vs-cb03-5cl crmd: [5116]: info: process_lrm_event: LRM 
operation CBSTONITH_start_0 (call=21, rc=0) complete
