Hi,

On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> I've recently had exactly the same thing happen. One (highly kludgey!)
> solution I've considered is hacking a custom version of the stonith
> IPMI agent that would, following a stonith failure, check whether the
> node is at all reachable via any of the cluster interfaces reported by
> cl_status (I have redundant network links), and then return true (i.e.
> pretend the stonith succeeded) if it isn't. Since this is basically the
> logic I would use if I were trying to debug the issue remotely, I don't
> see that this would be any worse.
>
> Besides the obvious (potential, however unlikely, for split-brain), is
> there any reason this approach wouldn't work?

That sounds like a good enough reason to me :) If you can't reach the
host, you cannot know its state.

Thanks,

Dejan
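For what it's worth, a rough sketch of the wrapper Dan describes might
look like the following. It is only a sketch: the script name, the
hard-coded address list and the ping timeouts are invented here, a real
version would derive the peer's addresses from the redundant links
cl_status reports, and it carries exactly the split-brain caveat Dan
already points out.

    #!/bin/sh
    # fake-safe-ipmi: wrapper around the stock external/ipmi plugin.
    REAL=/usr/lib64/stonith/plugins/external/ipmi

    # Site-specific (hypothetical here): one address per redundant
    # cluster link on the peer.
    PEER_ADDRS="10.0.0.3 192.168.1.3"

    case "$1" in
    reset|off)
        "$REAL" "$@" && exit 0
        # The real stonith failed. Only claim success if the peer is
        # unreachable on *every* link; if any link answers, the node
        # may still be alive and fencing must be reported as failed.
        for addr in $PEER_ADDRS; do
            ping -c 2 -w 3 "$addr" >/dev/null 2>&1 && exit 1
        done
        exit 0  # dark on all links: pretend the reset succeeded
        ;;
    *)
        # gethosts, status and the other plugin queries pass through.
        exec "$REAL" "$@"
        ;;
    esac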
> On Tue, 15 Jun 2010 19:55:49 +0200
> Bernd Schubert <bs_li...@aakef.fastmail.fm> wrote:
>
> > Hello Diane,
> >
> > The problem is that Pacemaker is not allowed to take over resources
> > until stonith succeeds, as it simply does not know the state of the
> > other server. Let's assume the other node were still up and running,
> > had a shared storage device mounted and were writing to it, but no
> > longer responded on the network. If Pacemaker now mounted this
> > device again, you would get data corruption. To protect you against
> > that, it requires that stonith succeed, or that you resolve the
> > problem manually.
> >
> > The only automatic solution would be a more reliable stonith device,
> > e.g. IPMI with an extra power supply for the IPMI card, or a PDU.
> >
> > Cheers,
> > Bernd
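As a concrete illustration of Bernd's suggestion, a second fencing
device on independent power can be configured next to the IPMI one, so
that stonithd has something to fall back to when IPMI is dead (the
"gonna try the next local device" message in the logs below is exactly
that fallback being attempted). All names, addresses and credentials
below are invented, and the PDU plugin and its parameters are only
illustrative; `stonith -t <plugin> -n` should list the parameters a
given plugin really takes.

    # Primary fencing device for qpr3 via IPMI.
    crm configure primitive st-ipmi-qpr3 stonith:external/ipmi \
            params hostname=qpr3 ipaddr=10.0.0.103 \
                   userid=admin passwd=secret interface=lan

    # Fallback via an independently powered PDU (illustrative).
    crm configure primitive st-pdu-qpr3 stonith:external/rackpdu \
            params hostlist=qpr3 pduip=10.0.0.200 community=private

    # A node must never run its own fencing device.
    crm configure location st-ipmi-qpr3-loc st-ipmi-qpr3 -inf: qpr3
    crm configure location st-pdu-qpr3-loc st-pdu-qpr3 -inf: qpr3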
> > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > Thanks for the idea. Is there any way to automatically recover
> > > resources without manual intervention?
> > >
> > > Diane
> > >
> > > -----Original Message-----
> > > From: Bernd Schubert [mailto:bs_li...@aakef.fastmail.fm]
> > > Sent: Tuesday, June 15, 2010 1:39 PM
> > > To: pacemaker@oss.clusterlabs.org
> > > Cc: Schaefer, Diane E
> > > Subject: Re: [Pacemaker] abrupt power failure problem
> > >
> > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > Hi,
> > > > We are having trouble with our two-node cluster after one node
> > > > experiences an abrupt power failure. The resources do not seem
> > > > to start on the remaining node (i.e. DRBD resources do not
> > > > promote to master). In the log we notice:
> > > >
> > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > > >
> > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > Jan  8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > >
> > > Without looking at your hb_report, this already looks pretty
> > > clear: this node tries to reset the other node via IPMI, and that
> > > fails, of course, because the node to be reset is powered off.
> > > When we had that problem in the past, we simply removed the failed
> > > node from the Pacemaker configuration temporarily:
> > >
> > >   crm node remove <node-name>
> > >
> > > Cheers,
> > > Bernd

> --
> Dan Urist
> dur...@ucar.edu
> 303-497-2459
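For reference, the manual recovery Bernd describes would look roughly
like this on the surviving node. The node names come from the logs
above; the usual caveat applies, i.e. be certain the dead node really
is powered off before overriding the protection stonith provides.

    # On qpr4, after verifying that qpr3 is truly powered off:
    crm node remove qpr3   # drop the unreachable node from the config
    crm_mon -1             # one-shot status; resources should now start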