Re: [Linux-HA] Server becomes unresponsive after node failure

Lars Ellenberg Tue, 08 Mar 2011 12:07:56 -0800

On Tue, Mar 08, 2011 at 05:43:17PM +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Tue, Mar 08, 2011 at 05:32:44PM +0100, Sascha Hagedorn wrote:
> > Hi Dejan,
> > 
> > thank you for your answer. I added an external/ssh stonith resource
> > to test this and it resolved the problem. It wasn't clear to me that
> > the stonith resource does more than shooting the other node.
> > Apparently some cluster parameters are being set too, so the system
> > stays clean. During the test my understanding was when I cut the
> > power of one node I don't need a stonith device to shoot it.
> 
> Hmm, I wonder how external/ssh could've solved this particular
> issue, since if you pull the plug it will never be able to fence
> that node.


Oh, that's easy.  external/ssh pings the victim, and if it does not
answer, which will be the case for a down node as well as a down link,
stonith is considered to have been successful ;-)

In the "node down" case, this will allow the cluster to proceed,
and all is well.

But in the "link down" case, this will allow the cluster to proceed,
even though the victim will continue to run its services, causing
cluster split brain and data corruption.

That's why:

> You really need a usable stonith device. external/ssh
> is for testing only.
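
To make the two cases concrete, here is a minimal sketch in Python.
It is not the external/ssh plugin's code, only a toy model of the
decision described above; the Victim class and all function names are
made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Victim:
    node_up: bool  # is the machine actually still running?
    link_up: bool  # can the surviving node reach it over the network?


def answers_ping(victim: Victim) -> bool:
    # A ping reply needs both a running node and a working link.
    return victim.node_up and victim.link_up


def ping_based_fence_succeeded(victim: Victim) -> bool:
    # Hypothetical model of the check: no reply is taken to mean
    # "the node has been shot".
    return not answers_ping(victim)


def split_brain_risk(victim: Victim) -> bool:
    # Fencing is reported as successful, yet the victim keeps running.
    return ping_based_fence_succeeded(victim) and victim.node_up


for name, victim in [
    ("node down", Victim(node_up=False, link_up=True)),
    ("link down", Victim(node_up=True, link_up=False)),
]:
    print(name, ping_based_fence_succeeded(victim), split_brain_risk(victim))
```

Both cases report a successful fence, but only in the "link down" case
is the victim still alive. The check cannot tell the two apart, which
is exactly the problem.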

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
