On Thu, 2009-05-07 at 12:42 +0000, Lars Marowsky-Bree wrote:
> On 2009-05-06T11:45:36, darren.mans...@opengi.co.uk wrote:
> 
> > Thanks to Lars and Dominik for your help. I have read up in the
> > SLE-11 HA PDF (an excellent document) and I understand a lot more of it
> > now.
> > 
> > The crm shell is awesome. I had at first discounted it because I was
> > used to using the cibadmin tool. But now that I can see its power, I
> > don't use anything else.
> > 
> > My cluster is working exactly as I want now.
> 
> Great! That's good news.
> 
> > I'm still not 100% sold on the value of STONITH or fencing in the real
> > world but I'm off to do more reading up on it.
> 
> Well, one of the values is that if you call Novell support, they'll
> actually listen to you instead of saying "No STONITH? Gee, that's too
> bad, come back with a supported configuration" ;-)
> 

Novell support listen?! Who knew? ;-)

> Joking (even if true) aside:
> 
> In case a resource fails to stop on a migration for example, the cluster
> is blocked and cannot continue. If you have STONITH configured, this
> will be cleared up by fencing the node, which implies the resource is
> stopped and thus can continue.
> 
> In theory, resource agents aren't ever allowed to fail the 'stop' op.
> But it can happen, if the service is truly broken, the RA has a bug, the
> kernel is confused, the disk has gone haywire and blocks, ... So this
> error scenario cannot be recovered in software, and if you don't have
> STONITH, this can bring down your cluster.
> 
> 
> Further, in the case of a node failure, STONITH is used to ensure the
> failed nodes/minority partition is truly dead before starting services
> within the quorate partition.
> 
> You may say "But I'm using drbd so what do I care, it'll just resync",
> and that would be mostly true, but:
> 
> 1. In theory, the replication link could still be up even though
> OpenAIS/Pacemaker think the node is dead. This could cause the dreaded
> dual access to the same shared storage.
> 
> 2. The dying nodes might still hold a connection to client processes.
> They might continue writing locally, and _confirming the writes to the
> clients_. These would be overwritten on resync. While the image would be
> consistent afterwards, you have just lost transactional integrity, which
> is generally considered a bad thing.
> 
> 3. Somewhat obscure: the dying node might continue writing locally
> (consider a haywire process), which increases the amount of data in need
> of resyncing.
> 
> 4. STONITH'ing the failed node might allow it to recover from transient
> errors, bring up the replication again, and reduce the time in degraded
> mode w/o redundancy.
> 
> 
> So in general, fencing/STONITH is a really really good idea.
> 

So it seems. I was only thinking of it from the network connectivity
point of view, where, if the cluster loses quorum, the nodes can try to
kill each other through a remote switch. My reasoning was that if they
can't see each other then it's highly likely that nothing else can
either, so why would I need to fence one of them? I didn't consider the
other potential reasons for fencing a misbehaving node, such as the
examples you've provided above.
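
For the archives, the sort of config I had in mind for the "kill each
other through a remote switch" part looks roughly like the crm shell
sketch below. It's untested, and the external/ipmi plugin, node names,
addresses and credentials are only placeholders for whatever fencing
hardware is actually available:

  # Illustrative values only -- substitute your real fencing plugin/params.
  # One fencing device per node; the location constraints stop each
  # device from running on the node it is supposed to shoot.
  primitive st-node1 stonith:external/ipmi \
          params hostname=node1 ipaddr=192.168.0.101 userid=admin passwd=secret \
          op monitor interval=60s
  primitive st-node2 stonith:external/ipmi \
          params hostname=node2 ipaddr=192.168.0.102 userid=admin passwd=secret \
          op monitor interval=60s
  location l-st-node1 st-node1 -inf: node1
  location l-st-node2 st-node2 -inf: node2
  # Without this the cluster won't fence at all.
  property stonith-enabled=true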

I think the best way for me to think of it is, instead of

"We can't see each other, quick, let's make sure only one of us is still
here in case others can."

it's more along the lines of a quote from Robert De Niro in Ronin:

"Whenever there is any doubt, there is no doubt", i.e. if you're unsure,
then make sure, which is a good stance for an HA cluster to take.
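
And for the "stop fails and the cluster is blocked" case you describe,
my reading of the docs is that it comes down to on-fail=fence on the
stop operation (which I gather is already the default once STONITH is
enabled) plus the quorum policy. Again untested, and the apache
primitive is just an arbitrary example:

  # If a stop ever fails, escalate to fencing the node instead of
  # leaving the cluster blocked.
  primitive p_web ocf:heartbeat:apache \
          op stop interval=0 timeout=60s on-fail=fence
  # Stop resources in a partition that has lost quorum (the default).
  property no-quorum-policy=stop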

> 
> Regards,
>     Lars
> 
> 
Thanks again.
-- 
Darren Mansell <darren.mans...@opengi.co.uk>