Re: [Linux-ha-dev] Re: [Linux-HA] Recovering from unexpected bad things - is STONITH the answer?

2007-11-07 Thread Peter R. Badovinatz

Alan Robertson wrote:

Kevin Tomlinson wrote:

On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:

We now have the ComponentFail test in CTS.  Thanks Lars for getting 
it going!


And, in the process, it's showing up some kinds of problems that we 
hadn't been looking for before.  A couple examples of such problems 
can be found here:


http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

The question that comes up is this:

For problems that should never happen, like the death of one of our
core/key processes, is an immediate reboot of the machine the right
recovery technique?


[snip]


Here's the issue:

The solution as I see it is to do one of:

a) reboot the node and clear the problem with certainty


I'm well aware that saying "yeah, we did it that way" isn't Good Form,
but, well, we did it that way (we also did 'd', see below).  I'm
referring to existing proprietary HA products that I was involved in
designing and implementing.


What we found was that certain processes were indistinguishable from the
node itself, and their failure was therefore nearly impossible to deal
with cleanly.  The problem was, as described here, that the OS and
applications/services would continue on, but the other nodes in the
cluster would see it as a node failure and take recovery actions (as you
describe here).


This was admin-controllable, as we did offer something else...



b) continue on and risk damaging your disks.

c) write some new code to recover from specific cases more
   gracefully and then test it thoroughly.

d) try and figure out how to propagate the failure to the
   top layer of the cluster, and hope you get the notice
   there soon enough so that it can freeze the cluster
   before the code reacts to the apparent failure
   and begins to try and recover from it.


Our architecture was that our clients were in a 'process group' across 
the cluster (group services) where each was connected to our server 
process via a unix domain socket on the same node.  Across the cluster 
these process group peers were assumed to be controlling resources and 
they mediated recovery actions through group services.


The unix-domain socket breaking was (well, still is) a defined condition
that the client was told to handle as our process dying.  Clients were
told in this case to clean up immediately and to assume that their peers
would treat the local process as gone (as though the node had died) and
would be doing takeovers.
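
By way of illustration (a made-up sketch, not our actual client library;
cleanup_local_resources() is a hypothetical application callback), the
client side of that contract amounted to little more than this:

  /* Watch the unix-domain socket to the local server and treat EOF or a
   * read error as "the server process on this node has died". */
  #include <errno.h>
  #include <poll.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  extern void cleanup_local_resources(void);   /* hypothetical callback */

  static void watch_server_socket(int sock_fd)
  {
      struct pollfd pfd = { .fd = sock_fd, .events = POLLIN };
      char buf[512];

      for (;;) {
          if (poll(&pfd, 1, -1) < 0) {
              if (errno == EINTR)
                  continue;
              break;                 /* unexpected poll error: bail out */
          }
          ssize_t n = read(sock_fd, buf, sizeof(buf));
          if (n > 0)
              continue;              /* normal protocol traffic */
          /* n == 0 (EOF) or n < 0: the local server is gone.  Clean up
           * right away and assume the peers will treat this node as dead
           * and start their takeovers. */
          fprintf(stderr, "server socket broke: assuming local death\n");
          cleanup_local_resources();
          exit(1);
      }
  }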


This was meant to allow some hope of avoiding taking the node down for
real, depending on the application space.  The intent was that the local
client would notice the death immediately, while the remote peers would
take some time (e.g., waiting for heartbeats to go missing) to notice.


In cases where the node wasn't taken down, our processes would be
automatically restarted via inittab or similar, and we'd reintegrate into
the cluster.  The local client process(es) would be expected to reconnect
and rejoin their group(s).  Our interface manual described all of this.


If for some reason our processes couldn't restart (or inittab gave up
because of too many retries), that node would stay out of the cluster.
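
For what it's worth, the "inittab or similar" part was no more exotic
than a respawn entry of roughly this shape (the tag and path below are
invented for the example):

  # hypothetical /etc/inittab entry: init restarts the daemon when it exits
  hasrv:2345:respawn:/usr/sbin/ha_groupd

SysV init itself throttles entries that die too quickly ("respawning too
fast"), which is the "inittab gave up because of too many retries" case.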




In the current code, sometimes you'll get behavior (a) and sometimes 
you'll get behavior (b) and sometimes you'll get behavior (c).


In the particular case described by bug 1762, the failure to reboot the
node did indeed result in the same resource being started twice.  In a
cluster where you have shared disk (like yours, for example), that would
probably trash the filesystem.  Not a good plan unless you're tired of
your current job ;-).  I'd like to take most/all of the cases where you
might get behavior (b) and cause them to use behavior (a).


If writing correct code and testing it were free, then (c) would 
obviously be the right choice.


Quite honestly, I don't know how to do (d) in a reliable way at all. 
It's much more difficult than it sounds.  Among other reasons, it relies 
on the components you're telling to freeze things to work correctly. 
Since resource freezes happen at the top level of the system, and the 
top layers need all the layers under them to work correctly, getting 
this right seems to be the kind of approach you could make into your 
life's work - and still never get it right.


You're right.  In the scheme I described above we (group services) 
simply washed our hands of what our clients (layers above) were able to 
do and get right...  We didn't write those.  We offered this as the only 
thing we could think of, with the hope that some clients could do things 
correctly.  It assumes that the OS, for example, is still working so the 
client can take dependable actions.


And if they weren't confident, they could enable node rebooting in this
case and let recovery happen 'normally'.




Case (c) has to be handled on a case by case basis, where you write and
test the code for a particular failure case.  IMHO the only feasible
_general_ answer is (a).

[Linux-ha-dev] Re: [Linux-HA] Recovering from unexpected bad things - is STONITH the answer?

2007-11-07 Thread Yan Fitterer


Alan Robertson wrote:
 Yan Fitterer wrote:
 Not always. The case I have encountered (live) doesn't relate to HB
 component failure per se, but is nevertheless destructive.

 With an eDirectory load (and, IMHO, other database-backed software with
 large or lazily flushed write buffers would be similarly affected), a
 hard reset of a node has a high likelihood of corrupting the database.
 This is in some cases no less destructive than allowing concurrent
 access to, say, an ext3 filesystem...
 
 If your software cannot withstand a crash, then it cannot be made
 highly-available - end of story.  Crashes will happen.  Be prepared.

This is a fine argument from an engineering perspective, but not much
use from a sysadmin POV. Heartbeat should (can and does!) help with any
kind of software. I'm simply pointing out that (for less-than-perfect
software, amongst other reasons) the fewer potential STONITH (hard
reset) cases we have, the better. :) Anything to avoid STONITH (in
particular when a node isn't quite dead from the workload perspective).

 I have been pondering for a while the possibility of using some
 disk-based heartbeat to block STONITH in cases where the STONITH target
 is still writing its disk heartbeat. This would prevent data damage in
 such cases.

 In addition, I have been thinking of complementing this mechanism with a
 disk-based STONITH (otherwise known as a poison pill...) so that the
 unreachable node may (if things aren't too badly broken) take its
 resources down and stop the disk heartbeat, which would then allow the
 rest of the cluster to consider it to have left the cluster safely and
 migrate the resources.

 Not quite sure how much of a fundamental change this would be though...
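
 A rough sketch of the idea (the block layout and names are invented for
 illustration; none of this is existing heartbeat code):

   /* One fixed block on the shared disk per node. */
   #include <stdint.h>

   #define SLOT_SIZE 512                 /* one sector per node            */

   struct disk_hb_slot {
       uint64_t seq;                     /* bumped by the owner every tick */
       uint32_t status;                  /* OWNER_ALIVE or OWNER_LEFT      */
       uint32_t poison;                  /* set by a peer: "please go away"*/
       uint8_t  pad[SLOT_SIZE - 16];
   };

   enum { OWNER_ALIVE = 1, OWNER_LEFT = 2 };

   /* Peer side: refuse to fire STONITH while the target's disk heartbeat
    * is still advancing, or after it has announced a clean departure.
    * prev/cur stand for two reads of the slot, some deadtime apart. */
   int stonith_allowed(const struct disk_hb_slot *prev,
                       const struct disk_hb_slot *cur)
   {
       if (cur->status == OWNER_LEFT)
           return 0;                     /* it took its resources down     */
       if (cur->seq != prev->seq)
           return 0;                     /* still writing its heartbeat    */
       return 1;                         /* genuinely silent: shooting ok  */
   }

   /* Owner side, once per tick: honour a poison pill if a peer set one.
    * Returns nonzero when the caller should stop its resources and then
    * stop writing the disk heartbeat. */
   int owner_tick(struct disk_hb_slot *mine)
   {
       if (mine->poison) {
           mine->status = OWNER_LEFT;
           return 1;
       }
       mine->seq++;
       mine->status = OWNER_ALIVE;
       return 0;
   }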
 
 My suggestion for this would be to implement a full communications
 plugin module that sends packets through disk areas.  If you do this
 right, then the communications will remain fully up for all purposes.
 We've had people start this effort in the past, but it's never been
 finished and all the bugs driven out AFAIK.

Agreed. Since I can't make much headway with my other approach(es)...
(and since, having thought about it, they're certainly very much inferior
to full disk-based comms)... I happen to have a little time on my hands
this month, and an itch to do some hacking.
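
Roughly what I picture, purely as a strawman (this is not heartbeat's
actual media-plugin API, just the general shape of a comms medium backed
by per-node disk areas):

  /* Strawman interface: the usual open / send / receive / close
   * operations, with send writing into this node's disk slot and
   * receive polling the other nodes' slots. */
  struct disk_comm_ops {
      void *(*open_medium)(const char *shared_device);
      int   (*send_pkt)(void *handle, const void *pkt, int len);
      int   (*recv_pkt)(void *handle, void *buf, int maxlen);
      void  (*close_medium)(void *handle);
  };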

Can anybody point me to the patch(es) with whatever code we have around
this?

Is anybody else coding on this right now?

Thanks
Yan


[Linux-ha-dev] Re: [Linux-HA] Recovering from unexpected bad things - is STONITH the answer?

2007-11-06 Thread Alan Robertson

Kevin Tomlinson wrote:

On Tue, 2007-11-06 at 10:25 -0700, Alan Robertson wrote:

We now have the ComponentFail test in CTS.  Thanks Lars for getting it 
going!


And, in the process, it's showing up some kinds of problems that we 
hadn't been looking for before.  A couple examples of such problems can 
be found here:


http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1762
http://old.linux-foundation.org/developer_bugzilla/show_bug.cgi?id=1732

The question that comes up is this:

For problems that should never happen, like the death of one of our
core/key processes, is an immediate reboot of the machine the right
recovery technique?


The advantages of such a choice include:
  It is fast
  It will invoke recovery paths that we exercise a lot in testing
  It is MUCH simpler than trying to recover from all these cases,
   therefore almost certainly more reliable

The disadvantages of such a choice include:
  It is crude, and very annoying
  It probably shouldn't be invoked for single-node clusters (?)
  It could be criticized as being lazy
  It shouldn't be invoked if there is another simple and correct method
  Continual rebooting becomes a possibility...

We do not have a policy of doing this throughout the project; what we
have is a few places where we do it.


I propose that we should consider making a uniform policy decision for
the project - and specifically decide to use ungraceful reboots as our
recovery method for key processes dying (for example: CCM, heartbeat,
CIB, CRM).  It should work for those cases where people don't configure
in watchdogs or explicitly define any STONITH devices, and also
independently of quorum policies - because, AFAIK, it seems like the
right choice and there's no technical reason not to do so.
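
To make that concrete, what I have in mind is roughly the following (a
sketch only, not code that exists in the tree; the supervisor and path
are invented, and reboot(2) stands in for whatever self-fencing
mechanism is actually available):

  /* Supervise one critical child (CCM, CIB, CRM, ...) and treat its
   * death as a "should never happen" event: self-fence by rebooting. */
  #include <stdio.h>
  #include <sys/reboot.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static void supervise_critical(const char *path)
  {
      pid_t pid = fork();

      if (pid == 0) {                   /* child: become the key process  */
          execl(path, path, (char *)NULL);
          _exit(127);                   /* exec failed                    */
      }
      if (pid < 0) {
          perror("fork");
          return;
      }
      /* parent: any exit of the key process triggers an immediate,
       * ungraceful reboot rather than an attempt to limp along. */
      int status;
      waitpid(pid, &status, 0);
      fprintf(stderr, "key process %s died (status 0x%x): self-fencing\n",
              path, status);
      sync();                           /* flush what we safely can       */
      reboot(RB_AUTOBOOT);              /* needs root; deliberately crude */
  }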


My inclination is to think that this is a good approach to take for 
problems that in our best-guess judgment shouldn't happen.



I'm bringing this to both lists, so that we can hear comments both from
developers and users.


Comments please...




I would say the right thing would depend on your cluster
implementation and what is considered the right thing to do for the
applications that the cluster is monitoring.
I would propose that this action should be administrator-configurable.

From a user point of view, with the cluster that we are implementing we
would expect any cluster failure (internal) to either get itself back
up and running or just send out an alert ("Help me, I'm not working...")
as we would want our applications to continue running on the nodes.
** We don't want a service outage just because the cluster is no longer
monitoring our applications. **
We would expect to get a 24x7 call-out (Sev1), then log on to the
cluster and see what was happening (configured alerting).
Our applications only want a service outage if the node itself has
issues, not the cluster.


Here's the issue:

The solution as I see it is to do one of:

a) reboot the node and clear the problem with certainty

b) continue on and risk damaging your disks.

c) write some new code to recover from specific cases more
   gracefully and then test it thoroughly.

d) try and figure out how to propagate the failure to the
   top layer of the cluster, and hope you get the notice
   there soon enough so that it can freeze the cluster
   before the code reacts to the apparent failure
   and begins to try and recover from it.

In the current code, sometimes you'll get behavior (a) and sometimes 
you'll get behavior (b) and sometimes you'll get behavior (c).


In the particular case described by bug 1762, the failure to reboot the
node did indeed result in the same resource being started twice.  In a
cluster where you have shared disk (like yours, for example), that would
probably trash the filesystem.  Not a good plan unless you're tired of
your current job ;-).  I'd like to take most/all of the cases where you
might get behavior (b) and cause them to use behavior (a).


If writing correct code and testing it were free, then (c) would 
obviously be the right choice.


Quite honestly, I don't know how to do (d) in a reliable way at all. 
It's much more difficult than it sounds.  Among other reasons, it relies 
on the components you're telling to freeze things to work correctly. 
Since resource freezes happen at the top level of the system, and the 
top layers need all the layers under them to work correctly, getting 
this right seems to be the kind of approach you could make into your 
life's work - and still never get it right.


Case (c) has to be handled on a case by case basis, where you write and 
test the code for a particular failure case.  IMHO the only feasible 
_general_ answer is (a).


There are an infinite number of things that can go wrong.  So, having a 
reliable and general strategy to deal with the WTF's of the world is a 
good thing.  Of course, for those cases where we have a (c) behavior 
would not