Hi,

On Thu, May 14, 2009 at 06:32:00PM +1000, Tim Serong wrote:
> Greetings,
>
> I've written up a brief document entitled "STONITH Deathmatch Explained
> (and Some Hints for Resource Agent Authors and Systems Engineers)":
>
> http://ourobengr.com/ha
>
> It's a description of causes of STONITH deathmatch in
> Heartbeat/Pacemaker HA clusters, where two nodes continually shoot each
> other, thus rendering the system less available than a non-HA system
> would be.
>
> Hopefully publishing this will save at least a few people from some of
> the pain myself and a couple of others experienced last year, in
> particular when trying to debug resource agents that were misbehaving in
> unexpected ways.
>
> Comments, feedback, etc. welcome.

Great document! A very funny illustration too :)

Just a few remarks:

- in "Causes ..." you forgot to mention split-brain (no
  communication channels working) and, at the same time, to
  stress how important it is to have redundant communications :)

- even though you mention that in the title, I'd still move the
  resource agent intricacies into another document; it is all
  very valuable advice, but of no concern to cluster
  administrators; it's also good to keep the focus on our little
  problem; then you'll have to find other "Things You Didn't
  Think Of" (or just keep the title and leave the section empty:
  it is important; or insert another illustration)

- devote more space/thought to the part on how to avoid a
  "deathmatch"; there's only a mention of chkconfig within
  "Debugging ..." (or one can also use the "poweroff" fencing
  operation); also, note that this occurs only in cases where a
  reboot doesn't fix the problem (e.g. split-brain)

Thanks,

Dejan

> Thanks,
>
> Tim
>
>
> _______________________________________________
> Pacemaker mailing list
> Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker