On Fri, Feb 24, 2012 at 3:08 AM, David Gubler <d...@doodle.com> wrote:
> Hi Jake,
>
> Thanks for your answer. I had another go today.
>
>
> On 22.02.2012 00:09, Jake Smith wrote:
>>
>> Still probably not the nicest/cleanest solution but you could do a cronjob
>> that runs 'crm resource reprobe node_name'. That will check for resources
>> the cluster didn't start and prevent the cleanup actions.
>
>
> Unfortunately that doesn't work, if the last error was a monitor timeout.

It should. Please file a bug with a hb_report tarball on bugs.clusterlabs.org.

> Oddly enough I have to do "crm resource cleanup apacheClone" - not "apache"

This doesn't work because there is no actual resource called "apache".
Granted, we could be smarter and work it out. Patch, anyone?

> - to fix the state of the apache resource, even though the monitor is part
> of the apache resource, not the clone. If I try both variants with reprobe,
> nothing happens.
>
> By the way, if I stop apache (/etc/init.d/apache2 stop), wait until
> Pacemaker notices, and start it again, then Pacemaker also notices that
> apache is back and moves the IPs accordingly!
>
> Why does it matter to pacemaker whether the service is shut down normally
> vs. a monitor timeout?
>
>
>> what about an 'on-fail' in the op monitor section - probably with an
>> =ignore?
>> More on that one here:
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-resource-operations.html
>
>
> That doesn't help - Pacemaker sometimes (it's not deterministic and often
> only happens on one of the two nodes) still stops and starts apache.
>
> Even after reading the documentation several times, I still barely get what
> on-fail=something is supposed to do. When I set e.g. "on-fail=ignore" on the
> apache primitive, it has no apparent effects (ditto for restart) - Pacemaker
> acts exactly as if that option were not set. Which kind of makes sense:
>
> "The default for the stop operation is fence when STONITH is enabled and
> block otherwise. All other operations default to stop."
>
> Thus, "ignore" equals "stop", and "stop" equals "block" (since I don't have
> STONITH). So what good is "ignore", if it's just another way of saying
> "block"?

No, ignore means "pretend it never happened", so in the case of a
monitor failure it means "pretend that everything is still happily
running".
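
For reference, on-fail is an attribute of an individual operation, so in
crm shell syntax it goes on the "op monitor" line of the primitive. A
minimal sketch, assuming the ocf:heartbeat:apache agent and made-up
interval/timeout values; only the names apache and apacheClone come from
this thread:

```
# Sketch only: the agent, intervals and timeouts are assumptions,
# not the configuration discussed in this thread.
primitive apache ocf:heartbeat:apache \
    op monitor interval="30s" timeout="20s" on-fail="ignore" \
    op stop interval="0" timeout="60s"
clone apacheClone apache
```

With on-fail="ignore" on the monitor op, a failed or timed-out monitor is
treated as if the resource were still running. The stop op is left at its
default here, which per the documentation quoted above is block when
STONITH is not enabled.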

>
> So I *suppose* what I'm seeing is that my failed apache resource gets into
> the blocked state, and since "blocked" means "don't do anything with that
> resource", no surprise it doesn't recover automatically. But I still have
> no clue as to how I should do this instead...

I've missed the backstory, but the only way it should be able to get
into a blocked state is if the stop action fails/times out and stonith
is inactive, or if you've specifically set on-fail=block for an op.
To which the solution is "make sure stop succeeds" or "don't do that".

>
> Thanks,
>
>
> David
>
> --
> David Gubler
> Senior Software & Operations Engineer
> MeetMe: http://doodle.com/david
> E-Mail: d...@doodle.com
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org