On 24 Mar 2014, at 8:23 pm, Sergey A. Tachenov <stache...@runbox.com> wrote:
> At this point the second node finally realizes something is wrong there,
> fences the first node and takes over. After reboot, everything looks
> like it's working fine now. Needless to say, 1 hour 45 minutes is a bit
> too long for a recovery.
>
> Got any ideas where to look? Basically I'd like Pacemaker to detect
> whatever happened and migrate to another node before trying to monitor,
> restart or whatever else it tried to do with those resources.
>
> As far as I understand, Pacemaker is supposed to restart a service as
> soon as the monitor operation fails (provided that I didn't specify
> on-fail for the monitor action). Why didn't it try to restart any
> resources until 45 minutes later? I expected to see something like this:
>
> monitor fails -> restart fails -> STONITH

So would I.

At this point though I would suggest an upgrade:

1. Fedora 16 is EOL
2. This looks like an lrmd issue and the lrmd was rewritten for 1.1.9
3. http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/

Why not try CentOS which ships 1.1.10 via official channels?
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org