Re: [Pacemaker] Pacemaker cluster took almost 2 hours to migrate

Andrew Beekhof Thu, 03 Apr 2014 19:31:08 -0700

On 24 Mar 2014, at 8:23 pm, Sergey A. Tachenov <stache...@runbox.com> wrote:


> At this point the second node finally realizes something is wrong there, 
> fences the first node and takes over. After reboot, everything looks 
> like it's working fine now. Needless to say, 1 hour 45 minutes is a bit 
> too long for a recovery.
> 
> Got any ideas where to look? Basically I'd like Pacemaker to detect 
> whatever happened and migrate to another node before trying to monitor, 
> restart or whatever else it tried to do with those resources.
> 
> As far as I understand, Pacemaker is supposed to restart a service as 
> soon as the monitor operation fails (provided that I didn't specify 
> on-fail for the monitor action). Why didn't it try to restart any 
> resources until 45 minutes later? I expected to see something like this:
> 
> monitor fails -> restart fails -> STONITH

So would I.
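For reference, the escalation expected in the quoted message can be modeled with a short Python sketch. It assumes `stonith-enabled=true`, no explicit `on-fail` on any operation, and `start-failure-is-fatal` left at its default. Under those assumptions, the documented defaults are: a failed monitor triggers a restart, a failed stop is escalated to fencing, and a failed start moves the resource elsewhere. This is a simplified model, not Pacemaker's code. `OpResult` and `recovery_actions` are hypothetical names.

```python
from dataclasses import dataclass

# Simplified, hypothetical model of default failure handling; NOT Pacemaker's code.
# Assumes stonith-enabled=true, no explicit on-fail, start-failure-is-fatal=true.

@dataclass
class OpResult:
    op: str   # "monitor", "stop" or "start"
    ok: bool  # True if the operation succeeded

def recovery_actions(results):
    """Return the recovery steps taken for a sequence of operation outcomes."""
    actions = []
    for r in results:
        if r.ok:
            continue
        if r.op == "monitor":
            # Default on-fail for monitor is "restart" (stop, then start).
            actions.append("restart resource")
        elif r.op == "stop":
            # A failed stop defaults to on-fail=fence when STONITH is enabled.
            actions.append("fence node")
            break
        elif r.op == "start":
            # start-failure-is-fatal=true sets the fail count to INFINITY,
            # so the resource is moved to another node.
            actions.append("move resource to another node")
            break
    return actions

print(recovery_actions([OpResult("monitor", False), OpResult("stop", False)]))
```

In this model, "restart fails -> STONITH" corresponds specifically to the stop half of the restart failing. A failed start instead leads to relocation.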

At this point, though, I would suggest an upgrade:

1. Fedora 16 is EOL
2. This looks like an lrmd issue, and the lrmd was rewritten for 1.1.9
3. Potential for data corruption in 1.1.6 through 1.1.9:
   http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/
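The points above can be checked against a given installation by comparing its version with 1.1.9. The Python sketch below is illustrative only. `parse_version` and `upgrade_notes` are hypothetical helpers. It assumes you pass the upstream version string, e.g. `1.1.10`, optionally followed by a `-release` suffix.

```python
def parse_version(text):
    """Turn '1.1.10' or '1.1.10-14.el6' into a comparable tuple like (1, 1, 10)."""
    core = text.strip().split("-", 1)[0]  # drop any distro release suffix
    return tuple(int(part) for part in core.split(".")[:3])

def upgrade_notes(version_text):
    """List which of the upgrade reasons above apply to this (hypothetical) version."""
    v = parse_version(version_text)
    notes = []
    if v < (1, 1, 9):
        notes.append("predates the lrmd rewrite in 1.1.9")
    if (1, 1, 6) <= v <= (1, 1, 9):
        notes.append("in the 1.1.6 through 1.1.9 data-corruption range")
    return notes

print(upgrade_notes("1.1.8"))
```

Tuple comparison keeps `1.1.10` correctly ordered after `1.1.9`, which a plain string comparison would get wrong.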

Why not try CentOS, which ships 1.1.10 via official channels?


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
