[Pacemaker] When the ex-live server comes back online, it tries to failback causing a failure and restart in services

Michael Monette Thu, 16 Jan 2014 21:41:44 -0800

Hi,

I have 2 servers set up with Postgres, with /dev/drbd1 mounted at 
/var/lib/pgsql. I also have Pacemaker set up to fail over back and 
forth between the 2 nodes. It works really well for the most part.


I am having one problem, and it happens on all 4 of my clusters. If 
the "web_services" resource group is running on database-2.hehe.org and I do a 
hard reset on that node, the group fails over fine and within a few seconds the 
DB is running on database-1.hehe.org. I turn database-2 back on and everything 
is fine. It comes back online with no issue and everything continues to run 
normally on database-1. crm_mon shows no errors at all; the node simply goes 
into online status.

However, if I do a hard shutdown on database-1 (or any of my primary nodes: 
ldap-1, idp-1, acc-1), it fails over to database-2 just fine. But when 
database-1 comes back into online status, it seems like Pacemaker tries to move 
the resources back to database-1, fails, and then the services get restarted on 
database-2 because of the attempted move back.

Why is it that all of my 1st nodes try to take the resources back when 
they come back online, but none of the 2nd nodes do this? Is there any way to 
prevent this? Can Pacemaker not check whether the resources in the cluster 
are already running and, if so, just become an available node for the next 
time?

I tried setting resource stickiness to infinity. I have also tried starting 
the corosync/pacemaker service with the node in standby beforehand, and it's 
always the same thing. Once node-1 is online, all the services on node-2 get 
interrupted by the attempted failback, which fails (probably just because drbd 
is already in use on the other end).
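
For reference, settings of that kind would look roughly like this in crm 
configure syntax. This is only a sketch: the "rsc-options" id is an assumed 
placeholder, and neither line is part of the configuration pasted below.

```
# Assumed sketch: cluster-wide default stickiness, so that running
# resources prefer to stay on the node where they are currently active.
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY"
# Assumed sketch: node-1 kept in standby while it rejoins the cluster.
node database-1.hehe.org \
        attributes standby="on"
```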

Here is my config:

node database-1.hehe.org \
        attributes standby="off"
node database-2.hehe.org \
        attributes standby="off"
primitive drbd_data ocf:linbit:drbd \
        params drbd_resource="res1" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
primitive fs_data ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/var/lib/pgsql" fstype="ext4"
primitive httpd lsb:postgresql
primitive ip_httpd ocf:heartbeat:IPaddr2 \
        params ip="10.199.0.11"
group web_services fs_data ip_httpd httpd
ms ms_drbd_data drbd_data \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true"
colocation web_services_on_drbd inf: httpd ms_drbd_data:Master
order web_services_after_drbd inf: ms_drbd_data:promote web_services:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.10-14.el6_5.1-368c726" \
        cluster-infrastructure="classic openais (with plugin)" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1389926961"

Thanks


