The basics: Dual-primary cman+pacemaker+drbd cluster running on RHEL6.2; spec
files and versions below.

Problem: If I restart both nodes at the same time, or even just start pacemaker
on both nodes at the same time, the drbd ms resource starts, but both nodes stay
in slave mode. They'll both stay in slave mode until one of the following 
occurs:

- I manually type "crm resource cleanup <ms-resource-name>"

- 15 minutes elapse, at which point the "PEngine Recheck Timer" fires and the
ms resources are promoted.
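
For context, the 15-minute figure matches the default of Pacemaker's
cluster-recheck-interval property, which drives the PEngine Recheck Timer. A
minimal sketch of shortening it, written as a crm configure line; the value of
2min is an assumption chosen for illustration, and this would only shorten the
wait rather than explain why the promotion is not triggered in the first place:

```
property cluster-recheck-interval="2min"
```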

The key resource definitions:

primitive AdminDrbd ocf:linbit:drbd \
        params drbd_resource="admin" \
        op monitor interval="59s" role="Master" timeout="30s" \
        op monitor interval="60s" role="Slave" timeout="30s" \
        op stop interval="0" timeout="100" \
        op start interval="0" timeout="240" \
        meta target-role="Master"
ms AdminClone AdminDrbd \
        meta master-max="2" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true" interleave="true"
# The lengthy definition of "FilesystemGroup" is in the crm pastebin below
clone FilesystemClone FilesystemGroup \
        meta interleave="true" target-role="Started"
colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
order Admin_Before_Filesystem inf: AdminClone:promote FilesystemClone:start

Note that I added the "target-role" options to try to solve the problem; they
had no effect.

When I look in /var/log/messages, I see no error messages and no indication of
why the promotion is delayed. The 'admin' drbd resource is reported as UpToDate
on both nodes. There are no error messages when I force the issue with:

crm resource cleanup AdminClone

It's as if pacemaker, at start, needs some kind of "kick" after the drbd
resource is ready to be promoted.
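
One way to check whether the promotion is being held back by a missing master
score is to look at the node attributes and the allocation scores while both
nodes are still in slave mode. A sketch, assuming the crm_mon and crm_simulate
shipped with this pacemaker version accept these options:

```
# One-shot cluster status, including node attributes such as master scores
crm_mon -1 -A
# Allocation and promotion scores computed from the live CIB
crm_simulate -s -L
```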

This is not just an abstract case for me. At my site, lengthy power outages that
bring down the cluster are not uncommon. Both systems will come up when power is
restored, and I need cluster services to be available shortly afterward, not 15
minutes later.

Any ideas?

Details:

# rpm -q kernel cman pacemaker drbd
kernel-2.6.32-220.4.1.el6.x86_64
cman-3.0.12.1-23.el6.x86_64
pacemaker-1.1.6-3.el6.x86_64
drbd-8.4.1-1.el6.x86_64

Output of crm_mon after two-node reboot or pacemaker restart:
<http://pastebin.com/jzrpCk3i>
cluster.conf: <http://pastebin.com/sJw4KBws>
"crm configure show": <http://pastebin.com/MgYCQ2JH>
"drbdadm dump all": <http://pastebin.com/NrY6bskk>
-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
