[Pacemaker] Pacemaker unnecessarily (?) restarts a vm on active node when other node brought out of standby

Ian Mon, 12 May 2014 13:13:29 -0700

Hi,

First message here and pretty new to this, so apologies if this is the wrong place/approach for this question. I'm struggling to describe this problem, so searching for previous answers is tricky. Hopefully someone can give me a pointer...


Brief summary:
--------------

Situation is: dual node cluster (CentOS 6.5), running drbd+gfs2 to provide active/active filestore for a libvirt domain (vm).

With both nodes up, all is fine: active/active filesystem available on both nodes, vm running on node 1.


Place node 2 into standby, vm is unaffected. Good.

Bring node 2 back online, pacemaker chooses to stop the vm and gfs on node 1 while it promotes drbd to master on node 2. Bad (not very "HA"!)
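For anyone wanting to reproduce the sequence above, it can be sketched roughly as follows. This is only an illustration under assumptions: `standby_cycle` is a made-up helper name, sv07 is assumed to be "node 2", and the `PCS` variable exists purely so the commands can be printed instead of run.

```shell
# Hypothetical helper: put a node into standby, then bring it back online.
# Assumption: "node 2" is sv07. Set PCS="echo pcs" to print the commands
# instead of running them against a live cluster.
standby_cycle() {
    node="${1:-sv07}"
    pcs_cmd="${PCS:-pcs}"

    # Step 1: place the node into standby (the vm on the other node is unaffected).
    $pcs_cmd cluster standby "$node"

    # Step 2: bring the node back online (this is where the restart is observed).
    $pcs_cmd cluster unstandby "$node"
}
```

A dry run would be `PCS="echo pcs" standby_cycle sv07`, which only prints the two pcs commands.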

Hopefully I've just got a constraint missing/wrong (or the whole structure!). I know there is a constraint linking the promotion of the drbd resource to the starting of the gfs2 filesystem, but I wouldn't expect this to trigger on node 1 in the above scenario as it's already promoted?



Versions:
---------

Linux sv07 2.6.32-431.11.2.el6.centos.plus.x86_64 #1 SMP Tue Mar 25 21:36:54 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


pacemaker-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-cli-1.1.10-14.el6_5.2.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.2.x86_64
pacemaker-1.1.10-14.el6_5.2.x86_64


Configuration (abridged):
-------------------------

(I can provide full configs/logs if it isn't obvious and anyone cares to dig deeper)


res_drbd_vm1 is the drbd resource

vm_storage_core_dev is a group containing the drbd resources (just res_drbd_vm1 in this config)

vm_storage_core_dev-master is the master/slave resource for drbd

res_fs_vm1 is the gfs2 filesystem resource

vm_storage_core is a group containing the gfs2 resources (just res_fs_vm1 in this config)

vm_storage_core-clone is the clone resource to get the gfs2 filesystem active on both nodes


res_vm_nfs_server is the libvirt domain (vm)

(NB: the nfs filestore this server is sharing isn't from the gfs2 filesystem, but another drbd volume that is always active/passive)



# pcs resource show --full
 Master: vm_storage_core_dev-master
  Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Group: vm_storage_core_dev
   Resource: res_drbd_vm1 (class=ocf provider=linbit type=drbd)
    Attributes: drbd_resource=vm1
    Operations: monitor interval=60s (res_drbd_vm1-monitor-interval-60s)

 Clone: vm_storage_core-clone
  Group: vm_storage_core
   Resource: res_fs_vm1 (class=ocf provider=heartbeat type=Filesystem)
    Attributes: device=/dev/drbd/by-res/vm1 directory=/data/vm1 fstype=gfs2 options=noatime,nodiratime
    Operations: monitor interval=60s (res_fs_vm1-monitor-interval-60s)

 Resource: res_vm_nfs_server (class=ocf provider=heartbeat type=VirtualDomain)
  Attributes: config=/etc/libvirt/qemu/vm09.xml
  Operations: monitor interval=60s (res_vm_nfs_server-monitor-interval-60s)



# pcs constraint show

Ordering Constraints:
  promote vm_storage_core_dev-master then start vm_storage_core-clone
  start vm_storage_core-clone then start res_vm_nfs_server

Colocation Constraints:
  vm_storage_core-clone with vm_storage_core_dev-master (rsc-role:Started) (with-rsc-role:Master)
  res_vm_nfs_server with vm_storage_core-clone


/var/log/messages on node 2 around the event:
---------------------------------------------

May 12 19:23:02 sv07 attrd[3156]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_vm1 (10000)
May 12 19:23:02 sv07 attrd[3156]: notice: attrd_perform_update: Sent update 1067: master-res_drbd_vm1=10000
May 12 19:23:02 sv07 crmd[3158]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
May 12 19:23:02 sv07 attrd[3156]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-res_drbd_live (10000)
May 12 19:23:02 sv07 attrd[3156]: notice: attrd_perform_update: Sent update 1070: master-res_drbd_live=10000
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Promote res_drbd_vm1:0#011(Slave -> Master sv07)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_fs_vm1:0#011(Started sv06)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Start res_fs_vm1:1#011(sv07)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_vm_nfs_server#011(Started sv06)
May 12 19:23:02 sv07 pengine[3157]: notice: process_pe_message: Calculated Transition 140: /var/lib/pacemaker/pengine/pe-input-855.bz2
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Promote res_drbd_vm1:0#011(Slave -> Master sv07)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_fs_vm1:0#011(Started sv06)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Start res_fs_vm1:1#011(sv07)
May 12 19:23:02 sv07 pengine[3157]: notice: LogActions: Restart res_vm_nfs_server#011(Started sv06)
May 12 19:23:02 sv07 pengine[3157]: notice: process_pe_message: Calculated Transition 141: /var/lib/pacemaker/pengine/pe-input-856.bz2
May 12 19:23:02 sv07 crmd[3158]: notice: te_rsc_command: Initiating action 4: cancel res_drbd_vm1_cancel_60000 on sv07 (local)
May 12 19:23:02 sv07 crmd[3158]: notice: te_rsc_command: Initiating action 250: stop res_vm_nfs_server_stop_0 on sv06
May 12 19:23:02 sv07 crmd[3158]: notice: te_rsc_command: Initiating action 264: notify res_drbd_vm1_pre_notify_promote_0 on sv07 (local)
May 12 19:23:02 sv07 crmd[3158]: notice: te_rsc_command: Initiating action 266: notify res_drbd_vm1_pre_notify_promote_0 on sv06
May 12 19:23:02 sv07 crmd[3158]: notice: process_lrm_event: LRM operation res_drbd_vm1_notify_0 (call=1539, rc=0, cib-update=0, confirmed=true) ok
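
To pull just the policy engine's planned actions out of a log like the one above, a small filter along these lines should do. It is only a sketch: `planned_actions` is a made-up helper name, and it assumes the syslog format shown here, where rsyslog writes the tab character as `#011`.

```shell
# Hypothetical helper: read syslog lines on stdin and print each distinct
# action the policy engine planned, one per line.
planned_actions() {
    # 1. keep only the pengine LogActions lines
    # 2. strip everything up to and including the "LogActions: " tag
    # 3. turn rsyslog's escaped tab (#011) back into a space
    # 4. collapse repeats (the same plan is logged for transitions 140 and 141)
    grep 'LogActions:' \
        | sed -e 's/^.*LogActions: *//' -e 's/#011/ /g' \
        | sort -u
}
```

Usage would be `planned_actions < /var/log/messages`. The saved inputs named in the log (for example /var/lib/pacemaker/pengine/pe-input-855.bz2) can probably also be replayed with `crm_simulate -S -x <file>` to see why the restart was scheduled, though I have not included that output here.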



Thanks for looking,
Ian.


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
