Hi, I am deploying Pacemaker + DRBD to provide highly available storage, and during failure testing I ran into a strange behaviour where the colocation constraint between the remaining resources and the cloned group appears to be simply ignored.
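For context, I created the ordering and colocation constraints with pcs commands along these lines (reconstructed from the constraint listing below, so the exact invocations may have differed slightly):

    pcs constraint order start DRBDData-clone then start nfs kind=Mandatory
    pcs constraint colocation add nfs with DRBDData-clone INFINITY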
These are the constraints I have:

Location Constraints:
Ordering Constraints:
  start DRBDData-clone then start nfs (kind:Mandatory)
Colocation Constraints:
  nfs with DRBDData-clone (score:INFINITY)
Ticket Constraints:

The environment: a two-node cluster with a remote quorum device. The test was to stop the quorum device and afterwards stop the node currently running all the services (node1). The expected behaviour is that the remaining node would not be able to do anything (partition WITHOUT quorum) until it regains quorum.

This is the output of pcs status on node2 after powering off the quorum device and node1. Some resources have been removed from the output to keep this email cleaner.

Cluster name: storage-drbd
Cluster Summary:
  * Stack: corosync
  * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition WITHOUT quorum
  * Last updated: Mon Apr 11 12:28:06 2022
  * Last change: Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Node node1: UNCLEAN (offline)
  * Online: [ node2 ]

Full List of Resources:
  * fence-node1 (stonith:fence_vmware_rest): Started node2
  * fence-node2 (stonith:fence_vmware_rest): Started node1 (UNCLEAN)
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * DRBDData (ocf::linbit:drbd): Master node1 (UNCLEAN)
    * Slaves: [ node2 ]
  * Resource Group: nfs:
    * vip_nfs (ocf::heartbeat:IPaddr2): Started node1 (UNCLEAN)
    * drbd_fs (ocf::heartbeat:Filesystem): Started node1 (UNCLEAN)
    * nfsd (ocf::heartbeat:nfsserver): Started node1 (UNCLEAN)

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

As expected, node2 is without quorum and waiting. The problem happened when I turned node1 back on. Quorum was re-established, but the DRBD master was promoted on node2 while the nfs group started on node1, even though I have both an ordering constraint and a colocation constraint to make the cloned resource and the NFS group run on the same node.

Cluster name: storage-drbd
Cluster Summary:
  * Stack: corosync
  * Current DC: node2 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
  * Last updated: Mon Apr 11 12:29:08 2022
  * Last change: Mon Apr 11 12:26:10 2022 by root via cibadmin on node2
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ node1 node2 ]

Full List of Resources:
  * fence-node1 (stonith:fence_vmware_rest): Started node2
  * fence-node2 (stonith:fence_vmware_rest): Started node1
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Masters: [ node2 ]
    * Slaves: [ node1 ]
  * Resource Group: nfs:
    * vip_nfs (ocf::heartbeat:IPaddr2): Started node1
    * drbd_fs (ocf::heartbeat:Filesystem): FAILED node1
    * nfsd (ocf::heartbeat:nfsserver): Stopped

Failed Resource Actions:
  * drbd_fs_start_0 on node1 'error' (1): call=90, status='complete', exitreason='Couldn't mount device [/dev/drbd0] as /exports/drbd0', last-rc-change='2022-04-11 12:29:05 -03:00', queued=0ms, exec=2567ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Can anyone explain to me why the constraints are being ignored?

Running AlmaLinux 8.5 + pcs-0.10.10-4.el8_5.1.alma.x86_64

Thanks!

Kind regards,
Salatiel

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/