On 11/04/2016 01:57 PM, CART Andreas wrote: > Hi > > I have a basic 2 node active/passive cluster with Pacemaker (1.1.14 , > pcs: 0.9.148) / CMAN (3.0.12.1) / Corosync (1.4.7) on RHEL 6.8. > This cluster runs NFS on top of DRBD (8.4.4). > > Basically the system is working on both nodes and I can switch the > resources from one node to the other. > But switching resources to the other node does not work, if I try to > move just one resource and have the others follow due to the location > constraints. > > From the logged messages I see that in this “failure case” there is NO > attempt to demote/promote the DRBD clone resource. > > Here is my setup: > ================================================================== > Cluster Name: clst1 > Corosync Nodes: > ventsi-clst1-sync ventsi-clst2-sync > Pacemaker Nodes: > ventsi-clst1-sync ventsi-clst2-sync > > Resources: > Resource: IPaddrNFS (class=ocf provider=heartbeat type=IPaddr2) > Attributes: ip=xxx.xxx.xxx.xxx cidr_netmask=24 > Operations: start interval=0s timeout=20s (IPaddrNFS-start-interval-0s) > stop interval=0s timeout=20s (IPaddrNFS-stop-interval-0s) > monitor interval=5s (IPaddrNFS-monitor-interval-5s) > Resource: NFSServer (class=ocf provider=heartbeat type=nfsserver) > Attributes: nfs_shared_infodir=/var/lib/nfsserversettings/ > nfs_ip=xxx.xxx.xxx.xxx nfsd_args="-H xxx.xxx.xxx.xxx" > Operations: start interval=0s timeout=40 (NFSServer-start-interval-0s) > stop interval=0s timeout=20s (NFSServer-stop-interval-0s) > monitor interval=10s timeout=20s > (NFSServer-monitor-interval-10s) > Master: DRBDClone > Meta Attrs: master-max=1 master-node-max=1 clone-max=2 > clone-node-max=1 notify=true > Resource: DRBD (class=ocf provider=linbit type=drbd) > Attributes: drbd_resource=nfsdata > Operations: start interval=0s timeout=240 (DRBD-start-interval-0s) > promote interval=0s timeout=90 (DRBD-promote-interval-0s) > demote interval=0s timeout=90 (DRBD-demote-interval-0s) > stop interval=0s timeout=100 
(DRBD-stop-interval-0s) > monitor interval=1s timeout=5 (DRBD-monitor-interval-1s) > Resource: DRBD_global_clst (class=ocf provider=heartbeat type=Filesystem) > Attributes: device=/dev/drbd1 directory=/drbdmnts/global_clst fstype=ext4 > Operations: start interval=0s timeout=60 > (DRBD_global_clst-start-interval-0s) > stop interval=0s timeout=60 > (DRBD_global_clst-stop-interval-0s) > monitor interval=20 timeout=40 > (DRBD_global_clst-monitor-interval-20) > > Stonith Devices: > Resource: ipmi-fence-clst1 (class=stonith type=fence_ipmilan) > Attributes: lanplus=1 login=foo passwd=bar action=reboot > ipaddr=yyy.yyy.yyy.yyy pcmk_host_check=static-list > pcmk_host_list=ventsi-clst1-sync auth=password timeout=30 cipher=1 > Operations: monitor interval=60s (ipmi-fence-clst1-monitor-interval-60s) > Resource: ipmi-fence-clst2 (class=stonith type=fence_ipmilan) > Attributes: lanplus=1 login=foo passwd=bar action=reboot > ipaddr=zzz.zzz.zzz.zzz pcmk_host_check=static-list > pcmk_host_list=ventsi-clst2-sync auth=password timeout=30 cipher=1 > Operations: monitor interval=60s (ipmi-fence-clst2-monitor-interval-60s) > Fencing Levels: > > Location Constraints: > Resource: ipmi-fence-clst1 > Disabled on: ventsi-clst1-sync (score:-INFINITY) > (id:location-ipmi-fence-clst1-ventsi-clst1-sync--INFINITY) > Resource: ipmi-fence-clst2 > Disabled on: ventsi-clst2-sync (score:-INFINITY) > (id:location-ipmi-fence-clst2-ventsi-clst2-sync--INFINITY) > Ordering Constraints: > start IPaddrNFS then start NFSServer (kind:Mandatory) > (id:order-IPaddrNFS-NFSServer-mandatory) > promote DRBDClone then start DRBD_global_clst (kind:Mandatory) > (id:order-DRBDClone-DRBD_global_clst-mandatory) > start DRBD_global_clst then start IPaddrNFS (kind:Mandatory) > (id:order-DRBD_global_clst-IPaddrNFS-mandatory) > Colocation Constraints: > NFSServer with IPaddrNFS (score:INFINITY) > (id:colocation-NFSServer-IPaddrNFS-INFINITY) > DRBD_global_clst with DRBDClone (score:INFINITY) > 
(id:colocation-DRBD_global_clst-DRBDClone-INFINITY)

It took me a while to notice, and it's easily overlooked, but the above constraint is the problem. It says DRBD_global_clst must be located where DRBDClone is running, not necessarily where DRBDClone is master. Node 2 already runs a DRBDClone instance (as slave), so the cluster considers the colocation satisfied there and schedules the filesystem start without any demote/promote. The constraint should be created like this:

pcs constraint colocation add DRBD_global_clst with master DRBDClone

> IPaddrNFS with DRBD_global_clst (score:INFINITY) > (id:colocation-IPaddrNFS-DRBD_global_clst-INFINITY) > > Resources Defaults: > resource-stickiness: INFINITY > Operations Defaults: > timeout: 10s > > Cluster Properties: > cluster-infrastructure: cman > dc-version: 1.1.14-8.el6-70404b0 > have-watchdog: false > last-lrm-refresh: 1478277432 > no-quorum-policy: ignore > stonith-enabled: true > symmetric-cluster: true > ================================================================== > > Initial state is e.g. this (all resources at node1): > > Online: [ ventsi-clst1-sync ventsi-clst2-sync ] > > Full list of resources: > > ipmi-fence-clst1 (stonith:fence_ipmilan): Started > ventsi-clst2-sync > ipmi-fence-clst2 (stonith:fence_ipmilan): Started > ventsi-clst1-sync > IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync > NFSServer (ocf::heartbeat:nfsserver): Started ventsi-clst1-sync > Master/Slave Set: DRBDClone [DRBD] > Masters: [ ventsi-clst1-sync ] > Slaves: [ ventsi-clst2-sync ] > DRBD_global_clst (ocf::heartbeat:Filesystem): Started > ventsi-clst1-sync > ================================================================== > > If I shutdown the cluster at node 1 (‘pcs cluster stop’) or if I move > the DRBD clone resource (‘pcs resource move DRBDClone’) all resources > switch successfully to node2. > I.e. the demote/promote of the DRBD clone resource is working in these > cases. > > But if I try to move any other resource (e.g. 
‘pcs resource move > NFSServer’) the resources NFSServer, IPaddrNFS and DRBD_global_clst are > stopped at node 1, but then already follows starting of the > DRBD_global_clst resource at node2, which fails due to the missing > demote/promote. > As far as I can see there is some follow-up attempt to repair things > partially as the resources are started again at node1 exclusive the > resource which I moved due to my move command. > > Final state is like this: > > Online: [ ventsi-clst1-sync ventsi-clst2-sync ] > > Full list of resources: > > ipmi-fence-clst1 (stonith:fence_ipmilan): Started > ventsi-clst2-sync > ipmi-fence-clst2 (stonith:fence_ipmilan): Started > ventsi-clst1-sync > IPaddrNFS (ocf::heartbeat:IPaddr2): Started ventsi-clst1-sync > NFSServer (ocf::heartbeat:nfsserver): Stopped > Master/Slave Set: DRBDClone [DRBD] > Masters: [ ventsi-clst1-sync ] > Slaves: [ ventsi-clst2-sync ] > DRBD_global_clst (ocf::heartbeat:Filesystem): Started > ventsi-clst1-sync > > Failed Actions: > * DRBD_global_clst_start_0 on ventsi-clst2-sync 'unknown error' (1): > call=778, status=complete, exitreason='none', > last-rc-change='Fri Nov 4 19:32:56 2016', queued=0ms, exec=43ms > ================================================================== > > Here are the logged messages for this “failure case”: > > 2016-11-04T19:32:55.163982+01:00 ventsi-clst1 crmd[6116]: notice: > State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC > cause=C_FSA_INTERNAL origin=abort_transition_graph ] > 2016-11-04T19:32:55.168100+01:00 ventsi-clst1 pengine[6115]: notice: > On loss of CCM Quorum: Ignore > 2016-11-04T19:32:55.181252+01:00 ventsi-clst1 pengine[6115]: notice: > Move IPaddrNFS#011(Started ventsi-clst1-sync -> ventsi-clst2-sync) > 2016-11-04T19:32:55.181260+01:00 ventsi-clst1 pengine[6115]: notice: > Move NFSServer#011(Started ventsi-clst1-sync -> ventsi-clst2-sync) > 2016-11-04T19:32:55.181278+01:00 ventsi-clst1 pengine[6115]: notice: > Move DRBD_global_clst#011(Started 
ventsi-clst1-sync -> > ventsi-clst2-sync) <=== here no demote/promote is listed > 2016-11-04T19:32:55.182385+01:00 ventsi-clst1 pengine[6115]: notice: > Calculated Transition 202: /var/lib/pacemaker/pengine/pe-input-766.bz2 > 2016-11-04T19:32:55.182998+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 15: stop NFSServer_stop_0 on ventsi-clst1-sync (local) > 2016-11-04T19:32:55.196265+01:00 ventsi-clst1 > nfsserver(NFSServer)[15978]: INFO: Stopping NFS server ... > 2016-11-04T19:32:55.249137+01:00 ventsi-clst1 kernel: nfsd: last server > has exited, flushing export cache > 2016-11-04T19:32:55.252241+01:00 ventsi-clst1 rpc.mountd[15282]: Caught > signal 15, un-registering and exiting. > 2016-11-04T19:32:55.632708+01:00 ventsi-clst1 > nfsserver(NFSServer)[15978]: INFO: Stopping sm-notify > 2016-11-04T19:32:55.650552+01:00 ventsi-clst1 > nfsserver(NFSServer)[15978]: INFO: Stopping rpc.statd > 2016-11-04T19:32:55.666777+01:00 ventsi-clst1 rpc.statd[15243]: Caught > signal 15, un-registering and exiting > 2016-11-04T19:32:56.692819+01:00 ventsi-clst1 > nfsserver(NFSServer)[15978]: INFO: NFS server stopped > 2016-11-04T19:32:56.695523+01:00 ventsi-clst1 crmd[6116]: notice: > Operation NFSServer_stop_0: ok (node=ventsi-clst1-sync, call=1220, rc=0, > cib-update=1695, confirmed=true) > 2016-11-04T19:32:56.696243+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 12: stop IPaddrNFS_stop_0 on ventsi-clst1-sync (local) > 2016-11-04T19:32:56.727882+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16108]: > INFO: IP status = ok, IP_CIP= > 2016-11-04T19:32:56.733383+01:00 ventsi-clst1 crmd[6116]: notice: > Operation IPaddrNFS_stop_0: ok (node=ventsi-clst1-sync, call=1222, rc=0, > cib-update=1696, confirmed=true) > 2016-11-04T19:32:56.733917+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 48: stop DRBD_global_clst_stop_0 on ventsi-clst1-sync > (local) > 2016-11-04T19:32:56.757181+01:00 ventsi-clst1 > Filesystem(DRBD_global_clst)[16163]: INFO: Running stop for 
/dev/drbd1 > on /drbdmnts/global_clst > 2016-11-04T19:32:56.764684+01:00 ventsi-clst1 > Filesystem(DRBD_global_clst)[16163]: INFO: Trying to unmount > /drbdmnts/global_clst > 2016-11-04T19:32:56.771260+01:00 ventsi-clst1 > Filesystem(DRBD_global_clst)[16163]: INFO: unmounted > /drbdmnts/global_clst successfully > 2016-11-04T19:32:56.776640+01:00 ventsi-clst1 crmd[6116]: notice: > Operation DRBD_global_clst_stop_0: ok (node=ventsi-clst1-sync, > call=1224, rc=0, cib-update=1697, confirmed=true) > 2016-11-04T19:32:56.777140+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 49: start DRBD_global_clst_start_0 on > ventsi-clst2-sync <=== here is the attempt to start the filesystem at > the other node, although DRBD has not yet been promoted > 2016-11-04T19:32:56.840137+01:00 ventsi-clst1 crmd[6116]: warning: > Action 49 (DRBD_global_clst_start_0) on ventsi-clst2-sync failed > (target: 0 vs. rc: 1): Error > 2016-11-04T19:32:56.840158+01:00 ventsi-clst1 crmd[6116]: notice: > Transition aborted by DRBD_global_clst_start_0 'modify' on > ventsi-clst2-sync: Event failed > (magic=0:1;49:202:0:b7941532-c74b-40cc-a8ad-27b5502b8fba, cib=0.649.4, > source=match_graph_event:381, 0) > 2016-11-04T19:32:56.840232+01:00 ventsi-clst1 crmd[6116]: warning: > Action 49 (DRBD_global_clst_start_0) on ventsi-clst2-sync failed > (target: 0 vs. 
rc: 1): Error > 2016-11-04T19:32:56.840328+01:00 ventsi-clst1 crmd[6116]: notice: > Transition 202 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=5, > Source=/var/lib/pacemaker/pengine/pe-input-766.bz2): Complete > 2016-11-04T19:32:56.843693+01:00 ventsi-clst1 pengine[6115]: notice: > On loss of CCM Quorum: Ignore > 2016-11-04T19:32:56.844072+01:00 ventsi-clst1 pengine[6115]: warning: > Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: > unknown error (1) > 2016-11-04T19:32:56.844102+01:00 ventsi-clst1 pengine[6115]: warning: > Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: > unknown error (1) > 2016-11-04T19:32:56.845071+01:00 ventsi-clst1 pengine[6115]: notice: > Start IPaddrNFS#011(ventsi-clst2-sync) > 2016-11-04T19:32:56.845078+01:00 ventsi-clst1 pengine[6115]: notice: > Start NFSServer#011(ventsi-clst2-sync) > 2016-11-04T19:32:56.845081+01:00 ventsi-clst1 pengine[6115]: notice: > Demote DRBD:0#011(Master -> Slave ventsi-clst1-sync) <=== here there > would be the necessary demote/promote … but it’s too late; the start of > the filesystem already failed… > 2016-11-04T19:32:56.845083+01:00 ventsi-clst1 pengine[6115]: notice: > Promote DRBD:1#011(Slave -> Master ventsi-clst2-sync) > 2016-11-04T19:32:56.845084+01:00 ventsi-clst1 pengine[6115]: notice: > Recover DRBD_global_clst#011(Started ventsi-clst2-sync) > 2016-11-04T19:32:56.847986+01:00 ventsi-clst1 pengine[6115]: notice: > Calculated Transition 203: /var/lib/pacemaker/pengine/pe-input-767.bz2 > <=== … so the above transition gets caught by the following attempt to > repair things partially > 2016-11-04T19:32:56.867679+01:00 ventsi-clst1 pengine[6115]: notice: > On loss of CCM Quorum: Ignore > 2016-11-04T19:32:56.868074+01:00 ventsi-clst1 pengine[6115]: warning: > Processing failed op start for DRBD_global_clst on ventsi-clst2-sync: > unknown error (1) > 2016-11-04T19:32:56.868101+01:00 ventsi-clst1 pengine[6115]: warning: > Processing failed op start for 
DRBD_global_clst on ventsi-clst2-sync: > unknown error (1) > 2016-11-04T19:32:56.868287+01:00 ventsi-clst1 pengine[6115]: warning: > Forcing DRBD_global_clst away from ventsi-clst2-sync after 1000000 > failures (max=1000000) > 2016-11-04T19:32:56.869011+01:00 ventsi-clst1 pengine[6115]: notice: > Start IPaddrNFS#011(ventsi-clst1-sync) > 2016-11-04T19:32:56.869023+01:00 ventsi-clst1 pengine[6115]: notice: > Recover DRBD_global_clst#011(Started ventsi-clst2-sync -> ventsi-clst1-sync) > 2016-11-04T19:32:56.869770+01:00 ventsi-clst1 pengine[6115]: notice: > Calculated Transition 204: /var/lib/pacemaker/pengine/pe-input-768.bz2 > 2016-11-04T19:32:56.870065+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 3: stop DRBD_global_clst_stop_0 on ventsi-clst2-sync > 2016-11-04T19:32:56.908075+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 42: start DRBD_global_clst_start_0 on > ventsi-clst1-sync (local) > 2016-11-04T19:32:56.931072+01:00 ventsi-clst1 > Filesystem(DRBD_global_clst)[16242]: INFO: Running start for /dev/drbd1 > on /drbdmnts/global_clst > 2016-11-04T19:32:56.943250+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): > warning: maximal mount count reached, running e2fsck is recommended > 2016-11-04T19:32:56.953253+01:00 ventsi-clst1 kernel: EXT4-fs (drbd1): > mounted filesystem with ordered data mode. 
Opts: > 2016-11-04T19:32:56.964284+01:00 ventsi-clst1 crmd[6116]: notice: > Operation DRBD_global_clst_start_0: ok (node=ventsi-clst1-sync, > call=1225, rc=0, cib-update=1701, confirmed=true) > 2016-11-04T19:32:56.965104+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 10: start IPaddrNFS_start_0 on ventsi-clst1-sync (local) > 2016-11-04T19:32:56.965325+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 43: monitor DRBD_global_clst_monitor_20000 on > ventsi-clst1-sync (local) > 2016-11-04T19:32:56.996235+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: > INFO: Adding inet address xxx.xxx.xxx.xxx/24 with broadcast address > xxx.xxx.xxx.255 to device bond0 > 2016-11-04T19:32:57.002059+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: > INFO: Bringing device bond0 up > 2016-11-04T19:32:57.008128+01:00 ventsi-clst1 IPaddr2(IPaddrNFS)[16308]: > INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p > /var/run/resource-agents/send_arp-xxx.xxx.xxx.xxx bond0 xxx.xxx.xxx.xxx > auto not_used not_used > 2016-11-04T19:32:57.020159+01:00 ventsi-clst1 crmd[6116]: notice: > Operation IPaddrNFS_start_0: ok (node=ventsi-clst1-sync, call=1226, > rc=0, cib-update=1703, confirmed=true) > 2016-11-04T19:32:57.020901+01:00 ventsi-clst1 crmd[6116]: notice: > Initiating action 11: monitor IPaddrNFS_monitor_5000 on > ventsi-clst1-sync (local) > 2016-11-04T19:32:57.052231+01:00 ventsi-clst1 crmd[6116]: notice: > Transition 204 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, > Source=/var/lib/pacemaker/pengine/pe-input-768.bz2): Complete > 2016-11-04T19:32:57.052251+01:00 ventsi-clst1 crmd[6116]: notice: > State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS > cause=C_FSA_INTERNAL origin=notify_crmd ] > ================================================================== > > Any ideas what could be the reason for this behavior? > And how could this be fixed? 
> > > (I already found several articles on the internet with the > recommendation to have two separately configured monitor operations for > the DRBD resource configured one for the master role and another one for > the slave role. > Already tried this to no avail.) > > Regards > Andi
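
For reference, a minimal sketch of how the colocation fix suggested above could be applied on the existing cluster. The constraint id and resource names are taken from the configuration quoted in this thread. The `run` and `fix_colocation` helpers and the `DRY_RUN` switch are my own illustration, not part of pcs, and the final cleanup step is an assumption on my part (the logs show DRBD_global_clst being forced away from node 2 after the failed start). By default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: replace the role-less colocation with one that targets the
# master role of DRBDClone. DRY_RUN=1 (default) prints the commands instead
# of running them; set DRY_RUN=0 to execute against the cluster.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fix_colocation() {
    # Remove the constraint that only requires some DRBDClone instance.
    run pcs constraint remove colocation-DRBD_global_clst-DRBDClone-INFINITY
    # Re-create it so the filesystem must run where DRBDClone is master.
    run pcs constraint colocation add DRBD_global_clst with master DRBDClone INFINITY
    # Assumption: clear the recorded start failure so node 2 is eligible again.
    run pcs resource cleanup DRBD_global_clst
}

fix_colocation
```

After applying it, `pcs constraint --full` should list the colocation with the master role on DRBDClone. Note that a plain `pcs resource move NFSServer` also leaves a location constraint behind, which may need to be cleared separately before retesting.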