Hi all,

We have a use case where we want to place a node into standby and then wait for all the resources to move off the node (and be started on other nodes) before continuing.
In order to do this we call:

    $ pcs cluster standby brilxvm45
    $ crm_resource --wait --timeout 300

This works most of the time, but in one of our test environments we are hitting a problem.

When we put the node in standby, the reported cluster transition is:

    $ /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S
    Using the original execution date of: 2017-10-08 *16:58:05Z*
    ...
    Transition Summary:
     * Restart sv_fencer (Started brilxvm43)
     * Stop    sv.svtest.aa.sv.monitor:1 (brilxvm45)
     * Move    sv.svtest.aa.26.partition (Started brilxvm45 -> brilxvm43)
     * Move    sv.svtest.aa.27.partition (Started brilxvm45 -> brilxvm44)
     * Move    sv.svtest.aa.28.partition (Started brilxvm45 -> brilxvm43)

We expect crm_resource --wait to return once sv_fencer (a fencing device) has been restarted (not sure why it's being restarted), and the 3 partition resources have been moved.

But crm_resource actually times out after 300 seconds with the following error:

    Pending actions:
        Action 40: sv_fencer_monitor_60000 on brilxvm44
        Action 39: sv_fencer_start_0 on brilxvm44
        Action 38: sv_fencer_stop_0 on brilxvm43
    Error performing operation: Timer expired

It looks like it's waiting for the sv_fencer fencing agent to start on brilxvm44, even though the current transition did not include that move.

After the crm_resource --wait has timed out, we set a property on a different node (brilxvm43). This seems to trigger a new transition to move sv_fencer to brilxvm44:

    $ /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
    Using the original execution date of: 2017-10-08 *17:03:27Z*
    Transition Summary:
     * Move sv_fencer (Started brilxvm43 -> brilxvm44)

And from the corosync.log it looks like this transition triggers actions 38 - 40 (the ones crm_resource --wait waited for).

So it looks like the crm_resource --wait knows about the transition to move the sv_fencer resource, but the subsequent setting of the node property is the one that actually triggers it (which is too late as it gets executed after the wait).
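The standby-then-wait sequence described above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name standby_and_wait and its return codes are hypothetical, and the sketch assumes that crm_resource exits non-zero when the wait times out (as the "Timer expired" error suggests). Only the two commands already shown in this mail are used.

```shell
#!/bin/sh
# Sketch only: standby_and_wait is a hypothetical helper name,
# not something provided by pcs or Pacemaker.
standby_and_wait() {
    node=$1
    timeout=${2:-300}   # seconds; 300 matches the value used in this mail

    # Put the node into standby; give up early if pcs itself fails.
    pcs cluster standby "$node" || return 1

    # Assumption: crm_resource --wait exits non-zero on timeout
    # ("Error performing operation: Timer expired").
    if ! crm_resource --wait --timeout "$timeout"; then
        echo "cluster did not settle within ${timeout}s after standby of $node" >&2
        return 2
    fi
    return 0
}
```

Called as `standby_and_wait brilxvm45 300`, the helper returns 2 in the failing scenario above, so the caller can distinguish a timed-out wait from a failed standby command.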
I have attached the DC's corosync.log for the applicable time period (timezone is UTC+10). (The last few lines in the corosync.log, the interruption of transition 141, are because of a subsequent standby being done for brilxvm43.)

A possible workaround I thought of was to make the sv_fencer resource slightly sticky (all the other resources are), but I'm not sure if this will just hide the problem for this specific scenario.

We are using Pacemaker 1.1.15 on RedHat 6.9.

Regards,
Leon
wait_corosync.log
Description: Binary data