On Mon, 2017-10-09 at 16:37 +1000, Leon Steffens wrote:
> Hi all,
>
> We have a use case where we want to place a node into standby and
> then wait for all the resources to move off the node (and be started
> on other nodes) before continuing.
>
> In order to do this we call:
> $ pcs cluster standby brilxvm45
> $ crm_resource --wait --timeout 300
>
> This works most of the time, but in one of our test environments we
> are hitting a problem:
>
> When we put the node in standby, the reported cluster transition is:
>
> $ /usr/sbin/crm_simulate -x pe-input-3595.bz2 -S
>
> Using the original execution date of: 2017-10-08 16:58:05Z
> ...
> Transition Summary:
>  * Restart sv_fencer                    (Started brilxvm43)
>  * Stop    sv.svtest.aa.sv.monitor:1    (brilxvm45)
>  * Move    sv.svtest.aa.26.partition    (Started brilxvm45 -> brilxvm43)
>  * Move    sv.svtest.aa.27.partition    (Started brilxvm45 -> brilxvm44)
>  * Move    sv.svtest.aa.28.partition    (Started brilxvm45 -> brilxvm43)
>
> We expect crm_resource --wait to return once sv_fencer (a fencing
> device) has been restarted (not sure why it's being restarted), and
> the 3 partition resources have been moved.
>
> But crm_resource actually times out after 300 seconds with the
> following error:
>
> Pending actions:
>   Action 40: sv_fencer_monitor_60000 on brilxvm44
>   Action 39: sv_fencer_start_0 on brilxvm44
>   Action 38: sv_fencer_stop_0 on brilxvm43
> Error performing operation: Timer expired
>
> It looks like it's waiting for the sv_fencer fencing agent to start
> on brilxvm44, even though the current transition did not include that
> move.
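
For reference, the standby-then-wait sequence quoted above could be wrapped roughly as follows. This is only a sketch: the function name standby_and_wait and its two parameters are made up for illustration, and it assumes nothing about crm_resource's exit status beyond "non-zero means it did not settle in time".

```shell
#!/bin/sh
# Sketch: put a node into standby, then block until the cluster is idle.
# standby_and_wait and its arguments are hypothetical names, not part of
# pcs or Pacemaker.

standby_and_wait() {
    node=$1
    timeout=${2:-300}   # same value as used in the quoted commands

    # Step 1: same command as in the quoted message.
    pcs cluster standby "$node" || return 1

    # Step 2: wait for the cluster to settle. A non-zero status covers
    # both a timeout ("Timer expired") and any other failure.
    if crm_resource --wait --timeout "$timeout"; then
        echo "cluster settled after standby of $node"
    else
        rc=$?
        echo "crm_resource --wait failed or timed out (exit $rc)" >&2
        return "$rc"
    fi
}
```

Called as `standby_and_wait brilxvm45 300`, it lets the caller tell a clean settle from a timeout and decide whether to continue.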

crm_resource --wait doesn't wait for a specific transition to complete;
it waits until no further actions are needed. That is one of its
limitations: if something keeps provoking a new transition, it will
never complete except by timeout.

> After the crm_resource --wait has timed out, we set a property on a
> different node (brilxvm43). This seems to trigger a new transition
> to move sv_fencer to brilxvm44:
>
> $ /usr/sbin/crm_simulate -x pe-input-3596.bz2 -S
> Using the original execution date of: 2017-10-08 17:03:27Z
>
> Transition Summary:
>  * Move    sv_fencer    (Started brilxvm43 -> brilxvm44)
>
> And from the corosync.log it looks like this transition triggers
> actions 38 - 40 (the ones crm_resource --wait waited for).
>
> So it looks like the crm_resource --wait knows about the transition
> to move the sv_fencer resource, but the subsequent setting of the
> node property is the one that actually triggers it (which is too
> late as it gets executed after the wait).
>
> I have attached the DC's corosync.log for the applicable time period
> (timezone is UTC+10). (The last few lines in the corosync - the
> interruption of transition 141 - is because of a subsequent standby
> being done for brilxvm43).
>
> A possible workaround I thought of was to make the sv_fencer resource
> slightly sticky (all the other resources are), but I'm not sure if
> this will just hide the problem for this specific scenario.
>
> We are using Pacemaker 1.1.15 on RedHat 6.9.
>
> Regards,
> Leon
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org