On 18.02.2019 18:53, Ken Gaillot wrote:
> On Sun, 2019-02-17 at 20:33 +0300, Andrei Borzenkov wrote:
>> On 17.02.2019 0:33, Andrei Borzenkov wrote:
>>> On 17.02.2019 0:03, Eric Robinson wrote:
>>>> Here are the relevant corosync logs.
>>>>
>>>> It appears that the stop action for resource p_mysql_002 failed,
>>>> and that caused a cascading series of service changes. However, I
>>>> don't understand why, since no other resources are dependent on
>>>> p_mysql_002.
>>>>
>>>
>>> You have mandatory colocation constraints for each SQL resource
>>> with the VIP. That means that to move an SQL resource to another
>>> node, pacemaker must also move the VIP to another node, which in
>>> turn means it needs to move all other dependent resources as well.
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: warning: check_migration_threshold:
>>>>     Forcing p_mysql_002 away from 001db01a after 1000000 failures (max=1000000)
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction:
>>>>     * Stop p_vip_clust01 ( 001db01a ) blocked
>>> ...
>>>> Feb 16 14:06:39 [3912] 001db01a pengine: notice: LogAction:
>>>>     * Stop p_mysql_001 ( 001db01a ) due to colocation with p_vip_clust01
>>
>> There is apparently more to it. Note that the p_vip_clust01 operation
>> is "blocked". That is because a mandatory order constraint is
>> symmetrical by default, so to move the VIP pacemaker first needs to
>> stop it on the current node; but before it can stop the VIP it needs
>> to (be able to) stop p_mysql_002; and it cannot do that, because by
>> default when "stop" fails without stonith the resource is blocked and
>> no further actions are possible - i.e. the resource can no longer be
>> stopped (or even attempted to be stopped).
>
> Correct, failed stop actions are special -- an on-fail policy of "stop"
> or "restart" requires a stop, so obviously they can't be applied to
> failed stops. As you mentioned, without fencing, on-fail defaults to
> "block" for stops, which should freeze the resource as it is.
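
[For reference, the constraint pattern being discussed looks roughly like
this. This is a sketch only, assuming the pcs shell; the resource names
come from the thread, but the scores and the exact command forms are
assumptions about the original configuration.]

```shell
# Sketch (assumed configuration): a mandatory (INFINITY) colocation ties
# each SQL resource to the VIP, and an order constraint - symmetrical by
# default - starts the VIP before the SQL resource.
pcs constraint colocation add p_mysql_002 with p_vip_clust01 INFINITY
pcs constraint order start p_vip_clust01 then start p_mysql_002

# Because the order constraint is symmetrical, stopping runs in reverse:
# p_mysql_002 must stop before p_vip_clust01 can stop. So when the stop
# of p_mysql_002 fails and is blocked (on-fail=block without fencing),
# the VIP stop - and every move that depends on it - is blocked too.
```
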
>
>> I still consider this rather questionable behavior. I tried to
>> reproduce it and I see the same.
>>
>> 1. After this happens, resource p_mysql_002 has target=Stopped in the
>> CIB. Why, oh why, does pacemaker try to "force away" a resource that
>> is not going to be started on another node anyway?
>
> Without having the policy engine inputs, I can't be sure, but I suspect
> p_mysql_002 is not being forced away, but its failure causes that node
> to be less preferred for the resources it depends on.
>
>> 2. pacemaker knows that it cannot stop (and hence move)
>> p_vip_clust01, yet it will happily stop all resources that depend on
>> it in preparation to move them, and then leave them at that because
>> it cannot move
>
> I think this is the point at which the behavior is undesirable, because
> it would be relevant whether the move was related to the blocked
> failure or not. Feel free to open a bug report and attach the relevant
> policy engine input (or a crm_report).
>
https://bugs.clusterlabs.org/show_bug.cgi?id=5379

>> them. Resources are neither restarted on the current node, nor moved
>> to another node. At this point I'd expect pacemaker to be smart
>> enough not to even initiate actions that are known to be
>> unsuccessful.
>>
>> The best we can do at this point is set symmetrical=false, which
>> allows the move to actually happen, but it still means downtime for
>> the resources that are moved, and has its own can of worms in the
>> normal case.
> --
> Ken Gaillot <kgail...@redhat.com>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
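
[For concreteness, the symmetrical=false workaround mentioned above could
be expressed roughly as follows. This is a sketch assuming the pcs shell
and the resource names from the thread, not the actual commands used.]

```shell
# Sketch (assumed commands): symmetrical=false makes the order
# constraint apply only in the start direction - the VIP starts before
# p_mysql_002, but stopping p_mysql_002 is no longer required before
# stopping the VIP. A blocked stop of p_mysql_002 then no longer blocks
# moving the VIP, at the cost of the downsides described above.
pcs constraint order start p_vip_clust01 then start p_mysql_002 symmetrical=false
```
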