On 21/08/19 14:48 +0200, Jan Pokorný wrote:
> On 20/08/19 20:55 +0200, Jan Pokorný wrote:
>> On 15/08/19 17:03 +0000, Michael Powell wrote:
>>> First, thanks to all for their responses. With your help, I'm steadily gaining competence WRT HA, albeit slowly.
>>>
>>> I've basically followed Harvey's workaround suggestion, and the failover I hoped for takes effect quite quickly. I nevertheless remain puzzled about why our legacy code, based upon Pacemaker 1.0/Heartbeat, works satisfactorily w/o such changes.
>>>
>>> Here's what I've done. First, based upon responses to my post, I've implemented the following commands when setting up the cluster:
>>>> crm_resource -r SS16201289RN00023 --meta -p resource-stickiness -v 0    # (someone asserted that this was unnecessary)
>>>> crm_resource --meta -r SS16201289RN00023 -p migration-threshold -v 1
>>>> crm_resource -r SS16201289RN00023 --meta -p failure-timeout -v 10
>>>> crm_resource -r SS16201289RN00023 --meta -p cluster-recheck-interval -v 15
>>>
>>> In addition, I've added "-l reboot" to those instances where 'crm_master' is invoked by the RA to change resource scores. I also found a location constraint in our setup that I couldn't understand the need for, and removed it.
>>>
>>> After doing this, in my initial tests, I found that after 'kill -9 <pid>' was issued to the master, the slave instance on the other node was promoted to master within a few seconds. However, it took 60 seconds before the killed resource was restarted. In examining the cib.xml file, I found an "rsc-options-failure-timeout", which was set to "1min". Thinking "aha!", I added the following line:
>>>> crm_attribute --type rsc_defaults --name failure-timeout --update 15
>>>
>>> Sadly, this does not appear to have had any impact.
>>
>> It won't have any impact on your SS16201289RN00023 resource, since that resource has its own failure-timeout set to 10 seconds per the above; being set closer to the resource, it overrides such an outermost fallback value (the built-in default would only be consulted if neither were set explicitly).
>>
>> Admittedly, the documentation could use a precise formalization of the precedence rules.
>>
>> Anyway, this all seems moot to me; see below.
>>
>>> So, while the good news is that the failover occurs as I'd hoped for, the time required to restart the failed resource seems excessive. Apparently setting the failure-timeout values isn't sufficient. As an experiment, I issued a "crm_resource -r <resourceid> --cleanup" command shortly after the failover took effect and found that the resource was quickly restarted. Is that the recommended procedure? If so, is changing the "failure-timeout" and "cluster-recheck-interval" really necessary?
>>>
>>> Finally, for a period of about 20 seconds while the resource is being restarted (it takes about 30s to start up the resource's application), it appears that both nodes are masters. E.g. here's the 'crm_mon' output.
>>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>> [root@mgraid-16201289RN00023-0 bin]# date;crm_mon -1
>>> Thu Aug 15 09:44:59 PDT 2019
>>> Stack: corosync
>>> Current DC: mgraid-16201289RN00023-0 (version 1.1.19-8.el7-c3c624ea3d) - partition with quorum
>>> Last updated: Thu Aug 15 09:44:59 2019
>>> Last change: Thu Aug 15 06:26:39 2019 by root via crm_attribute on mgraid-16201289RN00023-0
>>>
>>> 2 nodes configured
>>> 4 resources configured
>>>
>>> Online: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>>
>>> Active resources:
>>>
>>>  Clone Set: mgraid-stonith-clone [mgraid-stonith]
>>>      Started: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>>  Master/Slave Set: ms-SS16201289RN00023 [SS16201289RN00023]
>>>      Masters: [ mgraid-16201289RN00023-0 mgraid-16201289RN00023-1 ]
>>>      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> This seems odd. Eventually the just-started resource reverts to a slave, but this doesn't make sense to me.
>>>
>>> For the curious, I've attached the 'cibadmin --query' output from the DC node, taken just prior to issuing the 'kill -9' command, as well as the corosync.log file from the DC following the 'kill -9' command.
>>
>> Thanks for attaching both; with them, your situation as described above starts to make a bit more sense to me.
>>
>> I think you've observed (at least that's what I infer from the provided log) the resource role continually flapping due to a non-intermittent/deterministic obstacle on one of the nodes:
>>
>> nodes for simplicity: A, B
>> only the M/S resource in question is considered
>> notation: [node, state of the resource at hand]
>>
>> 1. [A, master], [B, slave]
>>
>> 2. for some reason, A fails (spotted with monitor)
>>
>> 3. an attempt to demote A is made; for likely the same reason, it also fails. Note that the agent itself notices something is pretty fishy there(!):
>>    > WARNING: ss_demote() trying to demote a resource that was not started
>>    ...while B gets promoted
>
> Note that the agent's monitor operation will also likely fail with OCF_FAILED_MASTER, rather than OCF_NOT_RUNNING, when the master role is assumed (even though Pacemaker's response appears the same as documented for the former, with demoting first, nonetheless). Yet again, it's admittedly terribly underspecified which exact responses are expected at which stage of a resource instance's life cycle, also owing to the fact that this is originally a "proprietary" extension over plain OCF.
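To make that distinction concrete, here is a minimal sketch of what a role-aware monitor could look like; ss_proc_running and ss_promoted_here are hypothetical helpers standing in for whatever the actual agent checks, so treat this as a shape of the logic, not as the agent's real code:

    : ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
    . ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

    ss_monitor() {
        if ss_proc_running; then
            # instance is alive; report the role it is actually in
            if ss_promoted_here; then
                return $OCF_RUNNING_MASTER
            fi
            return $OCF_SUCCESS
        fi

        # process is gone: OCF_NOT_RUNNING is only appropriate for a graceful
        # stop or a never-started instance; a dead instance that had been
        # promoted on this node should rather be reported as a failed master
        if ss_promoted_here; then
            return $OCF_FAILED_MASTER
        fi
        return $OCF_NOT_RUNNING
    }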
Sorry, forgot that, counterintuitively, the details are kept on the side of the API convention's dependants (the resource-agents project) rather than on the side of the convention's exerciser (Pacemaker itself):

https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#ocf_not_running-7

Apparently, in this case, the resource was _not_ "gracefully shut down", nor does "has never been started" apply (if I am not missing anything, that is). So returning OCF_NOT_RUNNING was inappropriate. I would make sure the agent conforms to all these rules before investigating further.

>> 4. [A, stopped], [B, master]
>>
>> 5. looks like at this point, there's a sort of a deadlock, since the missing instance of the resource cannot be started anywhere(?)
>>
>> 6. failure-timeout of 10 seconds expires on A, hence that missing resource instance gets started on A again
>>
>> 7. [A, slave], [B, master]
>>
>> 8. likely due to score assignments, A gets promoted again
>>
>> 9. goto 1.
>>
>> The main problem is that by setting failure-timeout, you make a contract with the cluster stack that you've taken precautions to guarantee that the fault on the particular node is very likely an intermittent one, not a systemic failure (unless the resource agent itself is the weakest link, badly reacting to circumstances etc., but you should have accounted for that in the first place, more so with a custom agent), so it's a valid decision to allow that resource back after a while (using common sense, it would be foolish otherwise).
>>
>> When that's not the case, you'll get what you asked for: forever unstable looping, due to the excessive tolerance granted by the failure timeout.
>>
>> Admittedly again, Pacemaker could have at least two levels of loop tracking within the resource life cycle, to possibly detect such eventually futile attempts without progress (liveness assurance). Currently it has only a single tight-loop tracking, IIUIC, which appears not to be enough when it is further neutralized with an explicitly set failure-timeout.
>>
>> (Private aside: any logging regarding notifications should only be available under some log tag not enabled by default; there's little value in logging that, especially when in-agent logging is commonly present as well for any effective review of what the agent actually obtained.)

--
Jan (Poki)
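As a quick way to observe both points above from the command line (a sketch using the resource and node names from this thread; exact option spellings may differ slightly between Pacemaker versions):

    # the resource-level meta-attribute takes precedence over rsc_defaults,
    # so this should report the 10 set on the resource, not the 15 set as a default
    crm_resource -r SS16201289RN00023 --meta -g failure-timeout

    # the per-node fail count that failure-timeout eventually clears
    crm_failcount -r SS16201289RN00023 -N mgraid-16201289RN00023-0 -G

    # one-shot cluster status including fail counts, handy for watching the flapping
    crm_mon -1 --failcounts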