On 01/19/2016 10:30 AM, Kostiantyn Ponomarenko wrote: > The resource that wasn't running, but was reported as running, is > "adminServer". > > Here are a brief chronological description: > > [Jan 19 23:42:16] The first time Pacemaker triggers its monitor function at > line #1107. (those lines are from its Resource Agent) > [Jan 19 23:42:16] Then Pacemaker starts the resource - line #1191. > [Jan 19 11:42:53] The first failure is reported by monitor operation at > line #1543. > [Jan 19 11:42:53] The fail-count is set, but I don't see any attempt from > Pacemaker to "start" the resource - the start function is not called (from > the logs) - line #1553. > [Jan 19 12:27:56] Then adminServer's monitor operation keeps returning > $OCF_NOT_RUNNING - starts at line #1860. > [Jan 19 12:57:53] Then the expired failcount is cleared at line #1969. > [Jan 19 12:57:53] Another call of the monitor function happens at line > #2038. > [Jan 19 12:57:53] I assume that the line #2046 means "not running" (?). > [Jan 19 12:57:53] The "stop" function is called - line #2150 > [Jan 19 12:57:53] The "start" function is called and the resource is > successfully started - line #2164 > > > The time change occurred while cluster was starting, I see this from > "journalctl --since="2016-01-19" --until="2016-01-20"": > > Jan 19 23:10:39 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c61c 0c clock_step > -43193.793349 s > Jan 19 11:10:45 A2-2U12-302-LS ntpd[2210]: 0.0.0.0 c614 04 freq_mode > Jan 19 11:10:45 A2-2U12-302-LS systemd[1]: Time has been changed > > I am attaching corosync.log.
The time change is interesting. I suspect what's happening is that pacemaker considers the failed monitor "older" than the original successful one, and so ignores it. In general, we don't support large clock shifts in a running cluster (a topic that has come up before on this list). But if you can reproduce the behavior with 1.1.14, feel free to open a bug report. It might be worth revisiting to see if there is anything we can do about it. > Thank you, > Kostia > > On Tue, Jan 19, 2016 at 5:17 PM, Bogdan Dobrelya <bdobre...@mirantis.com> > wrote: > >> On 19.01.2016 16:13, Ken Gaillot wrote: >>> On 01/19/2016 06:49 AM, Kostiantyn Ponomarenko wrote: >>>> One of resources in my cluster is not actually running, but "crm_mon" >> shows >>>> it with the "Started" status. >>>> Its resource agent's monitor function returns "$OCF_NOT_RUNNING", but >>>> Pacemaker doesn't react on this anyhow - crm_mon show the resource as >>>> Started. >>>> I couldn't find an explanation to this behavior, so I suppose it is a >> bug, >>>> is it? >>> >>> That is unexpected. Can you post the configuration and logs from around >>> the time of the issue? >>> >> >> Oh, sorry, I forgot to mention the related thread [0]. That is exactly >> the case I reported there. Looks same, so I thought you've just updated >> my thread :) >> >> These may be merged perhaps. >> >> [0] http://clusterlabs.org/pipermail/users/2016-January/002035.html >> >>> >>> _______________________________________________ >>> Users mailing list: Users@clusterlabs.org >>> http://clusterlabs.org/mailman/listinfo/users >>> >>> Project Home: http://www.clusterlabs.org >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >>> Bugs: http://bugs.clusterlabs.org _______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org