Dear all,

I have not received any response so far. Could you please find the time to tell me how the "meta failure-timeout" is supposed to work in combination with monitor operations?
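To make the timing concrete, here is a minimal sketch of the race I think I am seeing. It is not Pacemaker code. It assumes that the failure-timeout clock starts the moment FIRST fails and that the stop of SECOND begins right away; both are simplifications on my part, and the function and parameter names are made up for illustration. The two numbers come from my configuration and from the exec-time in the log below.

```python
# Illustrative timeline only; this is NOT how Pacemaker is implemented.
FAILURE_TIMEOUT_S = 60.0     # meta failure-timeout=60sec from my configuration
SECOND_STOP_EXEC_S = 62.827  # exec-time:62827ms of SECOND_stop_0 in the log


def failure_expires_before_stop_completes(failure_timeout_s, stop_exec_s,
                                          stop_delay_s=0.0):
    """Return True if FIRST's failure record expires before SECOND has stopped.

    stop_delay_s is a hypothetical delay between the failure of FIRST and
    the start of the stop operation for SECOND (assumed to be 0 here).
    """
    stop_done_s = stop_delay_s + stop_exec_s
    return failure_timeout_s <= stop_done_s


if __name__ == "__main__":
    expired = failure_expires_before_stop_completes(
        FAILURE_TIMEOUT_S, SECOND_STOP_EXEC_S)
    print("failure expired before stop finished:", expired)
```

With my values the failure record is already expired when the stop of SECOND completes, which is exactly the point at which the policy engine recalculates and reports FIRST as Started.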
Thanks,
Carsten

On Thu, Oct 16, 2014 at 05:06:41PM +0200, Carsten Otto wrote:
> Dear all,
>
> I configured meta failure-timeout=60sec on all of my resources. For the
> sake of simplicity, assume I have a group of two resources FIRST and
> SECOND (where SECOND is started after FIRST, surprise!).
>
> If now FIRST crashes, I see a failure, as expected. I also see that
> SECOND is stopped, as expected.
>
> Sadly, SECOND needs more than 60 seconds to stop. Thus, it can happen
> that the "failure-timeout" for FIRST is reached, and its failure is
> cleaned. This also is expected.
>
> The problem now is that after the 60sec timeout pacemaker assumes that
> FIRST is in the Started state. There is no indication about that in the
> log files, and the last monitor operation which ran just a few seconds
> before also indicated that FIRST is actually not running.
>
> As a consequence of the bug, pacemaker tries to re-start SECOND on the
> same system, which fails to start (as it depends on FIRST, which
> actually is not running). Only then the resources are started on the
> other system.
>
> So, my question is:
> Why does pacemaker assume that a previously failed resource is "Started"
> when the "meta failure-timeout" is triggered? Why is the monitor
> operation not invoked to determine the correct state?
>
> The corresponding lines of the log file, about a minute after FIRST
> crashed and the stop operation for SECOND was triggered:
>
> Oct 16 16:27:20 [2100] HOSTNAME [...] (monitor operation indicating that
> FIRST is not running)
> [...]
> Oct 16 16:27:23 [2104] HOSTNAME lrmd: info: log_finished:
> finished - rsc:SECOND action:stop call_id:123 pid:29314 exit-code:0
> exec-time:62827ms queue-time:0ms
> Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: process_lrm_event:
> LRM operation SECOND_stop_0 (call=123, rc=0, cib-update=225, confirmed=true)
> ok
> Oct 16 16:27:23 [2107] HOSTNAME crmd: info: match_graph_event:
> Action SECOND_stop_0 (74) confirmed on HOSTNAME (rc=0)
> Oct 16 16:27:23 [2107] HOSTNAME crmd: notice: run_graph:
> Transition 40 (Complete=5, Pending=0, Fired=0, Skipped=31, Incomplete=10,
> Source=/var/lib/pacemaker/pengine/pe-input-2937.bz2): Stopped
> Oct 16 16:27:23 [2107] HOSTNAME crmd: info: do_state_transition:
> State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=local/crmd/225, version=0.1450.89)
> Oct 16 16:27:23 [2100] HOSTNAME cib: info: cib_process_request:
> Completed cib_query operation for section 'all': OK (rc=0,
> origin=local/crmd/226, version=0.1450.89)
> Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_config:
> On loss of CCM Quorum: Ignore
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info:
> determine_online_status_fencing: Node HOSTNAME is active
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info:
> determine_online_status: Node HOSTNAME is online
> [...]
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full:
> FIRST has failed 1 times on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op:
> Clearing expired failcount for FIRST on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full:
> FIRST has failed 1 times on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op:
> Clearing expired failcount for FIRST on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: get_failcount_full:
> FIRST has failed 1 times on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op:
> Clearing expired failcount for FIRST on HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: notice: unpack_rsc_op:
> Re-initiated expired calculated failure FIRST_last_failure_0 (rc=7,
> magic=0:7;68:31:0:28c68203-6990-48fd-96cc-09f86e2b21f9) on HOSTNAME
> [...]
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: group_print: Resource
> Group: GROUP
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print:
> FIRST (ocf::heartbeat:xxx): Started HOSTNAME
> Oct 16 16:27:23 [2106] HOSTNAME pengine: info: native_print:
> SECOND (ocf::heartbeat:yyy): Stopped
>
> Thank you,
> Carsten
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

--
andrena objects ag
Frankfurt office
Clemensstr. 8
60487 Frankfurt

Tel: +49 (0) 69 977 860 38
Fax: +49 (0) 69 977 860 39
http://www.andrena.de

Executive Board: Hagen Buchwald, Matthias Grund, Dr. Dieter Kuhn
Chairman of the Supervisory Board: Rolf Hetzelberger

Registered office: Karlsruhe
Local court (Amtsgericht) Mannheim, HRB 109694
VAT ID: DE174314824

Please also note our upcoming events: http://www.andrena.de/events