Hi,

On Mon, Jun 07, 2010 at 12:13:41PM +0200, Andreas Kurz wrote:
> Hi all,
>
> I observed a strange behaviour when trying to stop two resources with latest
> pacemaker:
>
> I updated two resources (ping) and changed some constraints. One of the
> changed resources is mentioned in the logs with "strange" lrmd messages :
>
> ...
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: ERROR: do_lrm_rsc_op: Operation
> monitor on res_ping_ABC failed: -1
> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: notice: on_msg_perform_op:
> resource res_ping_ABC is frozen, no ops can run.

This happens when the resource is being deleted or its operations
are being flushed while an operation is still running on the
resource and lrmd is waiting for that operation to finish. Until
that operation is done, no new operations can run on the resource.

> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: debug: RA output [dummy status
> to
> fool heartbeat
> ] didn't match any pattern
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: WARN: do_log: FSA: Input I_FAIL
> from do_lrm_rsc_op() received in state S_TRANSITION_ENGINE
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_FAIL
> cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> ....
>
> Then I try to stop two other resources (part of a group) and nothing happens.
> One of this resources is a dependency of res_ping_ABC that is mentioned as
> "frozen" by the lrmd.
>
> Running ptest -L shows that pengine knows what to do (stop the two resources
> and all dependencies).

Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: native_print: res_ping_ABC (ocf::pacemaker:ping): Started emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: WARN: check_action_definition: Parameters to res_ping_ABC_start_0 on emahqwienfw1b changed: recorded 3e6589d0db01fb229fd441bb0d1d50f3 vs. 584dbc4ad2ec43013bd447445557c554 (all:3.0.1) 0:0;22:344:0:8e44c059-ca7d-41ce-b81a-793882819347
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: RecurringOp: Start recurring monitor (30s) for res_ping_ABC on emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: LogActions: Restart resource res_ping_ABC (Started emahqwienfw1b)
Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: te_rsc_command: Initiating action 42: monitor res_ping_ABC_monitor_0 on emahqwienfw1a

PE decides to restart the resource, but then it does a probe even
though the resource's state is Started. That operation fails, but
should be retried.
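
To make the "frozen" state concrete, here is a toy model in Python.
It is only a sketch of the behaviour described above, not lrmd's
implementation: the class and method names (Resource, perform_op,
flush_ops, op_done) are made up for illustration, and the -1 return
value merely mirrors the "failed: -1" that crmd logged.

```python
class Resource:
    """Toy stand-in for a resource managed by lrmd (hypothetical names)."""

    def __init__(self, rsc_id):
        self.rsc_id = rsc_id
        self.running_op = None  # operation currently executing, if any
        self.pending = []       # operations queued behind the running one
        self.frozen = False     # True while a flush waits for running_op

    def perform_op(self, op):
        """Accept an operation; return 0 if accepted, -1 if frozen."""
        if self.frozen:
            return -1           # "resource ... is frozen, no ops can run"
        if self.running_op is None:
            self.running_op = op
        else:
            self.pending.append(op)
        return 0

    def flush_ops(self):
        """Drop queued operations; freeze if one is still executing."""
        self.pending.clear()
        if self.running_op is not None:
            self.frozen = True

    def op_done(self):
        """The running operation finished: unfreeze or start the next one."""
        self.running_op = None
        if self.frozen:
            self.frozen = False
        elif self.pending:
            self.running_op = self.pending.pop(0)


rsc = Resource("res_ping_ABC")
rsc.perform_op("monitor_30s")         # accepted, now running
rsc.flush_ops()                       # flush arrives while it still runs
rejected = rsc.perform_op("monitor_0")  # -1: the resource is frozen
rsc.op_done()                         # running operation finishes
accepted = rsc.perform_op("monitor_0")  # 0: a retry now goes through
```

In this model the rejected probe only succeeds if the caller tries
again after the running operation has finished, which is the retry
mentioned above.
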

Obviously we need to improve the interaction between lrmd and
crmd. Please file a bugzilla.

Thanks,

Dejan

> Any ideas? hb_report is attached .... I left the cluster in this state so if
> there is anything else I should provide for debugging please tell me.
>
> Regards,
> Andreas
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker