Re: [Pacemaker] cluster got stuck on stopping resources

Dejan Muhamedagic Mon, 07 Jun 2010 05:38:56 -0700

Hi,

On Mon, Jun 07, 2010 at 12:13:41PM +0200, Andreas Kurz wrote:
> Hi all,
> 
> I observed a strange behaviour when trying to stop two resources with the
> latest pacemaker:
> 
> I updated two resources (ping) and changed some constraints. One of the 
> changed resources is mentioned in the logs with "strange" lrmd messages:
> 
> ...
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: ERROR: do_lrm_rsc_op: Operation 
> monitor on res_ping_ABC failed: -1
> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: notice: on_msg_perform_op: 
> resource res_ping_ABC is frozen, no ops can run.


This happens when the resource is being deleted or its operations
are being flushed while an operation is still running on the
resource: lrmd waits for that operation to finish. Until it is
done, no new operations can run on the resource.
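
To illustrate the behaviour described above, here is a toy model in
Python. It is only a sketch and not lrmd's actual code: the class, its
method names, and the return codes are hypothetical and chosen to
mirror the log messages (a refused operation returns -1).

```python
# Toy model only: NOT lrmd's implementation. All names are hypothetical.

class Resource:
    def __init__(self, name):
        self.name = name
        self.running_op = None  # operation currently executing, if any
        self.frozen = False     # set while a delete/flush waits for running_op

    def perform_op(self, op):
        """Try to start an operation; return 0 if accepted, -1 if refused."""
        if self.frozen:
            # Mirrors "resource ... is frozen, no ops can run."
            return -1
        self.running_op = op
        return 0

    def flush_ops(self):
        """Request a flush; freeze if an operation is still in flight."""
        if self.running_op is not None:
            self.frozen = True

    def op_finished(self):
        """The in-flight operation completed, so the resource thaws."""
        self.running_op = None
        self.frozen = False


rsc = Resource("res_ping_ABC")
results = []
results.append(rsc.perform_op("monitor"))  # accepted
rsc.flush_ops()                            # monitor still running: freeze
results.append(rsc.perform_op("monitor"))  # refused while frozen
rsc.op_finished()                          # running operation completes
results.append(rsc.perform_op("stop"))     # accepted again
print(results)  # [0, -1, 0]
```

In this model the second monitor is refused only because the first one
has not finished yet; once it completes, new operations are accepted.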

> Jun 07 10:16:58 emahqwienfw1b lrmd: [31351]: debug: RA output [dummy status 
> to 
> fool heartbeat
> ] didn't match any pattern
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: WARN: do_log: FSA: Input I_FAIL 
> from do_lrm_rsc_op() received in state S_TRANSITION_ENGINE
> Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_FAIL 
> cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
> ....
> 
> Then I try to stop two other resources (part of a group) and nothing happens. 
> One of these resources is a dependency of res_ping_ABC that is mentioned as 
> "frozen" by the lrmd. 
> 
> Running ptest -L shows that pengine knows what to do (stop the two resources 
> and all dependencies).

Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: native_print: 
res_ping_ABC      (ocf::pacemaker:ping):  Started emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: WARN: check_action_definition: 
Parameters to res_ping_ABC_start_0 on emahqwienfw1b changed: recorded 
3e6589d0db01fb229fd441bb0d1d50f3 vs. 584dbc4ad2ec43013bd447445557c554 
(all:3.0.1) 0:0;22:344:0:8e44c059-ca7d-41ce-b81a-793882819347
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: RecurringOp:  Start 
recurring monitor (30s) for res_ping_ABC on emahqwienfw1b
Jun 07 10:16:57 emahqwienfw1b pengine: [31711]: notice: LogActions: Restart 
resource res_ping_ABC       (Started emahqwienfw1b)
Jun 07 10:16:58 emahqwienfw1b crmd: [31354]: info: te_rsc_command: Initiating 
action 42: monitor res_ping_ABC_monitor_0 on emahqwienfw1a

The PE decides to restart the resource, but then it schedules a
probe (res_ping_ABC_monitor_0) even though the resource's state is
Started. That operation fails, but it should be retried. Obviously
we need to improve the interaction between lrmd and crmd. Please
file a bugzilla.
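
As a rough illustration of what retrying a refused operation could
look like, here is a minimal Python sketch. It does not describe how
crmd behaves or should be fixed: retry_op, its parameters, and the
simulated return codes are all hypothetical.

```python
import time

def retry_op(try_op, attempts=5, delay=0.0):
    """Call try_op() until it returns 0 or attempts run out.

    Hypothetical helper: returns (last_rc, number_of_tries).
    """
    rc = -1
    tries = 0
    for tries in range(1, attempts + 1):
        rc = try_op()
        if rc == 0:
            break
        if tries < attempts:
            time.sleep(delay)  # wait before the next attempt
    return rc, tries


# Simulated operation: refused twice (-1), then accepted (0).
outcomes = iter([-1, -1, 0])
result = retry_op(lambda: next(outcomes))
print(result)  # (0, 3)
```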

Thanks,

Dejan

> Any ideas? hb_report is attached .... I left the cluster in this state so if 
> there is anything else I should provide for debugging please tell me.
> 
> Regards,
> Andreas
> 


> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: 
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


