On Tue, Sep 21, 2010 at 8:59 AM, <renayama19661...@ybb.ne.jp> wrote:
> Hi,
>
> Node was in state that the load was very high, and we confirmed monitor
> movement of Pacemaker.
> Action Lost occurred in stop movement after the error of the monitor occurred.
>
> Sep 8 20:02:22 cgl54 crmd: [3507]: ERROR: print_elem: Aborting transition, action lost: [Action 9]: In-flight (id: prmApPostgreSQLDB1_stop_0, loc: cgl49, priority: 0)
> Sep 8 20:02:22 cgl54 crmd: [3507]: info: abort_transition_graph: action_timer_callback:486 - Triggered transition abort (complete=0) : Action lost
>
> For the load of the node, We think that the stop movement did not go well.
> But cannot nodes execute stonith.

A long time ago in a galaxy far away, some messaging layers used to lose quite a few actions, including stops. Around the same time, we decided that fencing a node because a stop action was lost wasn't a good idea.

The rationale was that if the operation eventually completed, its result would end up in the CIB anyway. And even if it didn't, the PE would keep retrying the operation until the whole node fell over, at which point it would get shot anyway.

Now, having said that, things have improved since then, and perhaps, in the interest of speeding up recovery in these situations, it is time to stop treating stop operations differently.

Would you agree?