Re: [Pacemaker] Action from a different CRMD transition results in restarting services
On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef wrote:
>
> Andrew Beekhof wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>> the recurring monitor operation.
>> I suspect you'll find the action failed earlier in the logs and that's
>> why it needed to be restarted.
>>
>> Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if possible.
>
> I searched everywhere for transaction 16048 and I couldn't find a trace
> of it (I looked through up to 5 days of logs prior to transaction
> 18014). It would have been good if we had timestamps for each
> transaction involved in this situation :-)
>
> Is there a way to find out about this old transaction in any other
> logs? (I looked in /var/log/messages on both nodes involved in this
> cluster.)

It's not really relevant. The only important thing is that it's not one
we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed at around Dec 6 22:55:05 (see the search sketch at the
end of this message).

> To give you an idea of how many transactions happened during this
> period:
>
>   TR_ID 18010 @ 21:52:16
>   ...
>   TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events.
>
> Given this, how come such a (very) old transaction (~2000 transactions
> before the current one) only acts now? Could it be stale information in
> Pacemaker?

No. It hasn't only just acted now. It's been repeating over and over for
the last few weeks or so. The difference is that this time it failed.

> Thanks in advance.
>
> Youssef
>
> Message: 4 from Pacemaker Digest, Vol 61, Issue 34
> [Quoted digest copy of Andrew's reply of Thu, 13 Dec 2012 trimmed; it
> appears in full further down this thread.]
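A minimal sketch for that search, assuming the syslog location already
mentioned above (/var/log/messages) and the resource name from the logs
(msgbroker); the exact error text depends on the resource agent, so
treat the patterns as starting points only:

  # On Node0 (where the monitor ran), look for msgbroker activity in
  # the window where the failcount was bumped. Note that syslog pads
  # single-digit days with two spaces ("Dec  6").
  grep 'msgbroker' /var/log/messages | grep 'Dec  6 22:5'

  # The lrmd and the resource agent usually log the underlying error,
  # so widen the net if the first pass turns up nothing:
  grep -E 'lrmd|crmd|pengine' /var/log/messages | grep 'Dec  6 22:5' | less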
Re: [Pacemaker] Action from a different CRMD transition results in restarting services
Andrew Beekhof wrote:
> 18014 is where we're up to now, 16048 is the (old) one that scheduled
> the recurring monitor operation.
> I suspect you'll find the action failed earlier in the logs and that's
> why it needed to be restarted.
>
> Not the best log message though :(

Thanks Andrew for the quick answer. I still need more info if possible.

I searched everywhere for transaction 16048 and I couldn't find a trace
of it (I looked through up to 5 days of logs prior to transaction
18014). It would have been good if we had timestamps for each
transaction involved in this situation :-)

Is there a way to find out about this old transaction in any other logs?
(I looked in /var/log/messages on both nodes involved in this cluster.)

To give you an idea of how many transactions happened during this
period:

  TR_ID 18010 @ 21:52:16
  ...
  TR_ID 18018 @ 22:55:25

Over an hour between these two events.

Given this, how come such a (very) old transaction (~2000 transactions
before the current one) only acts now? Could it be stale information in
Pacemaker?

Thanks in advance.

Youssef


Message: 4 from Pacemaker Digest, Vol 61, Issue 34
--------------------------------------------------
Date: Thu, 13 Dec 2012 10:52:42 +1100
From: Andrew Beekhof
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Action from a different CRMD transition
  results in restarting services
Content-Type: text/plain; charset=windows-1252

[Quoted digest copy of Andrew's reply trimmed; it appears in full
further down this thread.]
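On the question of digging up an old transition: each transition's
policy-engine input is archived to disk (the pe-input files named in
the logs, e.g. /var/lib/pengine/pe-input-3270.bz2 for 18014) and can be
replayed offline. Only a limited number of these files are retained,
which is probably why 16048 has aged out. A sketch, assuming the
Pacemaker 1.1-era tools (older installs ship ptest instead of
crm_simulate):

  # List the archived policy-engine inputs, newest first:
  ls -lt /var/lib/pengine/pe-input-*.bz2 | head

  # Replay the saved input for transition 18014 and show which actions
  # the policy engine would schedule from that cluster state:
  crm_simulate -S -x /var/lib/pengine/pe-input-3270.bz2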
Re: [Pacemaker] Action from a different CRMD transition results in restarting services
On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef wrote:
> Hi,
>
> I ran into the following issue and I couldn't find what it really
> means:
>
>   Detected action msgbroker_monitor_1 from a different transition:
>   16048 vs. 18014

18014 is where we're up to now, 16048 is the (old) one that scheduled
the recurring monitor operation.
I suspect you'll find the action failed earlier in the logs and that's
why it needed to be restarted.

Not the best log message though :(

> I can see that its impact is to stop/start a service but I'd like to
> understand it a bit more.
>
> Thank you in advance for any information.
>
> Logs about this issue:
>
> ...
> Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event:
>   Detected action msgbroker_monitor_1 from a different transition:
>   16048 vs. 18014
> Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
>   failcount for msgbroker on Node0 after failed monitor: rc=7
>   (update=value++, time=1354852505)
> ...
> Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
>   failed op msgbroker_monitor_1 on Node0: not running (7)
> ...
> Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
>   msgbroker can fail 99 more times on Node0 before being forced off
> ...
> Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
>   msgbroker (Started Node0)
> ...
> Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
>   action 37: stop msgbroker_stop_0 on Node0
>
> [Remaining quoted logs, including the transition 18014 details,
> trimmed; they appear in full in the original message below.]
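On the recovery mechanics: the update_failcount warning above bumps a
per-node fail-count attribute, and the policy engine compares it
against the resource's migration-threshold (hence the "can fail 99 more
times on Node0" notice). A sketch of how to inspect that counter, and
reset it once the root cause is fixed; flag spellings differ slightly
across Pacemaker versions, these match the 1.1-era CLI:

  # One-shot cluster status including per-resource fail counts:
  crm_mon -1 --failcounts

  # Query msgbroker's current fail count on Node0:
  crm_failcount -G -r msgbroker -N Node0

  # Once the underlying problem is fixed, clear the failure history so
  # the resource is re-evaluated with a clean slate:
  crm_resource --cleanup -r msgbroker -N Node0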
[Pacemaker] Action from a different CRMD transition results in restarting services
Hi,

I ran into the following issue and I couldn't find what it really means:

  Detected action msgbroker_monitor_1 from a different transition:
  16048 vs. 18014

I can see that its impact is to stop/start a service but I'd like to
understand it a bit more.

Thank you in advance for any information.

Logs about this issue:

...
Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected
  action msgbroker_monitor_1 from a different transition: 16048 vs. 18014
Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
  process_graph_event:477 - Triggered transition abort (complete=1,
  tag=lrm_rsc_op, id=msgbroker_monitor_1,
  magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692,
  cib=0.971.5) : Old event
Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
  failcount for msgbroker on Node0 after failed monitor: rc=7
  (update=value++, time=1354852505)
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
  transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
  cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
  cluster nodes are eligible to run resources.
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
  Requesting the current CIB: S_POLICY_ENGINE
Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
  te_update_diff:142 - Triggered transition abort (complete=1,
  tag=nvpair, id=status-Node0-fail-count-msgbroker, magic=NA,
  cib=0.971.6) : Transient attribute: update
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
  Requesting the current CIB: S_POLICY_ENGINE
Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
  te_update_diff:142 - Triggered transition abort (complete=1,
  tag=nvpair, id=status-Node0-last-failure-msgbroker, magic=NA,
  cib=0.971.7) : Transient attribute: update
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
  Requesting the current CIB: S_POLICY_ENGINE
Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
  hash entry for last-failure-msgbroker
Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
  Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
  quorate=1
Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of
  CCM Quorum: Ignore
Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation
  txpublisher_monitor_0 found resource txpublisher active on Node1
Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
  failed op msgbroker_monitor_1 on Node0: not running (7)
...
Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
  msgbroker can fail 99 more times on Node0 before being forced off
...
Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp: Start
  recurring monitor (10s) for msgbroker on Node0
...
Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
  msgbroker (Started Node0)
...
Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
  action 37: stop msgbroker_stop_0 on Node0


Transition 18014 details:

Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
  Transition 18014: PEngine Input stored in:
  /var/lib/pengine/pe-input-3270.bz2
Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
  transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
  cause=C_IPC_MESSAGE origin=handle_response ]
Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
  transition 18014: 0 actions in 0 synapses
Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph
  18014 (ref=pe_calc-dc-1354852338-39406) derived from
  /var/lib/pengine/pe-input-3270.bz2
Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:
Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014
  (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
  Source=/var/lib/pengine/pe-input-3270.bz2): Complete
Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
  18014 is now complete
Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014
  status: done -
Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
  transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
  cause=C_FSA_INTERNAL origin=notify_crmd ]
Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting
  PEngine Recheck Timer

Youssef

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org