Re: [Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-13 Thread Andrew Beekhof
On Fri, Dec 14, 2012 at 1:33 AM, Latrous, Youssef
 wrote:
>
> Andrew Beekhof  wrote:
>> 18014 is where we're up to now, 16048 is the (old) one that scheduled
>> the recurring monitor operation.
>> I suspect you'll find the action failed earlier in the logs and that's
>> why it needed to be restarted.
>>
>> Not the best log message though :(
>
> Thanks Andrew for the quick answer. I still need more info if possible.
>
> I searched everywhere for transition 16048 and I couldn't find a trace
> of it (looked for up to 5 days of logs prior to transition 18014).
> It would have been good if we had timestamps for each transition
> involved in this situation :-)
>
> Is there a way to find out about this old transition in any other logs (I
> looked into /var/log/messages on both nodes involved in this cluster)?

It's not really relevant.
The only important thing is that it's not one we're currently executing.

What you should care about is any logs that hopefully show you why the
resource failed at around Dec  6 22:55:05.
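
A sketch of that search (the resource name and paths come from this thread;
adjust them for your cluster, and point it at /var/log/messages on both nodes):

```shell
# Sketch: keep only the resource's lines, then narrow to the minutes
# leading up to the 22:55:05 failure.
find_failure() {
    grep 'msgbroker' "$1" | grep 'Dec  6 22:5'
}

# Demo input: two lines quoted elsewhere in this thread.
tmp=$(mktemp)
printf '%s\n' \
  'Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition 18014 is now complete' \
  'Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating failcount for msgbroker on Node0 after failed monitor: rc=7' \
  >"$tmp"
find_failure "$tmp"   # prints only the update_failcount WARN line
rm -f "$tmp"
```

On the real nodes you would also want the resource agent's own logging from
around that time, since rc=7 only says the agent reported "not running".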

>
> To give you an idea of how many transitions happened during this
> period:
>TR_ID 18010 @ 21:52:16
>...
>TR_ID 18018 @ 22:55:25
>
> Over an hour between these two events.
>
> Given this, how come such a (very) old transition (~2000 transitions
> before the current one) only acts now? Could it be stale information in
> Pacemaker?

No. It hasn't only just acted now. It's been repeating over and over
for the last few weeks or so.
The difference is that this time it failed.
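
For reference, the "16048" the crmd compares against comes out of the
operation's "magic" string, which is quoted in the abort_transition_graph
line in the logs below. A sketch of pulling it apart, assuming the field
layout used by Pacemaker 1.1 (treat the layout as an assumption if your
version differs):

```shell
# Sketch: decode the lrm_rsc_op "magic" string from the logs.
# Assumed layout: <op-status>:<rc>;<action-id>:<transition>:<target-rc>:<transition-uuid>
magic='0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692'

status=${magic%%;*}                        # 0:7
key=${magic#*;}                            # 104:16048:0:<uuid>
op_status=${status%%:*}                    # 0
rc=${status#*:}                            # 7 = OCF_NOT_RUNNING
action_id=${key%%:*}                       # 104
transition=$(echo "$key" | cut -d: -f2)    # 16048
echo "rc=$rc transition=$transition"       # prints: rc=7 transition=16048
```

So the recurring monitor carries the number of the transition that first
scheduled it, and keeps carrying it on every repeat, which is why an old
number can surface in a current log line.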

>
> Thanks in advance.
>
> Youssef
>
>
> [...]

Re: [Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-13 Thread Latrous, Youssef

Andrew Beekhof  wrote:
> 18014 is where we're up to now, 16048 is the (old) one that scheduled
> the recurring monitor operation.
> I suspect you'll find the action failed earlier in the logs and that's
> why it needed to be restarted.
>
> Not the best log message though :(

Thanks Andrew for the quick answer. I still need more info if possible.

I searched everywhere for transition 16048 and I couldn't find a trace
of it (looked for up to 5 days of logs prior to transition 18014).
It would have been good if we had timestamps for each transition
involved in this situation :-)

Is there a way to find out about this old transition in any other logs (I
looked into /var/log/messages on both nodes involved in this cluster)?
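
For the archives, one place a transition number can be chased (a sketch; the
paths are the defaults that appear later in this thread): on the DC, the
pengine logs which saved file holds each transition's input, and that mapping
survives log rotation.

```shell
# Sketch: the pengine logs "Transition NNNNN: PEngine Input stored in: <file>".
# Against rotated logs the real search would be something like:
#   zgrep 'Transition 16048:' /var/log/messages*
# The named /var/lib/pengine/pe-input-*.bz2 file can then be inspected
# (crm_simulate can replay such files, if your version ships it).
find_transition() {
    grep -o "Transition $1: PEngine Input stored in: [^ ]*" "$2"
}

# Demo on the pengine line quoted later in this thread:
tmp=$(mktemp)
printf '%s\n' 'Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message: Transition 18014: PEngine Input stored in: /var/lib/pengine/pe-input-3270.bz2' >"$tmp"
find_transition 18014 "$tmp"   # prints the transition-to-file mapping
rm -f "$tmp"
```

Whether a file for a transition as old as 16048 still exists depends on how
many pe-input files the cluster is configured to keep.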

To give you an idea of how many transitions happened during this
period:
   TR_ID 18010 @ 21:52:16
   ...
   TR_ID 18018 @ 22:55:25

Over an hour between these two events.

Given this, how come such a (very) old transition (~2000 transitions
before the current one) only acts now? Could it be stale information in
Pacemaker?

Thanks in advance.

Youssef


Message: 4  from Pacemaker Digest, Vol 61, Issue 34
---
Date: Thu, 13 Dec 2012 10:52:42 +1100
From: Andrew Beekhof 
To: The Pacemaker cluster resource manager
    
Subject: Re: [Pacemaker] Action from a different CRMD transition
    results in restarting services
Message-ID:


Content-Type: text/plain; charset=windows-1252

On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
 wrote:
> Hi,
>
>
>
> I run into the following issue and I couldn't find what it really
> means:
>
>
>
> Detected action msgbroker_monitor_1 from a different transition:
> 16048 vs. 18014

18014 is where we're up to now, 16048 is the (old) one that scheduled
the recurring monitor operation.
I suspect you'll find the action failed earlier in the logs and that's
why it needed to be restarted.

Not the best log message though :(

> [...]

Re: [Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-12 Thread Andrew Beekhof
On Thu, Dec 13, 2012 at 6:31 AM, Latrous, Youssef
 wrote:
> Hi,
>
>
>
> I run into the following issue and I couldn’t find what it really means:
>
>
>
> Detected action msgbroker_monitor_1 from a different transition:
> 16048 vs. 18014

18014 is where we're up to now, 16048 is the (old) one that scheduled
the recurring monitor operation.
I suspect you'll find the action failed earlier in the logs and that's
why it needed to be restarted.

Not the best log message though :(

> [...]

[Pacemaker] Action from a different CRMD transition results in restarting services

2012-12-12 Thread Latrous, Youssef
Hi,

 

I run into the following issue and I couldn't find what it really means:

 

Detected action msgbroker_monitor_1 from a different
transition: 16048 vs. 18014

 

I can see that its impact is to stop/start a service but I'd like to
understand it a bit more.

 

Thank you in advance for any information.

 

 

Logs about this issue:

...

Dec  6 22:55:05 Node1 crmd: [5235]: info: process_graph_event: Detected
action msgbroker_monitor_1 from a different transition: 16048 vs.
18014

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
process_graph_event:477 - Triggered transition abort (complete=1,
tag=lrm_rsc_op, id=msgbroker_monitor_1,
magic=0:7;104:16048:0:5fb57f01-3397-45a8-905f-c48cecdc8692, cib=0.971.5)
: Old event

Dec  6 22:55:05 Node1 crmd: [5235]: WARN: update_failcount: Updating
failcount for msgbroker on Node0 after failed monitor: rc=7
(update=value++, time=1354852505)

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_FSA_INTERNAL origin=abort_transition_graph ]

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28069:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
id=status-Node0-fail-count-msgbroker, magic=NA, cib=0.971.6) : Transient
attribute: update

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28070:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 crmd: [5235]: info: abort_transition_graph:
te_update_diff:142 - Triggered transition abort (complete=1, tag=nvpair,
id=status-Node0-last-failure-msgbroker, magic=NA, cib=0.971.7) :
Transient attribute: update

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke: Query 28071:
Requesting the current CIB: S_POLICY_ENGINE

Dec  6 22:55:05 Node1 attrd: [5232]: info: find_hash_entry: Creating
hash entry for last-failure-msgbroker

Dec  6 22:55:05 Node1 crmd: [5235]: info: do_pe_invoke_callback:
Invoking the PE: query=28071, ref=pe_calc-dc-1354852505-39407, seq=12,
quorate=1

Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_config: On loss of
CCM Quorum: Ignore

Dec  6 22:55:05 Node1 pengine: [5233]: notice: unpack_rsc_op: Operation
txpublisher_monitor_0 found resource txpublisher active on Node1

Dec  6 22:55:05 Node1 pengine: [5233]: WARN: unpack_rsc_op: Processing
failed op msgbroker_monitor_1 on Node0: not running (7)

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: common_apply_stickiness:
msgbroker can fail 99 more times on Node0 before being forced off

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: RecurringOp:  Start
recurring monitor (10s) for msgbroker on Node0

...

Dec  6 22:55:05 Node1 pengine: [5233]: notice: LogActions: Recover
msgbroker  (Started Node0)

...

Dec  6 22:55:05 Node1 crmd: [5235]: info: te_rsc_command: Initiating
action 37: stop msgbroker_stop_0 on Node0

 

 

Transition 18014 details:

 

Dec  6 22:52:18 Node1 pengine: [5233]: notice: process_pe_message:
Transition 18014: PEngine Input stored in:
/var/lib/pengine/pe-input-3270.bz2

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
cause=C_IPC_MESSAGE origin=handle_response ]

Dec  6 22:52:18 Node1 crmd: [5235]: info: unpack_graph: Unpacked
transition 18014: 0 actions in 0 synapses

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_te_invoke: Processing graph
18014 (ref=pe_calc-dc-1354852338-39406) derived from
/var/lib/pengine/pe-input-3270.bz2

Dec  6 22:52:18 Node1 crmd: [5235]: info: run_graph:


Dec  6 22:52:18 Node1 crmd: [5235]: notice: run_graph: Transition 18014
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pengine/pe-input-3270.bz2): Complete

Dec  6 22:52:18 Node1 crmd: [5235]: info: te_graph_trigger: Transition
18014 is now complete

Dec  6 22:52:18 Node1 crmd: [5235]: info: notify_crmd: Transition 18014
status: done - 

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]

Dec  6 22:52:18 Node1 crmd: [5235]: info: do_state_transition: Starting
PEngine Recheck Timer

 

 

Youssef

 

 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org