Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-05-01 Thread Andrew Beekhof
On Mon, Apr 30, 2012 at 10:44 PM, Lars Ellenberg wrote:
> On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
>> On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg wrote:
>> > On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
>> >> Hi,
>> >>
>> >> I'm trying to get to the bottom of a problem I'm seeing with a cluster.
>> >> At this stage I'm unclear as to whether the issue is with the config or
>> >> not - the generated error messages seem unclear.  So I'm not sure
>> >> whether I should be staring at the config or the source code at this
>> >> point, and would appreciate a clue!
>> >>
>> >> I'm running with some of the (live) resources in an unmanaged state
>> >> whilst testing fail-over with other (non-dependent) resources.
>> >>
>> >> The managed resources are a number of OpenVZ virtual machines (each
>> >> comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
>> >> filesystems are on LVM volume groups, and the single LVM PV for each
>> >> volume group resides on a DRBD volume.  There are n virtual machines per
>> >> DRBD volume.
>> >>
>> >> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
>> >> the messages (configuration follows at the end of the email):
>> >
>> > Upgrading to 1.0.12, or 1.1.7, may get you a little further.
>> > It would not solve the "I need to stop that resource first, but I can
>> > not as it is unmanaged" dependency problem you apparently have here.
>>
>> There's really not a lot the cluster can do in this situation, there's
>> a 50% chance of getting it wrong no matter what we do.
>> In the most recent versions we now log as loudly as possible
>> (LOG_CRIT) that we can't shut down because something depends on an
>> unmanaged resource.
>
> That's in fact what I meant ;-)
>
> Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated"
> but "Hey you fool, I cannot do that because you told me not to manage
> that resource, but the other ones depend on it".
>
> Though, you still have to spot that line in the flood...

We're working on that part too :-)

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-30 Thread Lars Ellenberg
On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
> On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg wrote:
> > On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
> >> Hi,
> >>
> >> I'm trying to get to the bottom of a problem I'm seeing with a cluster.
> >> At this stage I'm unclear as to whether the issue is with the config or
> >> not - the generated error messages seem unclear.  So I'm not sure
> >> whether I should be staring at the config or the source code at this
> >> point, and would appreciate a clue!
> >>
> >> I'm running with some of the (live) resources in an unmanaged state
> >> whilst testing fail-over with other (non-dependent) resources.
> >>
> >> The managed resources are a number of OpenVZ virtual machines (each
> >> comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
> >> filesystems are on LVM volume groups, and the single LVM PV for each
> >> volume group resides on a DRBD volume.  There are n virtual machines per
> >> DRBD volume.
> >>
> >> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
> >> the messages (configuration follows at the end of the email):
> >
> > Upgrading to 1.0.12, or 1.1.7, may get you a little further.
> > It would not solve the "I need to stop that resource first, but I can
> > not as it is unmanaged" dependency problem you apparently have here.
> 
> There's really not a lot the cluster can do in this situation, there's
> a 50% chance of getting it wrong no matter what we do.
> In the most recent versions we now log as loudly as possible
> (LOG_CRIT) that we can't shut down because something depends on an
> unmanaged resource.

That's in fact what I meant ;-)

Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated"
but "Hey you fool, I cannot do that because you told me not to manage
that resource, but the other ones depend on it".

Though, you still have to spot that line in the flood...
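
One way to spot such a line in the flood is to filter the log by severity.
What follows is only a minimal sketch, assuming one entry per line in the
layout seen in the logs quoted in this thread ("<date> <host> <daemon>:
[<pid>]: <level>: <function>: <message>"). The names LINE_RE, LOUD and
loud_lines are hypothetical, the set of "loud" levels is an assumption, and
the ERROR sample line is constructed from the subject of this thread rather
than copied from a real log.

```python
import re

# Layout assumed from the log lines quoted in this thread, e.g.
#   "Apr 27 11:06:35 fig pengine: [394]: notice: native_print: ..."
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\s(?P<daemon>[\w-]+):\s"
    r"\[(?P<pid>\d+)\]:\s(?P<level>\w+):\s(?P<func>\w+):\s*(?P<msg>.*)$"
)

# Levels worth surfacing; the exact set is an assumption.
LOUD = {"ERROR", "CRIT", "WARN"}


def loud_lines(lines):
    """Yield (daemon, level, func, msg) for every entry at a loud level."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("level").upper() in LOUD:
            yield m.group("daemon"), m.group("level"), m.group("func"), m.group("msg")


SAMPLE = [
    "Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig is online",
    "Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-FS isnt managed",
    # Constructed line: message text taken from the subject, prefix assumed.
    "Apr 27 11:06:35 fig crmd: [395]: ERROR: te_graph_trigger: Transition failed: terminated",
]

for entry in loud_lines(SAMPLE):
    print(entry)
```

Lines that do not match the assumed layout (for example entries wrapped by a
mail client) are silently skipped, so a real log would need to be unwrapped
first.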

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-29 Thread Andrew Beekhof
On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg wrote:
> On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
>> Hi,
>>
>> I'm trying to get to the bottom of a problem I'm seeing with a cluster.
>> At this stage I'm unclear as to whether the issue is with the config or
>> not - the generated error messages seem unclear.  So I'm not sure
>> whether I should be staring at the config or the source code at this
>> point, and would appreciate a clue!
>>
>> I'm running with some of the (live) resources in an unmanaged state
>> whilst testing fail-over with other (non-dependent) resources.
>>
>> The managed resources are a number of OpenVZ virtual machines (each
>> comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
>> filesystems are on LVM volume groups, and the single LVM PV for each
>> volume group resides on a DRBD volume.  There are n virtual machines per
>> DRBD volume.
>>
>> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
>> the messages (configuration follows at the end of the email):
>
> Upgrading to 1.0.12, or 1.1.7, may get you a little further.
> It would not solve the "I need to stop that resource first, but I can
> not as it is unmanaged" dependency problem you apparently have here.

There's really not a lot the cluster can do in this situation, there's
a 50% chance of getting it wrong no matter what we do.
In the most recent versions we now log as loudly as possible
(LOG_CRIT) that we can't shut down because something depends on an
unmanaged resource.
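
The dependency problem can be illustrated with a small sketch. It is a
simplified model, not Pacemaker's actual policy engine logic: it assumes that
"order A B" means "start A before B" and therefore "stop B before A", and it
reports managed resources that cannot be stopped because a running, unmanaged
resource is ordered after them. The function name blocked_stops and the data
structures are hypothetical; the resource names and states are the ones that
appear in the logs and the constraint quoted in this thread.

```python
from collections import defaultdict


def blocked_stops(orders, unmanaged, running):
    """Map each managed resource to the running, unmanaged resources that
    would have to stop before it.

    orders: iterable of (first, then) pairs, read as "start first before
            then", which implies "stop then before first".
    """
    dependents = defaultdict(set)
    for first, then in orders:
        dependents[first].add(then)

    blocked = {}
    for rsc in list(dependents):
        if rsc in unmanaged:
            continue  # the cluster will not touch it anyway
        # Collect everything ordered after rsc, directly or transitively.
        seen, stack = set(), list(dependents[rsc])
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            stack.extend(dependents.get(dep, ()))
        culprits = sorted(d for d in seen if d in unmanaged and d in running)
        if culprits:
            blocked[rsc] = culprits
    return blocked


# The one order constraint quoted in this thread (with the typo in place).
ORDERS = [("essex02-LVM", "calypso-FS")]
# States as reported by native_print in the quoted logs.
UNMANAGED = {"calypso-FS", "calypso-VE", "essex03-LVM"}
RUNNING = {"essex02-LVM", "essex03-LVM", "calypso-FS", "calypso-VE"}

print(blocked_stops(ORDERS, UNMANAGED, RUNNING))
```

With the mistyped constraint, the managed essex02-LVM ends up behind the
unmanaged calypso-FS, which is the kind of situation described above.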

> I think you simply have some copy'n'paste errors in your constraints,
> calypso should be ordered with essex03, not 02.
>
> May not be the only problem, though.
>
> BTW, LCMC, or rather the cluster resource and constraint graph view
> it presents to you, can help to "just see" this kind of error.
>
> Some more comments inline.
>
>> Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck 
>> Timer (I_PE_CALC) just popped!
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State transition 
>> S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
>> origin=crm_timer_popped ]
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed to 
>> state S_POLICY_ENGINE after C_TIMER_POPPED
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2 cluster 
>> nodes are eligible to run resources.
>> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985: Requesting 
>> the current CIB: S_POLICY_ENGINE
>> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking the 
>> PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of CCM 
>> Quorum: Ignore
>> Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores: 'red' 
>> = -INFINITY, 'yellow' = 0, 'green' = 0
>> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig 
>> is online
>> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node 
>> hazel is online
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
>> essex03-LVM isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
>> calypso-FS isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
>> calypso-VE isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
>> essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master mode 
>> on hazel
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
>> essex03-DRBD:0 isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
>> essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master mode 
>> on hazel
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
>> calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_pr

Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Tim Small
On 27/04/12 20:40, Lars Ellenberg wrote:
>
> Ok, colo calypso with essex03... but then, why ...
>
>   
>> order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
>> 
> Order essex02 with calypso? typo? is this supposed to be essex03?
>
>   

Yes, that seems to have been it - my typo...  Grrr.  That's what happens
when you work until 2am I suppose.

Thanks very much for spotting it.

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.  
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309




Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Lars Ellenberg
On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
> Hi,
> 
> I'm trying to get to the bottom of a problem I'm seeing with a cluster. 
> At this stage I'm unclear as to whether the issue is with the config or
> not - the generated error messages seem unclear.  So I'm not sure
> whether I should be staring at the config or the source code at this
> point, and would appreciate a clue!
> 
> I'm running with some of the (live) resources in an unmanaged state
> whilst testing fail-over with other (non-dependent) resources.
> 
> The managed resources are a number of OpenVZ virtual machines (each
> comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
> filesystems are on LVM volume groups, and the single LVM PV for each
> volume group resides on a DRBD volume.  There are n virtual machines per
> DRBD volume.
> 
> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
> the messages (configuration follows at the end of the email):

Upgrading to 1.0.12, or 1.1.7, may get you a little further.
It would not solve the "I need to stop that resource first, but I can
not as it is unmanaged" dependency problem you apparently have here.

I think you simply have some copy'n'paste errors in your constraints,
calypso should be ordered with essex03, not 02.

May not be the only problem, though.

BTW, LCMC, or rather the cluster resource and constraint graph view
it presents to you, can help to "just see" this kind of error.
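
A copy'n'paste slip like this can also be caught mechanically. The sketch
below is an assumption-laden illustration, not a crm shell feature: it
handles only the simple two-resource forms "colocation <id> <score>: <rsc>
<with-rsc>" and "order <id> <score>: <first> <then>", and it flags an order
constraint whose first resource is not one of the resources its second
resource is colocated with. The function names are hypothetical. The order
line is the one quoted in this thread; the colocation line is a hypothetical
stand-in, since the thread only says that calypso is colocated with essex03.

```python
def parse_constraints(text):
    """Split simple crm-shell-style constraint lines into two lists."""
    colo, order = [], []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # only the simple two-resource form is handled
        kind, cid, _score, a, b = parts
        # Drop any ":role" or ":action" suffix from the resource names.
        a, b = a.split(":")[0], b.split(":")[0]
        if kind == "colocation":
            colo.append((cid, a, b))   # a runs where b runs
        elif kind == "order":
            order.append((cid, a, b))  # a starts before b
    return colo, order


def suspicious_orders(text):
    """Return (order id, first, expected partners) for each mismatch."""
    colo, order = parse_constraints(text)
    partners = {}
    for _cid, rsc, with_rsc in colo:
        partners.setdefault(rsc, set()).add(with_rsc)
    findings = []
    for cid, first, then in order:
        expected = partners.get(then)
        if expected and first not in expected:
            findings.append((cid, first, sorted(expected)))
    return findings


CONFIG = """\
colocation calypso-FS-with-essex03-LVM inf: calypso-FS essex03-LVM
order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
"""

for cid, first, expected in suspicious_orders(CONFIG):
    print(f"{cid}: ordered after {first}, but colocated with {expected}")
```

A check like this only compares names, so it can raise false alarms on
configurations where ordering and colocation intentionally differ.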

Some more comments inline.

> Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck 
> Timer (I_PE_CALC) just popped! 
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State transition 
> S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
> origin=crm_timer_popped ] 
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed to 
> state S_POLICY_ENGINE after C_TIMER_POPPED 
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2 cluster 
> nodes are eligible to run resources. 
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985: Requesting 
> the current CIB: S_POLICY_ENGINE 
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking the 
> PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1 
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of CCM 
> Quorum: Ignore 
> Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores: 'red' = 
> -INFINITY, 'yellow' = 0, 'green' = 0 
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig 
> is online 
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node hazel 
> is online 
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
> essex03-LVM isnt managed 
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
> calypso-FS isnt managed 
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
> calypso-VE isnt managed 
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
> essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master mode 
> on hazel 
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
> essex03-DRBD:0 isnt managed 
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
> essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master mode 
> on hazel 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged) 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged) 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> epione-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> epione-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
> essex02-LVM#011(ocf::hear

Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Tim Small
On 27/04/12 15:00, David Vossel wrote:
>
> I'm betting this transition error is a result of how the un-managed resources 
> are used in the colocation and order constraints with the managed resources.  
> Can you produce a hb_report/crm_report for this?  It isn't obvious (at least 
> to me) what is causing this by looking at the logs.

Hi David,

Thanks for the reply...

I've put up a report at http://buttersideup.com/files/fighazelreport.tbz2

The various *-FS resources are managed, and they can be in a state
whereby they are running on one node, but then if you migrate them to the
other node, all the prerequisites happen, but the -FS resources
themselves don't actually get started.

Cheers,

Tim.



Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread David Vossel
- Original Message -
> From: "Tim Small" 
> To: pacemaker@oss.clusterlabs.org
> Sent: Friday, April 27, 2012 5:31:23 AM
> Subject: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated 
> pacemaker's problem or mine?
> 
> Hi,
> 
> I'm trying to get to the bottom of a problem I'm seeing with a
> cluster.
> At this stage I'm unclear as to whether the issue is with the config
> or
> not - the generated error messages seem unclear.  So I'm not sure
> whether I should be staring at the config or the source code at this
> point, and would appreciate a clue!
> 
> I'm running with some of the (live) resources in an unmanaged state
> whilst testing fail-over with other (non-dependent) resources.
>

I'm betting this transition error is a result of how the un-managed resources 
are used in the colocation and order constraints with the managed resources.  
Can you produce a hb_report/crm_report for this?  It isn't obvious (at least 
to me) what is causing this by looking at the logs.

-- Vossel

> The managed resources are a number of OpenVZ virtual machines (each
> comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
> filesystems are on LVM volume groups, and the single LVM PV for each
> volume group resides on a DRBD volume.  There are n virtual machines
> per
> DRBD volume.
> 
> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some
> of
> the messages (configuration follows at the end of the email):
> 
> 
> Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine
> Recheck
> Timer (I_PE_CALC) just popped!
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition:
> Progressed
> to state S_POLICY_ENGINE after C_TIMER_POPPED
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2
> cluster nodes are eligible to run resources.
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985:
> Requesting the current CIB: S_POLICY_ENGINE
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback:
> Invoking
> the PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores:
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status:
> Node
> fig is online
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status:
> Node
> hazel is online
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running:
> resource
> essex03-LVM isnt managed
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running:
> resource
> calypso-FS isnt managed
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running:
> resource
> calypso-VE isnt managed
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation
> essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in
> master
> mode on hazel
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running:
> resource
> essex03-DRBD:0 isnt managed
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation
> essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in
> master
> mode on hazel
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel
> (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
> epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> A

[Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Tim Small
Hi,

I'm trying to get to the bottom of a problem I'm seeing with a cluster. 
At this stage I'm unclear as to whether the issue is with the config or
not - the generated error messages seem unclear.  So I'm not sure
whether I should be staring at the config or the source code at this
point, and would appreciate a clue!

I'm running with some of the (live) resources in an unmanaged state
whilst testing fail-over with other (non-dependent) resources.

The managed resources are a number of OpenVZ virtual machines (each
comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
filesystems are on LVM volume groups, and the single LVM PV for each
volume group resides on a DRBD volume.  There are n virtual machines per
DRBD volume.

I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
the messages (configuration follows at the end of the email):


Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck
Timer (I_PE_CALC) just popped!
Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed
to state S_POLICY_ENGINE after C_TIMER_POPPED
Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2
cluster nodes are eligible to run resources.
Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985:
Requesting the current CIB: S_POLICY_ENGINE
Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking
the PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1
Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of
CCM Quorum: Ignore
Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node
fig is online
Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node
hazel is online
Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource
essex03-LVM isnt managed
Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource
calypso-FS isnt managed
Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource
calypso-VE isnt managed
Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation
essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master
mode on hazel
Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource
essex03-DRBD:0 isnt managed
Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation
essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master
mode on hazel
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
epione-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
epione-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
essex02-LVM#011(ocf::heartbeat:LVM):#011Started hazel
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
essex03-LVM#011(ocf::heartbeat:LVM):#011Started hazel (unmanaged)
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
essextest-FS#011(ocf::heartbeat:Filesystem):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
essextest-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
essextest-VE#011(ocf::heartbeat:ManageVE):#011Stopped
Apr 27 11:06:35 fig pengine: [394]: notice: clone_print:  Master/Slave
Set: ms-drbd-essex02
Apr 27 11:06:35 fig pengine: [394]: notice: short_print:  Masters: [
hazel ]
Apr 27 11:06:35 fig pengine: [394]: notice: short_print:  Slaves: [
fig ]
Apr 27 11:06:35 fig pengine: [394]: notice: clone_pri