Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-05-01 Thread Andrew Beekhof
On Mon, Apr 30, 2012 at 10:44 PM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
 On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
  On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
  Hi,
 
  I'm trying to get to the bottom of a problem I'm seeing with a cluster.
  At this stage I'm unclear as to whether the issue is with the config or
  not - the generated error messages seem unclear.  So I'm not sure
  whether I should be staring at the config or the source code at this
  point, and would appreciate a clue!
 
  I'm running with some of the (live) resources in an unmanaged state
  whilst testing fail-over with other (non-dependant) resources.
 
  The managed resources are a number of OpenVZ virtual machines (each
  comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
  filesystems are on LVM volume groups, and the single LVM PV for each
  volume group resides on a DRBD volume.  There are n virtual machines per
  DRBD volume.
 
  I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
  the messages (configuration follows at the end of the email):
 
  Upgrading to 1.0.12, or 1.1.7, may get you a little further.
  It would not solve the "I need to stop that resource first, but I can
  not as it is unmanaged" dependency problem you apparently have here.

 There's really not a lot the cluster can do in this situation, there's
 a 50% chance of getting it wrong no matter what we do.
 In the most recent versions we now log as loudly as possible
 (LOG_CRIT) that we can't shut down because something depends on an
 unmanaged resource.

 That's in fact what I meant ;-)

 Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated"
 but "Hey you fool, I cannot do that because you told me not to manage
 that resource, but the other ones depend on it."

 Though, you still have to spot that line in the flood...

We're working on that part too :-)
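
Until then, one way to spot the loud lines in the flood is to filter the log by severity. What follows is a minimal Python sketch, not anything Pacemaker ships. It assumes each log entry sits on a single line in the layout seen in the excerpts quoted in this thread (timestamp, host, daemon, pid, severity, function, message). The names `LOG_LINE`, `LOUD` and `loud_lines` are hypothetical, and the exact spelling of the severity tags (`CRIT`, `ERROR`) is an assumption, so the comparison is case-insensitive.

```python
import re

# Assumed single-line layout, based on the excerpts quoted in this thread:
#   "<Mon DD HH:MM:SS> <host> <daemon>: [<pid>]: <severity>: <function>: <message>"
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<daemon>\S+): \[(?P<pid>\d+)\]: (?P<severity>\w+): "
    r"(?P<function>\w+): (?P<message>.*)$"
)

# Hypothetical choice of severities worth surfacing; adjust as needed.
LOUD = {"CRIT", "ERROR"}


def loud_lines(lines, severities=LOUD):
    """Yield (function, message) for log lines whose severity is in `severities`."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        # Lines that do not fit the assumed layout are skipped silently.
        if match and match.group("severity").upper() in severities:
            yield match.group("function"), match.group("message")


if __name__ == "__main__":
    import sys

    # Usage: python loud_lines.py < /path/to/logfile
    for function, message in loud_lines(sys.stdin):
        print(f"{function}: {message}")
```

Entries that were wrapped across several lines (as in the quoted excerpts) would have to be rejoined before filtering.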

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-30 Thread Lars Ellenberg
On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
 On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
  On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
  Hi,
 
  I'm trying to get to the bottom of a problem I'm seeing with a cluster.
  At this stage I'm unclear as to whether the issue is with the config or
  not - the generated error messages seem unclear.  So I'm not sure
  whether I should be staring at the config or the source code at this
  point, and would appreciate a clue!
 
  I'm running with some of the (live) resources in an unmanaged state
  whilst testing fail-over with other (non-dependant) resources.
 
  The managed resources are a number of OpenVZ virtual machines (each
  comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
  filesystems are on LVM volume groups, and the single LVM PV for each
  volume group resides on a DRBD volume.  There are n virtual machines per
  DRBD volume.
 
  I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
  the messages (configuration follows at the end of the email):
 
  Upgrading to 1.0.12, or 1.1.7, may get you a little further.
  It would not solve the "I need to stop that resource first, but I can
  not as it is unmanaged" dependency problem you apparently have here.
 
 There's really not a lot the cluster can do in this situation, there's
 a 50% chance of getting it wrong no matter what we do.
 In the most recent versions we now log as loudly as possible
 (LOG_CRIT) that we can't shut down because something depends on an
 unmanaged resource.

That's in fact what I meant ;-)

Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated"
but "Hey you fool, I cannot do that because you told me not to manage
that resource, but the other ones depend on it."

Though, you still have to spot that line in the flood...

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-29 Thread Andrew Beekhof
On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg
lars.ellenb...@linbit.com wrote:
 On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
 Hi,

 I'm trying to get to the bottom of a problem I'm seeing with a cluster.
 At this stage I'm unclear as to whether the issue is with the config or
 not - the generated error messages seem unclear.  So I'm not sure
 whether I should be staring at the config or the source code at this
 point, and would appreciate a clue!

 I'm running with some of the (live) resources in an unmanaged state
 whilst testing fail-over with other (non-dependant) resources.

 The managed resources are a number of OpenVZ virtual machines (each
 comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
 filesystems are on LVM volume groups, and the single LVM PV for each
 volume group resides on a DRBD volume.  There are n virtual machines per
 DRBD volume.

 I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
 the messages (configuration follows at the end of the email):

 Upgrading to 1.0.12, or 1.1.7, may get you a little further.
 It would not solve the "I need to stop that resource first, but I can
 not as it is unmanaged" dependency problem you apparently have here.

There's really not a lot the cluster can do in this situation, there's
a 50% chance of getting it wrong no matter what we do.
In the most recent versions we now log as loudly as possible
(LOG_CRIT) that we can't shut down because something depends on an
unmanaged resource.


Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Tim Small
On 27/04/12 15:00, David Vossel wrote:

 I'm betting this transition error is a result of how the unmanaged resources 
 are used in the colocation and order constraints with the managed resources.  
 Can you produce an hb_report/crm_report for this?  It isn't obvious (at least 
 to me) what is causing this by looking at the logs.

Hi David,

Thanks for the reply...

I've put up a report at http://buttersideup.com/files/fighazelreport.tbz2

The various *-FS resources are managed. They can be running on one
node, but if you then migrate them to the other node, all the
prerequisites happen, yet the -FS resources themselves don't actually
get started.

Cheers,

Tim.



Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Lars Ellenberg
On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
 Hi,
 
 I'm trying to get to the bottom of a problem I'm seeing with a cluster. 
 At this stage I'm unclear as to whether the issue is with the config or
 not - the generated error messages seem unclear.  So I'm not sure
 whether I should be staring at the config or the source code at this
 point, and would appreciate a clue!
 
 I'm running with some of the (live) resources in an unmanaged state
 whilst testing fail-over with other (non-dependant) resources.
 
 The managed resources are a number of OpenVZ virtual machines (each
 comprising 3 primitives - file-system + OpenVZ VE + SendArp).  The
 filesystems are on LVM volume groups, and the single LVM PV for each
 volume group resides on a DRBD volume.  There are n virtual machines per
 DRBD volume.
 
 I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0.  Here are some of
 the messages (configuration follows at the end of the email):

Upgrading to 1.0.12, or 1.1.7, may get you a little further.
It would not solve the "I need to stop that resource first, but I can
not as it is unmanaged" dependency problem you apparently have here.

I think you simply have some copy'n'paste errors in your constraints,
calypso should be ordered with essex03, not 02.

May not be the only problem, though.

BTW, LCMC, or rather the cluster resource and constraint graph view
it presents, can help you spot this kind of error at a glance.

Some more comments inline.

 Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck 
 Timer (I_PE_CALC) just popped! 
 Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State transition 
 S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
 origin=crm_timer_popped ] 
 Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed to 
 state S_POLICY_ENGINE after C_TIMER_POPPED 
 Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2 cluster 
 nodes are eligible to run resources. 
 Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985: Requesting 
 the current CIB: S_POLICY_ENGINE 
 Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking the 
 PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1 
 Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of CCM 
 Quorum: Ignore 
 Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores: 'red' = 
 -INFINITY, 'yellow' = 0, 'green' = 0 
 Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig 
 is online 
 Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node hazel 
 is online 
 Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
 essex03-LVM isnt managed 
 Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
 calypso-FS isnt managed 
 Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
 calypso-VE isnt managed 
 Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
 essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master mode 
 on hazel 
 Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource 
 essex03-DRBD:0 isnt managed 
 Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation 
 essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master mode 
 on hazel 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged) 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged) 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 epione-SendArp#011(ocf::heartbeat:SendArp):#011Stopped 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 epione-VE#011(ocf::heartbeat:ManageVE):#011Stopped  (unmanaged) 
 Apr 27 11:06:35 fig pengine: [394]: notice: native_print: 
 essex02-LVM#011(ocf::heartbeat:LVM):#011Started hazel 
 Apr 27 11:06:35 fig pengine: [394]: notice: 

Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?

2012-04-27 Thread Tim Small
On 27/04/12 20:40, Lars Ellenberg wrote:

 Ok, colo calypso with essex03... but then, why ...

   
 order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
 
 Order essex02 with calypso? typo? is this supposed to be essex03?

   

Yes, that seems to have been it - my typo...  Grrr.  That's what happens
when you work until 2am I suppose.

Thanks very much for spotting it.

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.  
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309
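
The mismatch found in this thread, a resource colocated with one stack but ordered after another, can also be checked mechanically. The following is a minimal Python sketch under stated assumptions, not a Pacemaker or crm shell feature. It assumes the configuration is available as `crm configure show` style text and handles only simple two-resource constraints of the form `colocation <id> <score>: <rsc> <with-rsc>` and `order <id> <score>: <first> <then>`. The function names are hypothetical, and the colocation line in the example is invented for illustration, since the thread only says that calypso is colocated with essex03.

```python
def parse_constraints(config_text):
    """Return (colocations, orders) as lists of (id, rsc_a, rsc_b) tuples."""
    colocations, orders = [], []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] not in ("colocation", "order"):
            continue  # only simple two-resource constraints are handled
        kind, cid, _score, rsc_a, rsc_b = fields
        # Drop an optional ":role" or ":action" suffix such as ":Master".
        rsc_a, rsc_b = rsc_a.split(":")[0], rsc_b.split(":")[0]
        (colocations if kind == "colocation" else orders).append((cid, rsc_a, rsc_b))
    return colocations, orders


def find_mismatched_orders(config_text):
    """Flag order constraints whose 'first' resource is not among the
    resources that the 'then' resource is colocated with."""
    colocations, orders = parse_constraints(config_text)
    partners = {}
    for _cid, rsc, with_rsc in colocations:
        partners.setdefault(rsc, set()).add(with_rsc)
    suspects = []
    for cid, first, then in orders:
        expected = partners.get(then)
        # Resources without any colocation constraint are not judged.
        if expected and first not in expected:
            suspects.append((cid, then, first, sorted(expected)))
    return suspects


if __name__ == "__main__":
    # The colocation line is hypothetical; the order line is the one quoted above.
    example = """
colocation calypso-FS-with-essex03-LVM inf: calypso-FS essex03-LVM
order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
"""
    for cid, then, first, expected in find_mismatched_orders(example):
        print(f"{cid}: {then} is ordered after {first} but colocated with {expected}")
```

A flagged constraint is only a hint: ordering against a resource you are not colocated with can be intentional, so each hit still needs a human look.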

