Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On Mon, Apr 30, 2012 at 10:44 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
> On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
>> On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
>>> On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
>>>> Hi, I'm trying to get to the bottom of a problem I'm seeing with a cluster. At this stage I'm unclear as to whether the issue is with the config or not - the generated error messages seem unclear. So I'm not sure whether I should be staring at the config or the source code at this point, and would appreciate a clue!
>>>>
>>>> I'm running with some of the (live) resources in an unmanaged state whilst testing fail-over with other (non-dependent) resources. The managed resources are a number of OpenVZ virtual machines (each comprising 3 primitives - file-system + OpenVZ VE + SendArp). The filesystems are on LVM volume groups, and the single LVM PV for each volume group resides on a DRBD volume. There are n virtual machines per DRBD volume.
>>>>
>>>> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0. Here are some of the messages (configuration follows at the end of the email):
>>>
>>> Upgrading to 1.0.12, or 1.1.7, may get you a little further. It would not solve the "I need to stop that resource first, but I can not as it is unmanaged" dependency problem you apparently have here.
>>
>> There's really not a lot the cluster can do in this situation, there's a 50% chance of getting it wrong no matter what we do. In the most recent versions we now log as loudly as possible (LOG_CRIT) that we can't shut down because something depends on an unmanaged resource.
>
> That's in fact what I meant ;-)
> Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated" but "Hey you fool, I cannot do that because you told me not to manage that resource, but the other ones depend on it."
> Though, you still have to spot that line in the flood...
We're working on that part too :-)

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
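For anyone stuck on an older version, one way to spot the relevant lines in the flood is to filter the policy engine output for resources it reports as unmanaged. What follows is only a minimal Python sketch, assuming the syslog format seen in the excerpts in this thread and one log entry per line. The function name `unmanaged_resources` and both regular expressions are made up for illustration; they are not part of Pacemaker.

```python
import re

# Matches pengine lines such as
#   "... native_add_running: resource calypso-FS isnt managed"
# Pattern is an assumption derived from the log excerpts in this thread.
ISNT_MANAGED = re.compile(r"native_add_running: resource (\S+) isn'?t managed")

# Matches status lines such as
#   "... native_print: calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)"
# "#011" is the escaped tab character as it appears in these excerpts.
PRINT_UNMANAGED = re.compile(r"native_print: ([^#\s]+)#011.*\(unmanaged\)")


def unmanaged_resources(lines):
    """Return the sorted names of resources the log reports as unmanaged.

    Assumes one log entry per element of `lines`.
    """
    found = set()
    for line in lines:
        for pattern in (ISNT_MANAGED, PRINT_UNMANAGED):
            match = pattern.search(line)
            if match:
                found.add(match.group(1))
    return sorted(found)


# A few entries copied from the excerpts in this thread.
SAMPLE_LOG = [
    "Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-FS isnt managed",
    "Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource essex03-DRBD:0 isnt managed",
    "Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped",
    "Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)",
]

for name in unmanaged_resources(SAMPLE_LOG):
    print(name)
```

This only lists candidates; it does not tell you which managed resources depend on them, so the constraints still need to be checked by hand or in LCMC.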
Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On Mon, Apr 30, 2012 at 01:00:11PM +1000, Andrew Beekhof wrote:
> On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
>> On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
>>> Hi, I'm trying to get to the bottom of a problem I'm seeing with a cluster. At this stage I'm unclear as to whether the issue is with the config or not - the generated error messages seem unclear. So I'm not sure whether I should be staring at the config or the source code at this point, and would appreciate a clue!
>>>
>>> I'm running with some of the (live) resources in an unmanaged state whilst testing fail-over with other (non-dependent) resources. The managed resources are a number of OpenVZ virtual machines (each comprising 3 primitives - file-system + OpenVZ VE + SendArp). The filesystems are on LVM volume groups, and the single LVM PV for each volume group resides on a DRBD volume. There are n virtual machines per DRBD volume.
>>>
>>> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0. Here are some of the messages (configuration follows at the end of the email):
>>
>> Upgrading to 1.0.12, or 1.1.7, may get you a little further. It would not solve the "I need to stop that resource first, but I can not as it is unmanaged" dependency problem you apparently have here.
>
> There's really not a lot the cluster can do in this situation, there's a 50% chance of getting it wrong no matter what we do. In the most recent versions we now log as loudly as possible (LOG_CRIT) that we can't shut down because something depends on an unmanaged resource.

That's in fact what I meant ;-)
Not only the cryptic "ERROR: te_graph_trigger: Transition failed: terminated" but "Hey you fool, I cannot do that because you told me not to manage that resource, but the other ones depend on it."
Though, you still have to spot that line in the flood...
-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On Sat, Apr 28, 2012 at 5:40 AM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
> On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
>> Hi, I'm trying to get to the bottom of a problem I'm seeing with a cluster. At this stage I'm unclear as to whether the issue is with the config or not - the generated error messages seem unclear. So I'm not sure whether I should be staring at the config or the source code at this point, and would appreciate a clue!
>>
>> I'm running with some of the (live) resources in an unmanaged state whilst testing fail-over with other (non-dependent) resources. The managed resources are a number of OpenVZ virtual machines (each comprising 3 primitives - file-system + OpenVZ VE + SendArp). The filesystems are on LVM volume groups, and the single LVM PV for each volume group resides on a DRBD volume. There are n virtual machines per DRBD volume.
>>
>> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0. Here are some of the messages (configuration follows at the end of the email):
>
> Upgrading to 1.0.12, or 1.1.7, may get you a little further. It would not solve the "I need to stop that resource first, but I can not as it is unmanaged" dependency problem you apparently have here.

There's really not a lot the cluster can do in this situation, there's a 50% chance of getting it wrong no matter what we do. In the most recent versions we now log as loudly as possible (LOG_CRIT) that we can't shut down because something depends on an unmanaged resource.

> I think you simply have some copy'n'paste errors in your constraints, calypso should be ordered with essex03, not 02. May not be the only problem, though.
>
> BTW, LCMC, respectively the cluster resource and constraint graph view it presents to you, can help to just see this kind of error.
>
> Some more comments inline.
>
>> Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
>> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
>> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985: Requesting the current CIB: S_POLICY_ENGINE
>> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking the PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of CCM Quorum: Ignore
>> Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig is online
>> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node hazel is online
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource essex03-LVM isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-FS isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-VE isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master mode on hazel
>> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource essex03-DRBD:0 isnt managed
>> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master mode on hazel
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped
>> Apr 27 11:06:35 fig pengine: [394]: notice: native_print:
Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On 27/04/12 15:00, David Vossel wrote:
> I'm betting this transition error is a result of how the unmanaged resources are used in the colocation and order constraints with the managed resources. Can you produce a hb_report/crm_report for this? It isn't obvious (at least to me) what is causing this by looking at the logs.

Hi David,

Thanks for the reply... I've put up a report at http://buttersideup.com/files/fighazelreport.tbz2

The various *-FS resources are managed, and they can be in a state whereby they are running on one node, but if you then migrate them to the other node, all the prerequisites happen, but the -FS resources themselves don't actually get started.

Cheers,
Tim.
Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On Fri, Apr 27, 2012 at 11:31:23AM +0100, Tim Small wrote:
> Hi, I'm trying to get to the bottom of a problem I'm seeing with a cluster. At this stage I'm unclear as to whether the issue is with the config or not - the generated error messages seem unclear. So I'm not sure whether I should be staring at the config or the source code at this point, and would appreciate a clue!
>
> I'm running with some of the (live) resources in an unmanaged state whilst testing fail-over with other (non-dependent) resources. The managed resources are a number of OpenVZ virtual machines (each comprising 3 primitives - file-system + OpenVZ VE + SendArp). The filesystems are on LVM volume groups, and the single LVM PV for each volume group resides on a DRBD volume. There are n virtual machines per DRBD volume.
>
> I'm running pacemaker 1.0.9.1+hg15626-1 on Debian 6.0. Here are some of the messages (configuration follows at the end of the email):

Upgrading to 1.0.12, or 1.1.7, may get you a little further. It would not solve the "I need to stop that resource first, but I can not as it is unmanaged" dependency problem you apparently have here.

I think you simply have some copy'n'paste errors in your constraints, calypso should be ordered with essex03, not 02. May not be the only problem, though.

BTW, LCMC, respectively the cluster resource and constraint graph view it presents to you, can help to just see this kind of error.

Some more comments inline.

> Apr 27 11:06:35 fig crmd: [395]: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped!
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
> Apr 27 11:06:35 fig crmd: [395]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke: Query 985: Requesting the current CIB: S_POLICY_ENGINE
> Apr 27 11:06:35 fig crmd: [395]: info: do_pe_invoke_callback: Invoking the PE: query=985, ref=pe_calc-dc-1335521195-1437, seq=184, quorate=1
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Apr 27 11:06:35 fig pengine: [394]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node fig is online
> Apr 27 11:06:35 fig pengine: [394]: info: determine_online_status: Node hazel is online
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource essex03-LVM isnt managed
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-FS isnt managed
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource calypso-VE isnt managed
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation essex03-DRBD:0_monitor_0 found resource essex03-DRBD:0 active in master mode on hazel
> Apr 27 11:06:35 fig pengine: [394]: info: native_add_running: resource essex03-DRBD:0 isnt managed
> Apr 27 11:06:35 fig pengine: [394]: notice: unpack_rsc_op: Operation essex02-DRBD:1_monitor_0 found resource essex02-DRBD:1 active in master mode on hazel
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: artemis-VE#011(ocf::heartbeat:ManageVE):#011Stopped (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: athena-VE#011(ocf::heartbeat:ManageVE):#011Stopped (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-FS#011(ocf::heartbeat:Filesystem):#011Started hazel (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: calypso-VE#011(ocf::heartbeat:ManageVE):#011Started hazel (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: epione-FS#011(ocf::heartbeat:Filesystem):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: epione-SendArp#011(ocf::heartbeat:SendArp):#011Stopped
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: epione-VE#011(ocf::heartbeat:ManageVE):#011Stopped (unmanaged)
> Apr 27 11:06:35 fig pengine: [394]: notice: native_print: essex02-LVM#011(ocf::heartbeat:LVM):#011Started hazel
> Apr 27 11:06:35 fig pengine: [394]: notice:
Re: [Pacemaker] ERROR: te_graph_trigger: Transition failed: terminated pacemaker's problem or mine?
On 27/04/12 20:40, Lars Ellenberg wrote:
> Ok, colo calypso with essex03... but then, why ...
>
>   order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
>
> Order essex02 with calypso? typo? is this supposed to be essex03?

Yes, that seems to have been it - my typo... Grrr. That's what happens when you work until 2am I suppose.

Thanks very much for spotting it.

Tim.

-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53
http://seoss.co.uk/
+44-(0)1273-808309
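This kind of copy'n'paste slip can also be caught mechanically. Below is a rough Python sketch, not a Pacemaker tool, that reads simple two-resource `order` and `colocation` lines in crm shell syntax and flags any resource that is ordered after one thing but colocated with another. The `order` line is the one quoted above; the `colocation` line and its id are hypothetical, reconstructed only from the "colo calypso with essex03" remark, and the function names are invented for illustration.

```python
def parse_constraints(config_text):
    """Collect, per dependent resource, what it is ordered after and colocated with.

    Only simple two-resource constraints are handled:
        order <id> <score>: <first> <then>
        colocation <id> <score>: <rsc> <with-rsc>
    Anything else (resource sets, primitives, comments) is skipped.
    """
    ordered_after = {}
    colocated_with = {}
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        kind, _constraint_id, _score, first, second = fields
        # Drop role/action suffixes such as ":Master" or ":promote" if present.
        first, second = first.split(":")[0], second.split(":")[0]
        if kind == "order":
            ordered_after.setdefault(second, set()).add(first)
        elif kind == "colocation":
            colocated_with.setdefault(first, set()).add(second)
    return ordered_after, colocated_with


def mismatched_constraints(config_text):
    """Return (resource, ordered_after, colocated_with) where the two sets differ."""
    ordered_after, colocated_with = parse_constraints(config_text)
    problems = []
    for rsc in sorted(ordered_after):
        if rsc in colocated_with and ordered_after[rsc] != colocated_with[rsc]:
            problems.append(
                (rsc, sorted(ordered_after[rsc]), sorted(colocated_with[rsc]))
            )
    return problems


# The colocation line is a hypothetical stand-in; the order line is from the thread.
SAMPLE_CONFIG = """\
colocation calypso-FS-with-essex03-LVM inf: calypso-FS essex03-LVM
order essex02-lvm-before-calypso-FS inf: essex02-LVM calypso-FS
"""

print(mismatched_constraints(SAMPLE_CONFIG))
```

On the sample it reports calypso-FS as ordered after essex02-LVM while colocated with essex03-LVM. Treat a hit as a hint only: a valid configuration may legitimately order a resource after something it is not colocated with.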