On Tue, Jun 29, 2010 at 1:05 AM, Eliot Gable <ega...@broadvox.com> wrote:
> As an update on this: I modified my monitoring action and it now determines 
> master/slave status at the moment the action triggers. This fixed the light 
> monitoring for all cases I have been able to test thus far. However, it still 
> causes a problem on the medium monitoring. The interval on the medium 
> monitoring when it is a slave is 11,000 ms. When it is a master, it is 7,000 
> ms. I am getting cases where the medium monitoring starts with the 11,000 ms 
> interval at nearly the same time that a promote occurs. The monitoring task 
> is suspended or delayed just long enough for the promote to run code that 
> causes the monitoring task to flag the state as MASTER. Since the monitoring 
> task was started while the resource was a slave, the system expects it to 
> return slave status; it returns master status instead, so the node is marked 
> as failed and the system fails over again. Determining the state of the 
> system (master or slave) is the very first thing I do in the monitor action, 
> yet it can still be delayed until after the promote.

That's strange.
Can you show us your current config?

>
> Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2873: operation 
> monitor[407] on ocf::frs::frs:0 for client 2937, its parameters: 
> profile_start_delay=[45] CRM_meta_notify_active_resource=[ ] 
> CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_inactive_resource=[frs:0 ] 
> ips=[ ] CRM_meta_notify_start_resource=[frs:0 ] OCF_CHECK_LEVEL=[10] 
> CRM_meta_notify_active_uname=[ ] fsroot=[] managetype=[screen] 
> CRM_meta_globally_unique=[false] stop_delay=[1 for rsc is already running.
> Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2883: postponing all 
> ops on resource frs:0 by 1000 ms
> Jun 28 09:50:29 node-2 frs[907]: INFO: Monitor medium returned: OCF_SUCCESS 
> for SLAVE status
> Jun 28 09:50:29 node-2 frs[1262]: INFO: Light monitoring of the frs system 
> completed: RUNNING_SLAVE
> Jun 28 09:50:32 node-2 crm_resource: [1554]: info: Invoked: crm_resource 
> --meta -r frs-MS -p target-role -v Master
> Jun 28 09:50:33 node-2 lrmd: [2934]: info: rsc:frs:0:533: promote
> <snip>
> ...
> </snip>
> Jun 28 09:50:42 node-2 frs[2567]: INFO: Monitor medium returned: 
> OCF_RUNNING_MASTER for MASTER status
> Jun 28 09:50:42 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
> frs:0_monitor_11000 (call=407, rc=8, cib-update=694, confirmed=false) master
> Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_local_callback: Expanded 
> fail-count-frs:0=value++ to 19
> Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-frs:0 (19)
> Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
> 1366: fail-count-frs:0=19
> Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-frs:0 (1277736969)
> Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
> 1368: last-failure-frs:0=1277736969
> Jun 28 09:50:42 node-2 frs[2991]: INFO: Light monitoring of the frs system 
> completed: RUNNING_MASTER
>
> The only real fix I can see is to alter the way lrmd/crmd work so that the 
> returned status is compared against the cluster's current expected state, 
> not the state that existed when the monitor action was launched. Does anyone 
> have any other suggestions I can try?
>
>
> -----Original Message-----
> From: Eliot Gable [mailto:ega...@broadvox.com]
> Sent: Friday, June 25, 2010 3:14 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> I modified my resource to set "migration-threshold=1" and 
> "failure-timeout=5s" (the exact commands are shown after the list below). 
> Now the resource finally switches to Master on the slave node when the 
> original master fails. However, shortly after it switches to Master, it 
> reports FAILED_MASTER and fails back over. Looking at the logs, I see the 
> following sequence of events:
>
> A monitor medium action is started on node-1 as a Slave
> Demote occurs on node-2 as a Master
> Stop occurs on node-2 as a Master
> Promote occurs on node-1
> Promote finishes on node-1 and it is now a Master
> The monitor medium action finishes and returns RUNNING_MASTER, even though it 
> was started as a slave. It gets reported as a monitoring failure, and node-1 
> gets marked as FAILED_MASTER.
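>
> For reference, this is roughly how I set those meta attributes (a sketch; 
> the crm_resource --meta pattern is the same one that shows up in my logs, 
> though I may equally have used the crm shell):
>
>     crm_resource --meta -r frs-MS -p migration-threshold -v 1
>     crm_resource --meta -r frs-MS -p failure-timeout -v 5s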
>
> Here is the tail end of these actions:
>
> Jun 25 11:06:34 node-1 frs[29700]: INFO: Light monitoring of the frs system 
> completed: RUNNING_MASTER
> Jun 25 11:06:34 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
> frs:1_monitor_11000 (call=396, rc=8, cib-update=1146, confirmed=false)
> master
> Jun 25 11:06:34 node-1 crmd: [3005]: info: process_graph_event: Detected 
> action frs:1_monitor_11000 from a different transition: 185 vs. 255
> Jun 25 11:06:34 node-1 crmd: [3005]: info: abort_transition_graph: 
> process_graph_event:462 - Triggered transition abort (complete=1, 
> tag=lrm_rsc_op, id=FreeSWITCH:1_monitor_11000, 
> magic=0:8;49:185:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.476.273) : Old 
> event
> Jun 25 11:06:34 node-1 crmd: [3005]: WARN: update_failcount: Updating 
> failcount for frs:1 on node-1 after failed monitor: rc=8 (update=value++,
> time=1277478394)
> Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_local_callback: Expanded 
> fail-count-frs:1=value++ to 6
> Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: fail-count-frs:1 (6)
> Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 
> 1790: fail-count-frs:1=6
> Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-frs:1 (1277478394)
> Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 
> 1792: last-failure-frs:1=1277478394
>
> The odd part is that the system clearly recognizes the event came from a 
> previous transition and aborts it, yet it still counts it as a failure.
>
> I can probably fix this in my resource agent by determining the expected 
> state of the node (master or slave) right at the start, before performing 
> the other monitoring actions. Because those actions take some time to 
> complete, they currently delay the determination of master/slave status 
> until after the node has switched to Master. Arguably, though, the agent is 
> doing the correct thing right now: it reports the state as it exists at the 
> moment the monitor action returns.
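>
> As a rough sketch of that fix (frs_is_master and frs_deep_checks are 
> placeholders for my helpers, not the agent's real code; the OCF_* return 
> variables come from the usual ocf-shellfuncs include):
>
>     frs_monitor_medium() {
>         # Sample the role before any slow checks, so a promote that lands
>         # mid-monitor cannot change what this op reports back to lrmd.
>         local role
>         if frs_is_master; then role=master; else role=slave; fi
>
>         # The slow checks (processes, network, ...) run only after the
>         # role has been fixed.
>         frs_deep_checks || return $OCF_ERR_GENERIC
>
>         if [ "$role" = master ]; then
>             return $OCF_RUNNING_MASTER
>         fi
>         return $OCF_SUCCESS
>     }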
>
>
> -----Original Message-----
> From: Eliot Gable [mailto:ega...@broadvox.com]
> Sent: Friday, June 25, 2010 1:45 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> When I issue the 'ip addr flush eth1' command on the Master node (node-2), it 
> detects the failure of network resources that my resource agent monitors. 
> Then I get alerts from my RA for the following actions:
>
> frs ERROR on node-2 at Fri Jun 25 08:14:44 2010 EDT: 192.168.3.4/24 is in a 
> failed state on eth1 on Master node.
>
> frs Master Node Failure Alert on node-2 at Fri Jun 25 08:14:44 2010 EDT: IP 
> Address 192.168.3.4/24 no longer exists on eth1. Furthermore, eth1 has ARP 
> enabled, so it looks like node-2 is supposed to be the master node.
>
> frs Demote Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
> frs Demote Completed Successfully on node-2 at Fri Jun 25 08:14:46 2010 EDT
> frs Stop Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
> frs Stop Completed Successfully on node-2 at Fri Jun 25 08:14:51 2010 EDT
> frs Start Action Taken on node-2 at Fri Jun 25 08:14:52 2010 EDT
> frs Start Completed Successfully on node-2 at Fri Jun 25 08:15:19 2010 EDT
> frs Promote Action Taken on node-2 at Fri Jun 25 08:15:20 2010 EDT
> frs Promote Completed Successfully on node-2 at Fri Jun 25 08:15:21 2010 EDT
>
> During the failure, if I dump the CIB, I see this:
>
>      <transient_attributes id="node-2">
>        <instance_attributes id="status-node-2">
>          <nvpair id="status-node-2-probe_complete" name="probe_complete" 
> value="true"/>
>          <nvpair id="status-node-2-fail-count-frs:0" name="fail-count-frs:0" 
> value="5"/>
>          <nvpair id="status-node-2-last-failure-frs:0" 
> name="last-failure-frs:0" value="1277472008"/>
>        </instance_attributes>
>      </transient_attributes>
>    </node_state>
>    <node_state uname="node-1" ha="active" in_ccm="true" crmd="online" 
> join="member" shutdown="0" expected="member" id="node-1" crm-debug-or
> igin="do_update_resource">
>      <transient_attributes id="node-1">
>        <instance_attributes id="status-node-1">
>          <nvpair id="status-node-1-probe_complete" name="probe_complete" 
> value="true"/>
>          <nvpair id="status-node-1-last-failure-frs:1" 
> name="last-failure-frs:1" value="1277469246"/>
>          <nvpair id="status-node-1-fail-count-frs:1" name="fail-count-frs:1" 
> value="8"/>
>          <nvpair id="status-node-1-master-frs" name="master-frs" value="100"/>
>        </instance_attributes>
>      </transient_attributes>
>
> This indicates that the $CRM_MASTER -D really is deleting the master-frs 
> attribute on node-2. I definitely see it there with the same value of 100 
> when the node is active. The node-1 resource remains a Slave the entire 
> time.
>
> I have no constraints or order commands in place that might be blocking it. 
> In the log file, during the failover, I see this:
>
>
>
> Jun 25 08:26:02 node-2 lrmd: [2934]: info: rsc:frs:0:91: stop
> Jun 25 08:26:02 node-2 crm_attribute: [27185]: info: Invoked: crm_attribute 
> -N node-2 -n master-frs -l reboot -D
> Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: master-frs (<null>)
> Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 
> 213: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
> Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 
> 215: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
> Jun 25 08:26:06 node-2 frs[27178]: INFO: Done stopping.
> Jun 25 08:26:06 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
> frs:0_stop_0 (call=91, rc=0, cib-update=111, confirmed=true) ok
> Jun 25 08:26:07 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
> key=41:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_start_0 )
> Jun 25 08:26:07 node-2 lrmd: [2934]: info: rsc:frs:0:92: start
> Jun 25 08:26:34 node-2 crm_attribute: [30836]: info: Invoked: crm_attribute 
> -N node-2 -n master-frs -l reboot -v 100
> Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_trigger_update: Sending 
> flush op to all hosts for: master-frs (100)
> Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
> 218: master-frs=100
> Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
> frs:0_start_0 (call=92, rc=0, cib-update=112, confirmed=true) ok
> Jun 25 08:26:34 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
> key=100:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
> Jun 25 08:26:34 node-2 lrmd: [2934]: info: rsc:frs:0:93: notify
> Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
> frs:0_notify_0 (call=93, rc=0, cib-update=113, confirmed=true) ok
> Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
> key=106:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
> Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:94: notify
> Jun 25 08:26:36 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
> frs:0_notify_0 (call=94, rc=0, cib-update=114, confirmed=true) ok
> Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
> key=42:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_promote_0 )
> Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:95: promote
>
> On the other node, I see this:
>
> Jun 25 09:35:12 node-1 frs[2431]: INFO: Light monitoring of the frs system 
> completed: RUNNING_SLAVE
> Jun 25 09:35:14 node-1 frs[2791]: INFO: Light monitoring of the frs system 
> completed: RUNNING_SLAVE
> Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action 
> frs:0_monitor_7000 arrived after a completed transition
> Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
> process_graph_event:467 - Triggered transition abort (complete=1, 
> tag=lrm_rsc_op, id=frs:0_monitor_7000, 
> magic=0:9;44:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.100) : 
> Inactive graph
> Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating 
> failcount for frs:0 on node-2 after failed monitor: rc=9 (update=value++, 
> time=1277472916)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: State 
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
> origin=abort_transition_graph ]
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: All 2 cluster 
> nodes are eligible to run resources.
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 698: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
> te_update_diff:146 - Triggered transition abort (complete=1, 
> tag=transient_attributes, id=node-2, magic=NA, cib=0.468.101) : Transient 
> attribute: update
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 699: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
> te_update_diff:146 - Triggered transition abort (complete=1, 
> tag=transient_attributes, id=node-2, magic=NA, cib=0.468.102) : Transient 
> attribute: update
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 700: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking 
> the PE: query=700, ref=pe_calc-dc-1277472916-541, seq=3616, quorate=1
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Jun 25 09:35:16 node-1 pengine: [3004]: info: unpack_config: Node scores: 
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-2 is online
> Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:0_monitor_7000 on node-2: master (failed) (9)
> Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-1 is online
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Apache:0 on node-1 to Apache:1
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> pgpool-II:0 on node-1 to pgpool-II:1
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Postfix:0 on node-1 to Postfix:1
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> MySQL:0 on node-1 to MySQL:1
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> frs:0 on node-1 to frs:1
> Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:1_monitor_3000 on node-1: unknown error (1)
> Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Apache-Clone
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> pgpool-II-Clone
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> MySQL-Clone
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Master/Slave 
> Set: frs-MS
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: native_print:      frs:0      
> (ocf::broadvox:frs):     Master node-2 FAILED
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
> node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Postfix-Clone
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> PostgreSQL-Clone
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 8 times on node-2
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999992 more times on node-2 before being forced off
> Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 8 times on node-1
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999992 more times on node-1 before being forced off
> Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
> (Master node-2)
> Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 
> 1 instances of a possible 1 to master
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:0  (Started node-2)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action 
> frs:0_monitor_3000 arrived after a completed transition
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:1  (Started node-1)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
> process_graph_event:467 - Triggered transition abort (complete=1, 
> tag=lrm_rsc_op, id=frs:0_monitor_3000, 
> magic=0:9;43:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.103) : 
> Inactive graph
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:0       (Started node-2)
> Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating 
> failcount for frs:0 on node-2 after failed monitor: rc=9 (update=value++, 
> time=1277472916)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:1       (Started node-1)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 701: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:0   (Started node-2)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:1   (Started node-1)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Recover resource 
> frs:0    (Master node-2)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking 
> the PE: query=701, ref=pe_calc-dc-1277472916-542, seq=3616, quorate=1
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> frs:1      (Slave node-1)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
> te_update_diff:146 - Triggered transition abort (complete=1, 
> tag=transient_attributes, id=node-2, magic=NA, cib=0.468.104) : Transient 
> attribute: update
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:0 (Started node-2)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 702: 
> Requesting the current CIB: S_POLICY_ENGINE
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:1 (Started node-1)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:0      (Started node-2)
> Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:1      (Started node-1)
> Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking 
> the PE: query=702, ref=pe_calc-dc-1277472916-543, seq=3616, quorate=1
> Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc 
> calculation pe_calc-dc-1277472916-541 is obsolete
> Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 
> 119: PEngine Input stored in: /var/lib/pengine/pe-input-633.bz2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-2 is online
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:0_monitor_3000 on node-2: master (failed) (9)
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:0_monitor_7000 on node-2: master (failed) (9)
> Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-1 is online
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Apache:0 on node-1 to Apache:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> pgpool-II:0 on node-1 to pgpool-II:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Postfix:0 on node-1 to Postfix:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> MySQL:0 on node-1 to MySQL:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> frs:0 on node-1 to frs:1
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:1_monitor_3000 on node-1: unknown error (1)
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Apache-Clone
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> pgpool-II-Clone
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> MySQL-Clone
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Master/Slave 
> Set: frs-MS
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: native_print:      frs:0      
> (ocf::broadvox:frs):     Master node-2 FAILED
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
> node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Postfix-Clone
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> PostgreSQL-Clone
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 8 times on node-2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999992 more times on node-2 before being forced off
> Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 8 times on node-1
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999992 more times on node-1 before being forced off
> Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
> (Master node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 
> 1 instances of a possible 1 to master
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:0  (Started node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:1  (Started node-1)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:0       (Started node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:1       (Started node-1)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:0   (Started node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:1   (Started node-1)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Recover resource 
> frs:0    (Master node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> frs:1      (Slave node-1)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:0 (Started node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:1 (Started node-1)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:0      (Started node-2)
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:1      (Started node-1)
> Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc 
> calculation pe_calc-dc-1277472916-542 is obsolete
> Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 
> 120: PEngine Input stored in: /var/lib/pengine/pe-input-634.bz2
> Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 
> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-2 is online
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:0_monitor_3000 on node-2: master (failed) (9)
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:0_monitor_7000 on node-2: master (failed) (9)
> Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
> node-1 is online
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Apache:0 on node-1 to Apache:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> pgpool-II:0 on node-1 to pgpool-II:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> Postfix:0 on node-1 to Postfix:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> MySQL:0 on node-1 to MySQL:1
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> frs:0 on node-1 to frs:1
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:1_monitor_3000 on node-1: unknown error (1)
> Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
> PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Apache-Clone
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> pgpool-II-Clone
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> MySQL-Clone
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Master/Slave 
> Set: frs-MS
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: native_print:      frs:0      
> (ocf::broadvox:frs):     Master node-2 FAILED
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
> node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> Postfix-Clone
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
> PostgreSQL-Clone
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
> node-2 node-1 ]
> Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 9 times on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999991 more times on node-2 before being forced off
> Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has 
> failed 8 times on node-1
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: 
> frs-MS can fail 999992 more times on node-1 before being forced off
> Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
> (Master node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 
> 1 instances of a possible 1 to master
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (3s) for frs:0 on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
> monitor (7s) for frs:0 on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:0  (Started node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Apache:1  (Started node-1)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:0       (Started node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> pgpool-II:1       (Started node-1)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:0   (Started node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> MySQL:1   (Started node-1)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Recover resource 
> frs:0    (Master node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> frs:1      (Slave node-1)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:0 (Started node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> Postfix:1 (Started node-1)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:0      (Started node-2)
> Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
> PostgreSQL:1      (Started node-1)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: do_state_transition: State 
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
> cause=C_IPC_MESSAGE origin=handle_response ]
> Jun 25 09:35:18 node-1 crmd: [3005]: info: unpack_graph: Unpacked transition 
> 121: 45 actions in 45 synapses
> Jun 25 09:35:18 node-1 crmd: [3005]: info: do_te_invoke: Processing graph 121 
> (ref=pe_calc-dc-1277472916-543) derived from /var/lib/pengine/pe-input-635.bz2
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 70 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 108: notify frs:0_pre_notify_demote_0 on node-2
> Jun 25 09:35:18 node-1 pengine: [3004]: info: process_pe_message: Transition 
> 121: PEngine Input stored in: /var/lib/pengine/pe-input-635.bz2
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 110: notify frs:1_pre_notify_demote_0 on node-1 (local)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
> key=110:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
> Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:313: notify
> Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:0_pre_notify_demote_0 (108) confirmed on node-2 (rc=0)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
> frs:1_notify_0 (call=313, rc=0, cib-update=703, confirmed=true) ok
> Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:1_pre_notify_demote_0 (110) confirmed on node-1 (rc=0)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 71 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 68 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 44: demote frs:0_demote_0 on node-2
> Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:0_demote_0 (44) confirmed on node-2 (rc=0)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 69 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 72 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 109: notify frs:0_post_notify_demote_0 on node-2
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 111: notify frs:1_post_notify_demote_0 on node-1 (local)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
> key=111:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
> Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:314: notify
> Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
> frs:1_notify_0 (call=314, rc=0, cib-update=704, confirmed=true) ok
> Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:1_post_notify_demote_0 (111) confirmed on node-1 (rc=0)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:0_post_notify_demote_0 (109) confirmed on node-2 (rc=0)
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 73 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 58 
> fired and confirmed
> Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 101: notify frs:0_pre_notify_stop_0 on node-2
> Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 102: notify frs:1_pre_notify_stop_0 on node-1 (local)
> Jun 25 09:35:19 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
> key=102:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
> Jun 25 09:35:19 node-1 lrmd: [3002]: info: rsc:frs:1:315: notify
> Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:0_pre_notify_stop_0 (101) confirmed on node-2 (rc=0)
> Jun 25 09:35:19 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
> frs:1_notify_0 (call=315, rc=0, cib-update=705, confirmed=true) ok
> Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action 
> frs:1_pre_notify_stop_0 (102) confirmed on node-1 (rc=0)
> Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 59 
> fired and confirmed
> Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 56 
> fired and confirmed
> Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
> 7: stop frs:0_stop_0 on node-2
> Jun 25 09:35:19 node-1 crmd: [3005]: info: abort_transition_graph: 
> te_update_diff:157 - Triggered transition abort (complete=0, 
> tag=transient_attributes, id=node-2, magic=NA, cib=0.468.112) : Transient 
> attribute: removal
> Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort 
> priority upgraded from 0 to 1000000
> Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort 
> action done superceeded by restart
> Jun 25 09:35:21 node-1 frs[3339]: INFO: Light monitoring of the frs system 
> completed: RUNNING_SLAVE
>
>
>
> I did notice this line:
>
> Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing 
> failed op frs:1_monitor_3000 on node-1: unknown error (1)
>
> Which makes me think this may be related to the failed op from yesterday. 
> However, I have stopped and started the resource several times on node-1 
> since that failed op occurred. Do I need to clear these things out (clean 
> up the resource) each time I start the resource?
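>
> (If so, I assume the crm shell's cleanup is the way to clear them, e.g.:)
>
>     crm resource cleanup frs-MS node-1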
>
>
>
> -----Original Message-----
> From: Eliot Gable [mailto:ega...@broadvox.com]
> Sent: Friday, June 25, 2010 1:08 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> Ok; I'm still not having any luck with this....
>
> In my START action, right before I return $OCF_SUCCESS, I do:
>
> $CRM_MASTER -v 100
>
> Where CRM_MASTER is defined as in the drbd resource with '-l reboot'. In my 
> STOP action, right at the beginning of it, I do:
>
> $CRM_MASTER -D
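>
> (That is, roughly as the drbd agent does it; the definition and helpers 
> below are my paraphrase, not a copy of the real agent:)
>
>     CRM_MASTER="${HA_SBIN_DIR}/crm_master -l reboot"
>
>     frs_start() {
>         # ... bring the service up ...
>         $CRM_MASTER -v 100    # eligible for promotion once running
>         return $OCF_SUCCESS
>     }
>
>     frs_stop() {
>         $CRM_MASTER -D        # drop this node's promotion score first
>         # ... tear the service down ...
>     }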
>
> I copied the new RA to both nodes, stopped the resource on both nodes, 
> started it, promoted it, then caused the monitoring action to fail. I see 
> Pacemaker generating the usual string of actions, including a STOP, so it 
> should be deleting the master preference. However, no failover occurs. The 
> slave still just sits there as a slave. Is there something else I am missing?
>
> Thanks again.
>
>
>
> -----Original Message-----
> From: Eliot Gable [mailto:ega...@broadvox.com]
> Sent: Friday, June 25, 2010 12:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> After looking at the drbd master/slave RA, I think it is now clear. 
> crm_master, being a wrapper for crm_attribute, already specifies everything 
> I need; all I have to add on the command line are a few options such as the 
> lifetime of the attribute modification, the value to set, or whether to 
> delete the attribute.
>
> So, if I delete the attribute when a STOP is issued and keep the attribute's 
> lifetime set to "reboot", it should be sufficient to cause a failover, 
> correct?
>
> Also, I am thinking that in my START action, after I have performed enough 
> monitoring on it to ensure that everything came up correctly, I should at 
> that point issue crm_master again with -v option to set a score for the node 
> so it is a good candidate to become master, correct?
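>
> (In other words, something like this; my reading of the options, not 
> verified against the man page:)
>
>     crm_master -l reboot -D        # in stop: delete the promotion score
>     crm_master -l reboot -v 100    # in start, once healthy: set a score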
>
>
>
> -----Original Message-----
> From: Eliot Gable [mailto:ega...@broadvox.com]
> Sent: Friday, June 25, 2010 12:17 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> Thanks. Should I update my RA to use crm_master when it detects the resource 
> in FAILED_MASTER state, or should I put it in the demote action or something 
> else?
>
> What's the command line needed to "reduce the promotion score"? I looked at 
> the Pacemaker_Explained.pdf document, and while it mentions using crm_master 
> to provide a promotion score, it does not tell me what actual attribute it is 
> that needs to be modified. Is there another command that can print out all 
> available attributes, or a document somewhere that lists them?
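>
> (For my own notes: judging by the crm_attribute calls that crm_master 
> makes, the attribute appears to be master-<resource> with lifetime 
> "reboot", so a manual query/update would presumably look something like 
> this, with the node and resource names from my cluster:)
>
>     crm_attribute -N node-2 -n master-frs -l reboot -G     # query it
>     crm_attribute -N node-2 -n master-frs -l reboot -v 50  # lower it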
>
>
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Friday, June 25, 2010 8:26 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> On Fri, Jun 25, 2010 at 12:43 AM, Eliot Gable <ega...@broadvox.com> wrote:
>> Thanks for pointing that out.
>>
>> I am still having issues with the master/slave resource. When I cause one of 
>> the monitoring actions to fail,
>
> As well as failing, it should also use crm_master to reduce the promotion
> score.
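>
> For example, in the monitor's failure path (a sketch; the exact score
> policy is up to you):
>
>     if ! frs_deep_checks; then
>         # make this node unattractive for promotion before reporting failure
>         crm_master -l reboot -D     # or set a lower value with -v
>         return $OCF_ERR_GENERIC
>     fi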
>
>> the master node gets a DEMOTE, STOP, START, PROMOTE and the slave resource 
>> just sits there. I want to see DEMOTE on the failed master, then PROMOTE on 
>> the slave, then STOP on the failed master, followed by START on the failed 
>> master.
>
> The stop will always happen before the promote, regardless of which
> instance is being promoted.
>
>> How can I achieve this? Is there some sort of constraint or something I can 
>> put in place to make it happen?
>>
>> Thanks again for any insights.
>>
>>
>>
>> -----Original Message-----
>> From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
>> Sent: Thursday, June 24, 2010 12:37 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Master/Slave not failing over
>>
>> Hi,
>>
>> On Thu, Jun 24, 2010 at 12:12:34PM -0400, Eliot Gable wrote:
>>> On another note, I cannot seem to get Pacemaker to monitor the master node. 
>>> It monitors the slave node just fine. These are the operations I have 
>>> defined:
>>>
>>>         op monitor interval="5" timeout="30s" \
>>>         op monitor interval="10" timeout="30s" OCF_CHECK_LEVEL="10" \
>>>         op monitor interval="5" role="Master" timeout="30s" \
>>>         op monitor interval="10" role="Master" timeout="30s" 
>>> OCF_CHECK_LEVEL="10" \
>>>         op start interval="0" timeout="40s" \
>>>         op stop interval="0" timeout="20s"
>>>
>>> Did I do something wrong?
>>
>> Yes, all monitor intervals have to be different. I can't tell exactly
>> what happened without looking at the logs, but you should set something
>> like this:
>>
>>         op monitor interval="6" role="Master" timeout="30s" \
>>         op monitor interval="11" role="Master" timeout="30s" 
>> OCF_CHECK_LEVEL="10" \
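>>
>> That is, the complete operation list would look something like this
>> (keeping your timeouts; only the Master intervals changed):
>>
>>         op monitor interval="5" timeout="30s" \
>>         op monitor interval="10" timeout="30s" OCF_CHECK_LEVEL="10" \
>>         op monitor interval="6" role="Master" timeout="30s" \
>>         op monitor interval="11" role="Master" timeout="30s" OCF_CHECK_LEVEL="10" \
>>         op start interval="0" timeout="40s" \
>>         op stop interval="0" timeout="20s"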
>>
>> Thanks,
>>
>> Dejan
>>
>>>
>>> From: Eliot Gable [mailto:ega...@broadvox.com]
>>> Sent: Thursday, June 24, 2010 11:55 AM
>>> To: The Pacemaker cluster resource manager
>>> Subject: [Pacemaker] Master/Slave not failing over
>>>
>>> I am using the latest CentOS 5.5 packages for pacemaker/corosync. I have a 
>>> master/slave resource up and running, and when I make the master fail, 
>>> instead of immediately promoting the slave, it restarts the failed master 
>>> and re-promotes it back to master. This takes longer than if it would just 
>>> immediately promote the slave. I can understand it waiting for a DEMOTE 
>>> action to succeed on the failed master before it promotes the slave, but 
>>> that is all it should need to do. Is there any way I can change this 
>>> behavior? Am I missing some key point in the process?
>>>
>>>
>>>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
