As an update on this: I modified my monitoring action so that it now determines 
master/slave status at the moment the action triggers. That fixed the light 
monitoring for every case I have been able to test so far, but it still causes a 
problem with the medium monitoring. The medium monitoring interval is 11,000 ms 
when the resource is a slave and 7,000 ms when it is a master. I am getting 
cases where the medium monitoring starts with the 11,000 ms interval at nearly 
the same time that a promote occurs. The monitoring task is suspended or delayed 
just long enough for the promote to execute some code that then causes the 
monitoring task to flag the state as MASTER. Since the monitoring task was 
started as a slave, the system expects it to return slave status; when it 
returns master status instead, the node is marked as failed and the system fails 
over again. Determining the state of the system (master or slave) is the very 
first thing I do in the monitoring section, yet it can still be delayed until 
after the promote.
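
(A rough sketch of the ordering described above, with hypothetical helper names 
standing in for the real RA code and the usual ocf-shellfuncs return codes 
assumed; the comment marks where the delay bites:)

frs_monitor() {
    # Determine master/slave status the moment the action triggers, before
    # any of the slower checks run. The whole process can still be delayed
    # right here, which is the window a promote can slip into.
    local role
    role=$(check_role)                  # hypothetical: prints "master" or "slave"

    # The slower medium-level checks run only after the role is captured.
    run_medium_checks || return $OCF_ERR_GENERIC

    if [ "$role" = "master" ]; then
        return $OCF_RUNNING_MASTER
    fi
    return $OCF_SUCCESS
}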

Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2873: operation 
monitor[407] on ocf::frs::frs:0 for client 2937, its parame
ters: profile_start_delay=[45] CRM_meta_notify_active_resource=[ ] 
CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_inactive_resource=[frs:0 ] 
ips=[ ] CRM_meta_notify_start_resource=[frs:0 ] OCF_CHECK_LEVEL=[10] 
CRM_meta_notify_active_uname=[ ] fsroot=[] managetype=[screen] 
CRM_meta_globally_unique=[false] stop_delay=[1 for rsc is already running.
Jun 28 09:50:28 node-2 lrmd: [2934]: info: perform_op:2883: postponing all ops 
on resource frs:0 by 1000 ms
Jun 28 09:50:29 node-2 frs[907]: INFO: Monitor medium returned: OCF_SUCCESS for 
SLAVE status
Jun 28 09:50:29 node-2 frs[1262]: INFO: Light monitoring of the frs system 
completed: RUNNING_SLAVE
Jun 28 09:50:32 node-2 crm_resource: [1554]: info: Invoked: crm_resource --meta 
-r frs-MS -p target-role -v Master
Jun 28 09:50:33 node-2 lrmd: [2934]: info: rsc:frs:0:533: promote
<snip>
...
</snip>
Jun 28 09:50:42 node-2 frs[2567]: INFO: Monitor medium returned: 
OCF_RUNNING_MASTER for MASTER status
Jun 28 09:50:42 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
frs:0_monitor_11000 (call=407, rc=8, cib-update=694, confirmed=false) master
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_local_callback: Expanded 
fail-count-frs:0=value++ to 19
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush 
op to all hosts for: fail-count-frs:0 (19)
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
1366: fail-count-frs:0=19
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush 
op to all hosts for: last-failure-frs:0 (1277736969)
Jun 28 09:50:42 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
1368: last-failure-frs:0=1277736969
Jun 28 09:50:42 node-2 frs[2991]: INFO: Light monitoring of the frs system 
completed: RUNNING_MASTER

The only real fix I can see is to alter the way lrmd/crmd function so that the 
returned status is checked against the current expected state of the cluster, 
not the state that existed when the monitor action was launched. Does anyone 
have other suggestions I can try?
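
(One possible RA-side workaround, rather than changing lrmd/crmd: because the 
slave and master copies of the medium monitor run at different intervals, the 
interval passed to the operation identifies which role this recurring monitor 
was scheduled for. A fragment that would sit inside the medium monitor, assuming 
OCF_RESKEY_CRM_meta_interval carries that value and reusing the hypothetical 
check_role helper from above:)

scheduled_as="slave"
if [ "${OCF_RESKEY_CRM_meta_interval:-0}" = "7000" ]; then
    scheduled_as="master"
fi

actual=$(check_role)

if [ "$scheduled_as" = "slave" ] && [ "$actual" = "master" ]; then
    # A promote landed while this slave-scheduled monitor was running; the
    # master-role monitors will report the new state, so answer as a healthy
    # slave here rather than letting this stale instance be flagged as a
    # failed master.
    return $OCF_SUCCESS
fi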



Eliot Gable
Senior Product Developer
1228 Euclid Ave, Suite 390
Cleveland, OH 44115

Direct: 216-373-4808
Fax: 216-373-4657
ega...@broadvox.net


CONFIDENTIAL COMMUNICATION.  This e-mail and any files transmitted with it are 
confidential and are intended solely for the use of the individual or entity to 
whom it is addressed. If you are not the intended recipient, please call me 
immediately.  BROADVOX is a registered trademark of Broadvox, LLC.


-----Original Message-----
From: Eliot Gable [mailto:ega...@broadvox.com]
Sent: Friday, June 25, 2010 3:14 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

I modified my resource to set "migration-threshold=1" and "failure-timeout=5s". 
Now the resource is finally switching to Master on the slave node when the 
original master fails. However, shortly after it switches to Master, it reports 
FAILED_MASTER and fails back over. Looking at the logs, I see the following 
events take place:

A monitor medium action is started on node-1 as a Slave
Demote occurs on node-2 as a Master
Stop occurs on node-2 as a Master
Promote occurs on node-1
Promote finishes on node-1 and it is now a Master
The monitor medium action finishes and returns RUNNING_MASTER, even though it 
was started as a slave. It gets reported as a monitoring failure, and node-1 
gets marked as FAILED_MASTER.
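
(For context, the migration-threshold and failure-timeout mentioned above can be 
set with crm_resource --meta, the same interface that shows up for target-role 
in the logs below; whether they belong on the primitive or on the master/slave 
resource depends on the configuration:)

crm_resource --meta -r frs-MS -p migration-threshold -v 1
crm_resource --meta -r frs-MS -p failure-timeout -v 5s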

Here is the tail end of these actions:

Jun 25 11:06:34 node-1 frs[29700]: INFO: Light monitoring of the frs system 
completed: RUNNING_MASTER
Jun 25 11:06:34 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
frs:1_monitor_11000 (call=396, rc=8, cib-update=1146, confirmed=false)
master
Jun 25 11:06:34 node-1 crmd: [3005]: info: process_graph_event: Detected action 
frs:1_monitor_11000 from a different transition: 185 vs. 255
Jun 25 11:06:34 node-1 crmd: [3005]: info: abort_transition_graph: 
process_graph_event:462 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=Fr
eeSWITCH:1_monitor_11000, 
magic=0:8;49:185:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.476.273) : Old 
event
Jun 25 11:06:34 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount 
for frs:1 on node-1 after failed monitor: rc=8 (update=value++,
time=1277478394)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_local_callback: Expanded 
fail-count-frs:1=value++ to 6
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending flush 
op to all hosts for: fail-count-frs:1 (6)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 
1790: fail-count-frs:1=6
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_trigger_update: Sending flush 
op to all hosts for: last-failure-frs:1 (1277478394)
Jun 25 11:06:34 node-1 attrd: [3003]: info: attrd_perform_update: Sent update 
1792: last-failure-frs:1=1277478394

The odd part is that the system clearly recognizes the event came from a 
previous transition and aborts it, yet it still counts it as a failure.

I can probably fix this in my resource agent by determining the current expected 
state of the node (master or slave) right at the start and only then performing 
the other monitoring actions. Because those other checks take some time to 
complete, they currently delay the determination of master/slave status until 
after the node has switched to Master. That said, the agent is arguably behaving 
correctly right now, since it reports the state as it exists at the moment the 
monitor action returns.


-----Original Message-----
From: Eliot Gable [mailto:ega...@broadvox.com]
Sent: Friday, June 25, 2010 1:45 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

When I issue the 'ip addr flush eth1' command on the Master node (node-2), my 
resource agent detects the failure of the network resources it monitors, and I 
get alerts from the RA for the following actions:

frs ERROR on node-2 at Fri Jun 25 08:14:44 2010 EDT: 192.168.3.4/24 is in a 
failed state on eth1 on Master node.

frs Master Node Failure Alert on node-2 at Fri Jun 25 08:14:44 2010 EDT: IP 
Address 192.168.3.4/24 no longer exists on eth1. Furthermore, eth1 has ARP 
enabled, so it looks like node-2 is supposed to be the master node.

frs Demote Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Demote Completed Successfully on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Stop Action Taken on node-2 at Fri Jun 25 08:14:46 2010 EDT
frs Stop Completed Successfully on node-2 at Fri Jun 25 08:14:51 2010 EDT
frs Start Action Taken on node-2 at Fri Jun 25 08:14:52 2010 EDT
frs Start Completed Successfully on node-2 at Fri Jun 25 08:15:19 2010 EDT
frs Promote Action Taken on node-2 at Fri Jun 25 08:15:20 2010 EDT
frs Promote Completed Successfully on node-2 at Fri Jun 25 08:15:21 2010 EDT

During the failure, if I dump the CIB, I see this:

      <transient_attributes id="node-2">
        <instance_attributes id="status-node-2">
          <nvpair id="status-node-2-probe_complete" name="probe_complete" 
value="true"/>
          <nvpair id="status-node-2-fail-count-frs:0" name="fail-count-frs:0" 
value="5"/>
          <nvpair id="status-node-2-last-failure-frs:0" 
name="last-failure-frs:0" value="1277472008"/>
        </instance_attributes>
      </transient_attributes>
    </node_state>
    <node_state uname="node-1" ha="active" in_ccm="true" crmd="online" 
join="member" shutdown="0" expected="member" id="node-1" crm-debug-or
igin="do_update_resource">
      <transient_attributes id="node-1">
        <instance_attributes id="status-node-1">
          <nvpair id="status-node-1-probe_complete" name="probe_complete" 
value="true"/>
          <nvpair id="status-node-1-last-failure-frs:1" 
name="last-failure-frs:1" value="1277469246"/>
          <nvpair id="status-node-1-fail-count-frs:1" name="fail-count-frs:1" 
value="8"/>
          <nvpair id="status-node-1-master-frs" name="master-frs" value="100"/>
        </instance_attributes>
      </transient_attributes>

This indicates that $CRM_MASTER -D really is deleting the master-frs attribute 
on node-2; I definitely see it there with the same value of 100 when the 
resource is active. The resource on node-1 remains a Slave the entire time.

I have no constraints or order commands in place that might be blocking it. In 
the log file, during the failover, I see this:



Jun 25 08:26:02 node-2 lrmd: [2934]: info: rsc:frs:0:91: stop
Jun 25 08:26:02 node-2 crm_attribute: [27185]: info: Invoked: crm_attribute -N 
node-2 -n master-frs -l reboot -D
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush 
op to all hosts for: master-frs (<null>)
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 
213: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
Jun 25 08:26:02 node-2 attrd: [2935]: info: attrd_perform_update: Sent delete 
215: node=node-2, attr=master-frs, id=<n/a>, set=(null), section=status
Jun 25 08:26:06 node-2 frs[27178]: INFO: Done stopping.
Jun 25 08:26:06 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
frs:0_stop_0 (call=91, rc=0, cib-update=111, confirmed=true) ok
Jun 25 08:26:07 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
key=41:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_start_0 )
Jun 25 08:26:07 node-2 lrmd: [2934]: info: rsc:frs:0:92: start
Jun 25 08:26:34 node-2 crm_attribute: [30836]: info: Invoked: crm_attribute -N 
node-2 -n master-frs -l reboot -v 100
Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_trigger_update: Sending flush 
op to all hosts for: master-frs (100)
Jun 25 08:26:34 node-2 attrd: [2935]: info: attrd_perform_update: Sent update 
218: master-frs=100
Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
frs:0_start_0 (call=92, rc=0, cib-update=112, confirmed=true) ok
Jun 25 08:26:34 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
key=100:117:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
Jun 25 08:26:34 node-2 lrmd: [2934]: info: rsc:frs:0:93: notify
Jun 25 08:26:34 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
frs:0_notify_0 (call=93, rc=0, cib-update=113, confirmed=true) ok
Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
key=106:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_notify_0 )
Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:94: notify
Jun 25 08:26:36 node-2 crmd: [2937]: info: process_lrm_event: LRM operation 
frs:0_notify_0 (call=94, rc=0, cib-update=114, confirmed=true) ok
Jun 25 08:26:36 node-2 crmd: [2937]: info: do_lrm_rsc_op: Performing 
key=42:118:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:0_promote_0 )
Jun 25 08:26:36 node-2 lrmd: [2934]: info: rsc:frs:0:95: promote

On the other node, I see this:

Jun 25 09:35:12 node-1 frs[2431]: INFO: Light monitoring of the frs system 
completed: RUNNING_SLAVE
Jun 25 09:35:14 node-1 frs[2791]: INFO: Light monitoring of the frs system 
completed: RUNNING_SLAVE
Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action 
frs:0_monitor_7000 arrived after a completed transition
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
process_graph_event:467 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=frs:0_monitor_7000, 
magic=0:9;44:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.100) : 
Inactive graph
Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount 
for frs:0 on node-2 after failed monitor: rc=9 (update=value++, time=1277472916)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: State 
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_state_transition: All 2 cluster 
nodes are eligible to run resources.
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 698: Requesting 
the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=node-2, magic=NA, cib=0.468.101) : Transient 
attribute: update
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 699: Requesting 
the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=node-2, magic=NA, cib=0.468.102) : Transient 
attribute: update
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 700: Requesting 
the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the 
PE: query=700, ref=pe_calc-dc-1277472916-541, seq=3616, quorate=1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jun 25 09:35:16 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' 
= -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node 
node-2 is online
Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:16 node-1 pengine: [3004]: info: determine_online_status: Node 
node-1 is online
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Apache:0 on node-1 to Apache:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
frs:0 on node-1 to frs:1
Jun 25 09:35:16 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:16 node-1 pengine: [3004]: info: find_clone: Internally renamed 
PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Apache-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
pgpool-II-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
MySQL-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: 
frs-MS
Jun 25 09:35:16 node-1 pengine: [3004]: notice: native_print:      frs:0      
(ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Postfix-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
PostgreSQL-Clone
Jun 25 09:35:16 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
8 times on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999992 more times on node-2 before being forced off
Jun 25 09:35:16 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
8 times on node-1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
(Master node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 
instances of a possible 1 to master
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:0  (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: process_graph_event: Action 
frs:0_monitor_3000 arrived after a completed transition
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:1  (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
process_graph_event:467 - Triggered transition abort (complete=1, 
tag=lrm_rsc_op, id=frs:0_monitor_3000, 
magic=0:9;43:118:8:621ad9c9-5546-4e6d-9d7b-fb09e9034320, cib=0.468.103) : 
Inactive graph
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:0       (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: WARN: update_failcount: Updating failcount 
for frs:0 on node-2 after failed monitor: rc=9 (update=value++, time=1277472916)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:1       (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 701: Requesting 
the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:0   (Started node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:1   (Started node-1)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Recover resource 
frs:0    (Master node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the 
PE: query=701, ref=pe_calc-dc-1277472916-542, seq=3616, quorate=1
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
frs:1      (Slave node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: abort_transition_graph: 
te_update_diff:146 - Triggered transition abort (complete=1, 
tag=transient_attributes, id=node-2, magic=NA, cib=0.468.104) : Transient 
attribute: update
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:0 (Started node-2)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke: Query 702: Requesting 
the current CIB: S_POLICY_ENGINE
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:1 (Started node-1)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:0      (Started node-2)
Jun 25 09:35:16 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:1      (Started node-1)
Jun 25 09:35:16 node-1 crmd: [3005]: info: do_pe_invoke_callback: Invoking the 
PE: query=702, ref=pe_calc-dc-1277472916-543, seq=3616, quorate=1
Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc calculation 
pe_calc-dc-1277472916-541 is obsolete
Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 
119: PEngine Input stored in: /var/lib/pengine/pe-input-633.bz2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' 
= -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
node-2 is online
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:0_monitor_3000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
node-1 is online
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Apache:0 on node-1 to Apache:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
frs:0 on node-1 to frs:1
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Apache-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
pgpool-II-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
MySQL-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: 
frs-MS
Jun 25 09:35:17 node-1 pengine: [3004]: notice: native_print:      frs:0      
(ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Postfix-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
PostgreSQL-Clone
Jun 25 09:35:17 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
8 times on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999992 more times on node-2 before being forced off
Jun 25 09:35:17 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
8 times on node-1
Jun 25 09:35:17 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
(Master node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 
instances of a possible 1 to master
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:0  (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:1  (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:0       (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:1       (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:0   (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:1   (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Recover resource 
frs:0    (Master node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
frs:1      (Slave node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:0 (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:1 (Started node-1)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:0      (Started node-2)
Jun 25 09:35:17 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:1      (Started node-1)
Jun 25 09:35:17 node-1 crmd: [3005]: info: handle_response: pe_calc calculation 
pe_calc-dc-1277472916-542 is obsolete
Jun 25 09:35:17 node-1 pengine: [3004]: info: process_pe_message: Transition 
120: PEngine Input stored in: /var/lib/pengine/pe-input-634.bz2
Jun 25 09:35:17 node-1 pengine: [3004]: notice: unpack_config: On loss of CCM 
Quorum: Ignore
Jun 25 09:35:17 node-1 pengine: [3004]: info: unpack_config: Node scores: 'red' 
= -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
node-2 is online
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:0_monitor_3000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:0_monitor_7000 on node-2: master (failed) (9)
Jun 25 09:35:17 node-1 pengine: [3004]: info: determine_online_status: Node 
node-1 is online
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Apache:0 on node-1 to Apache:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
pgpool-II:0 on node-1 to pgpool-II:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
Postfix:0 on node-1 to Postfix:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
MySQL:0 on node-1 to MySQL:1
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
frs:0 on node-1 to frs:1
Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:1_monitor_3000 on node-1: unknown error (1)
Jun 25 09:35:17 node-1 pengine: [3004]: info: find_clone: Internally renamed 
PostgreSQL:0 on node-1 to PostgreSQL:2 (ORPHAN)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Apache-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
pgpool-II-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
MySQL-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Master/Slave Set: 
frs-MS
Jun 25 09:35:18 node-1 pengine: [3004]: notice: native_print:      frs:0      
(ocf::broadvox:frs):     Master node-2 FAILED
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Slaves: [ 
node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
Postfix-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: notice: clone_print:  Clone Set: 
PostgreSQL-Clone
Jun 25 09:35:18 node-1 pengine: [3004]: notice: short_print:      Started: [ 
node-2 node-1 ]
Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
9 times on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999991 more times on node-2 before being forced off
Jun 25 09:35:18 node-1 pengine: [3004]: info: get_failcount: frs-MS has failed 
8 times on node-1
Jun 25 09:35:18 node-1 pengine: [3004]: notice: common_apply_stickiness: frs-MS 
can fail 999992 more times on node-1 before being forced off
Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: Promoting frs:0 
(Master node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: info: master_color: frs-MS: Promoted 1 
instances of a possible 1 to master
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (3s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: RecurringOp:  Start recurring 
monitor (7s) for frs:0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:0  (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Apache:1  (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:0       (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
pgpool-II:1       (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:0   (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
MySQL:1   (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Recover resource 
frs:0    (Master node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
frs:1      (Slave node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:0 (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
Postfix:1 (Started node-1)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:0      (Started node-2)
Jun 25 09:35:18 node-1 pengine: [3004]: notice: LogActions: Leave resource 
PostgreSQL:1      (Started node-1)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_state_transition: State 
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
cause=C_IPC_MESSAGE origin=handle_response ]
Jun 25 09:35:18 node-1 crmd: [3005]: info: unpack_graph: Unpacked transition 
121: 45 actions in 45 synapses
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_te_invoke: Processing graph 121 
(ref=pe_calc-dc-1277472916-543) derived from /var/lib/pengine/pe-input-635.bz2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 70 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
108: notify frs:0_pre_notify_demote_0 on node-2
Jun 25 09:35:18 node-1 pengine: [3004]: info: process_pe_message: Transition 
121: PEngine Input stored in: /var/lib/pengine/pe-input-635.bz2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
110: notify frs:1_pre_notify_demote_0 on node-1 (local)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
key=110:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:313: notify
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:0_pre_notify_demote_0 (108) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
frs:1_notify_0 (call=313, rc=0, cib-update=703, confirmed=true) ok
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:1_pre_notify_demote_0 (110) confirmed on node-1 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 71 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 68 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
44: demote frs:0_demote_0 on node-2
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:0_demote_0 (44) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 69 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 72 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
109: notify frs:0_post_notify_demote_0 on node-2
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
111: notify frs:1_post_notify_demote_0 on node-1 (local)
Jun 25 09:35:18 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
key=111:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:18 node-1 lrmd: [3002]: info: rsc:frs:1:314: notify
Jun 25 09:35:18 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
frs:1_notify_0 (call=314, rc=0, cib-update=704, confirmed=true) ok
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:1_post_notify_demote_0 (111) confirmed on node-1 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:0_post_notify_demote_0 (109) confirmed on node-2 (rc=0)
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 73 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 58 
fired and confirmed
Jun 25 09:35:18 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
101: notify frs:0_pre_notify_stop_0 on node-2
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 
102: notify frs:1_pre_notify_stop_0 on node-1 (local)
Jun 25 09:35:19 node-1 crmd: [3005]: info: do_lrm_rsc_op: Performing 
key=102:121:0:621ad9c9-5546-4e6d-9d7b-fb09e9034320 op=frs:1_notify_0 )
Jun 25 09:35:19 node-1 lrmd: [3002]: info: rsc:frs:1:315: notify
Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:0_pre_notify_stop_0 (101) confirmed on node-2 (rc=0)
Jun 25 09:35:19 node-1 crmd: [3005]: info: process_lrm_event: LRM operation 
frs:1_notify_0 (call=315, rc=0, cib-update=705, confirmed=true) ok
Jun 25 09:35:19 node-1 crmd: [3005]: info: match_graph_event: Action 
frs:1_pre_notify_stop_0 (102) confirmed on node-1 (rc=0)
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 59 
fired and confirmed
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_pseudo_action: Pseudo action 56 
fired and confirmed
Jun 25 09:35:19 node-1 crmd: [3005]: info: te_rsc_command: Initiating action 7: 
stop frs:0_stop_0 on node-2
Jun 25 09:35:19 node-1 crmd: [3005]: info: abort_transition_graph: 
te_update_diff:157 - Triggered transition abort (complete=0, 
tag=transient_attributes, id=node-2, magic=NA, cib=0.468.112) : Transient 
attribute: removal
Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort 
priority upgraded from 0 to 1000000
Jun 25 09:35:19 node-1 crmd: [3005]: info: update_abort_priority: Abort action 
done superceeded by restart
Jun 25 09:35:21 node-1 frs[3339]: INFO: Light monitoring of the frs system 
completed: RUNNING_SLAVE



I did notice this line:

Jun 25 09:35:17 node-1 pengine: [3004]: WARN: unpack_rsc_op: Processing failed 
op frs:1_monitor_3000 on node-1: unknown error (1)

That makes me think this may be related to the failed operation from yesterday. 
However, I have stopped and started the resource several times on node-1 since 
that failed op occurred. Do I need to clear these entries (clean up the 
resource) each time I start the resource?
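
(If they do need clearing by hand, a resource cleanup is the usual way; a sketch 
using the crm shell, with the resource and node names taken from this thread:)

# Clear the stale failed-op history and fail-counts for the master/slave set
# on node-1.
crm resource cleanup frs-MS node-1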



-----Original Message-----
From: Eliot Gable [mailto:ega...@broadvox.com]
Sent: Friday, June 25, 2010 1:08 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

Ok; I'm still not having any luck with this....

In my START action, right before I return $OCF_SUCCESS, I do:

$CRM_MASTER -v 100

Where CRM_MASTER is defined as in the drbd resource with '-l reboot'. In my 
STOP action, right at the beginning of it, I do:

$CRM_MASTER -D
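
(For completeness, the hooks described above look roughly like this; everything 
outside the crm_master calls is elided, and the CRM_MASTER definition shown is 
an assumption modeled on the drbd agent:)

CRM_MASTER="${HA_SBIN_DIR:-/usr/sbin}/crm_master -l reboot"

frs_start() {
    # ... bring the service up and verify it ...
    $CRM_MASTER -v 100      # make this node a candidate for promotion
    return $OCF_SUCCESS
}

frs_stop() {
    $CRM_MASTER -D          # drop the promotion preference before stopping
    # ... shut the service down ...
    return $OCF_SUCCESS
}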

I copied the new RA to both nodes, stopped the resource on both nodes, started 
it, promoted it, then caused the monitoring action to fail. I see Pacemaker 
generating the usual string of actions, including a STOP, so it should be 
deleting the master preference. However, no failover occurs. The slave still 
just sits there as a slave. Is there something else I am missing?

Thanks again.


-----Original Message-----
From: Eliot Gable [mailto:ega...@broadvox.com]
Sent: Friday, June 25, 2010 12:27 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

After looking at the drbd master/slave RA, I think it is now clear. crm_master, 
being a wrapper around crm_attribute, already supplies everything I need; the 
only things I have to add on the command line are a few options such as the 
lifetime of the attribute modification, the value to set, or whether to delete 
the attribute.

So, if I delete the attribute when a STOP is issued and keep the attribute's 
lifetime set to "reboot", it should be sufficient to cause a failover, correct?

Also, I am thinking that in my START action, after I have performed enough 
monitoring to ensure that everything came up correctly, I should issue 
crm_master with the -v option to set a score for the node so it is a good 
candidate to become master, correct?


-----Original Message-----
From: Eliot Gable [mailto:ega...@broadvox.com]
Sent: Friday, June 25, 2010 12:17 PM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

Thanks. Should I update my RA to use crm_master when it detects the resource in 
FAILED_MASTER state, or should I put it in the demote action or something else?

What's the command line needed to "reduce the promotion score"? I looked at the 
Pacemaker_Explained.pdf document, and while it mentions using crm_master to 
provide a promotion score, it does not say which attribute actually needs to be 
modified. Is there another command that can print out all available attributes, 
or a document somewhere that lists them?
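
(For what it's worth, crm_master is a convenience wrapper around per-node 
crm_attribute calls like the ones visible in the logs elsewhere in this thread:)

crm_attribute -N node-2 -n master-frs -l reboot -v 100   # set the promotion score
crm_attribute -N node-2 -n master-frs -l reboot -D       # remove it so the node is not promoted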


-----Original Message-----
From: Andrew Beekhof [mailto:and...@beekhof.net]
Sent: Friday, June 25, 2010 8:26 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Master/Slave not failing over

On Fri, Jun 25, 2010 at 12:43 AM, Eliot Gable <ega...@broadvox.com> wrote:
> Thanks for pointing that out.
>
> I am still having issues with the master/slave resource. When I cause one of 
> the monitoring actions to fail,

As well as failing, it should also use crm_master to reduce the promotion score.

> the master node gets a DEMOTE, STOP, START, PROMOTE and the slave resource 
> just sits there. I want to see DEMOTE on the failed master, then PROMOTE on 
> the slave, then STOP on the failed master, followed by START on the failed 
> master.

The stop will always happen before the promote, regardless of which
instance is being promoted.

> How can I achieve this? Is there some sort of constraint or something I can 
> put in place to make it happen?
>
> Thanks again for any insights.
>
>
>
> -----Original Message-----
> From: Dejan Muhamedagic [mailto:deja...@fastmail.fm]
> Sent: Thursday, June 24, 2010 12:37 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Master/Slave not failing over
>
> Hi,
>
> On Thu, Jun 24, 2010 at 12:12:34PM -0400, Eliot Gable wrote:
>> On another note, I cannot seem to get Pacemaker to monitor the master node. 
>> It monitors the slave node just fine. These are the operations I have 
>> defined:
>>
>>         op monitor interval="5" timeout="30s" \
>>         op monitor interval="10" timeout="30s" OCF_CHECK_LEVEL="10" \
>>         op monitor interval="5" role="Master" timeout="30s" \
>>         op monitor interval="10" role="Master" timeout="30s" 
>> OCF_CHECK_LEVEL="10" \
>>         op start interval="0" timeout="40s" \
>>         op stop interval="0" timeout="20s"
>>
>> Did I do something wrong?
>
> Yes, all monitor intervals have to be different. I can't say
> what happened without looking at the logs, but you should set something
> like this:
>
>         op monitor interval="6" role="Master" timeout="30s" \
>         op monitor interval="11" role="Master" timeout="30s" 
> OCF_CHECK_LEVEL="10" \
>
> Thanks,
>
> Dejan
>
>>
>> From: Eliot Gable [mailto:ega...@broadvox.com]
>> Sent: Thursday, June 24, 2010 11:55 AM
>> To: The Pacemaker cluster resource manager
>> Subject: [Pacemaker] Master/Slave not failing over
>>
>> I am using the latest CentOS 5.5 packages for pacemaker/corosync. I have a 
>> master/slave resource up and running, and when I make the master fail, 
>> instead of immediately promoting the slave, it restarts the failed master 
>> and re-promotes it back to master. This takes longer than if it would just 
>> immediately promote the slave. I can understand it waiting for a DEMOTE 
>> action to succeed on the failed master before it promotes the slave, but 
>> that is all it should need to do. Is there any way I can change this 
>> behavior? Am I missing some key point in the process?
>>
>>

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
