Junko IKEDA wrote:
> Hi,
> 
> This is not a serious problem, but I noticed this behavior and would like
> to know whether it is normal for Heartbeat, if you know anything about it.
> 
> There are two nodes, and a virtual IP (IPaddr) is running on one of them.
> If the IPaddr is taken away, the fail-over itself always succeeds.
> What I noticed is that Heartbeat first tries to restart IPaddr on the node
> where it has already failed (which fails), and only then starts it on the
> stand-by node.
> Why does Heartbeat try to (re)start the resource on the failed node again?
> Is it a necessary step for some other process, e.g. updating its failcount?
> I can see this with both Heartbeat 2.0.8-1 and 2.0.9-1.

It retries the resource on the same node until its
resource_failure_stickiness is exceeded.

By adjusting that, you can control how many times it tries before moving
it to another machine.

The rationale for this is that moving a resource and all the things
that have to move with it is typically slower than restarting the
resource in place (and all the things that depend on it).
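
For what it's worth, here is a rough sketch of how that is typically set in
the 2.0.x CIB.  The attribute names and nesting are from memory, so treat
this as an assumption and check your own cib.xml (or the DTD) rather than
pasting it verbatim:

  <crm_config>
    <cluster_property_set id="cib-bootstrap-options">
      <attributes>
        <!-- bonus a resource gets for staying where it currently runs -->
        <nvpair id="opt-res-stickiness"
                name="default_resource_stickiness" value="100"/>
        <!-- penalty applied for each failure (a negative score) -->
        <nvpair id="opt-res-fail-stickiness"
                name="default_resource_failure_stickiness" value="-34"/>
      </attributes>
    </cluster_property_set>
  </crm_config>

Roughly speaking (again a sketch, with numbers invented for illustration),
the current node's score is resource_stickiness plus failcount times
resource_failure_stickiness.  With the values above the resource is
restarted in place after the first two failures (100-34=66, 100-68=32) and
moved to the other node on the third (100-102=-2).  Setting the failure
stickiness to -INFINITY makes it move on the very first failure, and the
same attribute can be set per resource in a primitive's meta_attributes to
override the cluster-wide default.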

> pengine[10031]: 2007/04/19_23:34:12 info: determine_online_status: Node
> guest1 is online
> pengine[10031]: 2007/04/19_23:34:13 WARN: unpack_rsc_op: Processing failed
> op (vip_monitor_10000) on guest1
> pengine[10031]: 2007/04/19_23:34:13 info: determine_online_status: Node
> guest2 is online
> pengine[10031]: 2007/04/19_23:34:13 info: native_print: vip
> (heartbeat::ocf:IPaddr):      Started guest1 FAILED 
> ^^^^^^ guest1 is the node which VIP has stopped. ^^^^^
> cib[10018]: 2007/04/19_23:34:13 info: cib_diff_notify: Update (client:
> 10030, call:3): 0.3.13 -> 0.3.14 (ok)
> crmd[10022]: 2007/04/19_23:34:13 info: do_state_transition: guest2: State
> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_IPC_MESSAGE origin=route_message ]
> crmd[10022]: 2007/04/19_23:34:13 info: do_state_transition: All 2 cluster
> nodes are eligible to run resources.
> tengine[10030]: 2007/04/19_23:34:13 info: unpack_graph: Unpacked transition
> 3: 3 actions in 3 synapses
> pengine[10031]: 2007/04/19_23:34:13 notice: NoRoleChange: Recover resource
> vip   (guest1)
> pengine[10031]: 2007/04/19_23:34:13 notice: StopRsc:   guest1 Stop vip
> tengine[10030]: 2007/04/19_23:34:13 info: send_rsc_command: Initiating
> action 2: vip_stop_0 on guest1
> pengine[10031]: 2007/04/19_23:34:13 notice: StartRsc:  guest1 Start vip
> pengine[10031]: 2007/04/19_23:34:13 notice: RecurringOp: guest1
> vip_monitor_10000
> pengine[10031]: 2007/04/19_23:34:13 info: process_pe_message: Transition 3:
> PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-3.bz2
> pengine[10031]: 2007/04/19_23:34:13 info: log_data_element:
> process_pe_message: [generation] <cib admin_epoch="0" epoch="3"
> num_updates="14" have_quorum="true" cib_feature_revision="1.3"
> generated="true" ignore_dtd="false" num_peers="2" ccm_transition="2"
> dc_uuid="762a84bc-5633-4dcc-97ab-db3986cc778f"/>
> tengine[10030]: 2007/04/19_23:34:13 info: process_te_message: Another
> transition is already active
> tengine[10030]: 2007/04/19_23:34:13 info: update_abort_priority: Abort
> priority upgraded to 1000000
> crmd[10022]: 2007/04/19_23:34:13 info: do_state_transition: guest2: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=route_message ]
> tengine[10030]: 2007/04/19_23:34:13 info: update_abort_priority: Abort
> action 0 superceeded by 2
> pengine[10031]: 2007/04/19_23:34:13 notice: unpack_config: On loss of CCM
> Quorum: Ignore
> pengine[10031]: 2007/04/19_23:34:13 info: determine_online_status: Node
> guest1 is online
> pengine[10031]: 2007/04/19_23:34:13 WARN: unpack_rsc_op: Processing failed
> op (vip_monitor_10000) on guest1
> pengine[10031]: 2007/04/19_23:34:13 info: determine_online_status: Node
> guest2 is online
> pengine[10031]: 2007/04/19_23:34:13 info: native_print: vip
> (heartbeat::ocf:IPaddr):      Started guest1 FAILED
> pengine[10031]: 2007/04/19_23:34:13 notice: NoRoleChange: Recover resource
> vip   (guest2)
> pengine[10031]: 2007/04/19_23:34:13 notice: StopRsc:   guest1 Stop vip
> pengine[10031]: 2007/04/19_23:34:13 notice: StartRsc:  guest2 Start vip
> pengine[10031]: 2007/04/19_23:34:13 notice: RecurringOp: guest2
> vip_monitor_10000
> pengine[10031]: 2007/04/19_23:34:13 info: process_pe_message: Transition 4:
> PEngine Input stored in: /var/lib/heartbeat/pengine/pe-input-4.bz2
> 
> Some "*.dot" files generated by transition graphs (pe-input3.bz2, pe-input
> 4.bz2) include the following information.
> (See attached file. Hostname; stand-by = guest1, DC = guest2)
> pe-input3.dot(gif); vip_stop_0(guest1) => vip_start_0(guest1) =>
> vip_monitor_10000(guest1) pe-input4.dot(gif); vip_stop_0(guest1) =>
> vip_start_0(guest2) => vip_monitor_10000(guest2)
> 
> I don't understand why "pe-input3.bz2" was needed.

To help us find bugs.  Trust me, that's all the motivation we need to
create a file ;-).
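
If you want to look at those graphs yourself: the .dot files render fine
with Graphviz, e.g.

  dot -Tgif pe-input-3.dot -o pe-input-3.gif

and the pe-input-N.bz2 files can be fed back into ptest to regenerate the
graph.  The exact options differ between versions, so take this invocation
as an illustration only and check ptest --help, but it is along the lines
of:

  ptest -x /var/lib/heartbeat/pengine/pe-input-3.bz2 -D pe-input-3.dot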

-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
