Hi, I have a cman-based cluster that uses pcmk-fencing. We have
configured an ipmilan fencing device and an APC fencing device with
stonith.
I set the fencing order like this:
<fencing-topology> \
  <fencing-level devices="ipmi_gw2" id="fencing-gw2-2" index="2" target="gw2"/> \
  <fencing-level devices="apc_power_gw2" id="fencing-gw2-1" index="1" target="gw2"/> \
</fencing-topology>
This all works as intended, i.e. the apc is used as first device and
shuts down the server.
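
To make the intended order explicit, here is a minimal Python sketch that
reads the topology above and lists the devices per level in ascending
index order. It only illustrates the ordering I expect; it is not how
stonith-ng is implemented. The helper name fencing_order is made up for
this example, and the crm shell line continuations are dropped from the
XML.

```python
import xml.etree.ElementTree as ET

# Fencing topology copied from the configuration above
# (crm shell "\" line continuations removed).
TOPOLOGY_XML = """
<fencing-topology>
  <fencing-level devices="ipmi_gw2" id="fencing-gw2-2" index="2" target="gw2"/>
  <fencing-level devices="apc_power_gw2" id="fencing-gw2-1" index="1" target="gw2"/>
</fencing-topology>
"""


def fencing_order(xml_text, target):
    """Return the device lists for `target`, lowest level index first.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text.strip())
    levels = [
        (int(level.get("index")), level.get("devices").split(","))
        for level in root.findall("fencing-level")
        if level.get("target") == target
    ]
    levels.sort(key=lambda item: item[0])
    return [devices for _, devices in levels]


order = fencing_order(TOPOLOGY_XML, "gw2")
print(order)
```

For gw2 this puts apc_power_gw2 (index 1) ahead of ipmi_gw2 (index 2),
which matches what I see when the APC is reachable.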
But when I try to simulate a failure of the APC device by setting an
alternate IP (that is not reachable), fencing takes a very long time
before it succeeds.
The cluster-wide stonith-timeout is 20 seconds, so I would expect it to
move on to the second device after 20 seconds. However, what I see is this:
Oct 31 16:09:59 corosync [TOTEM ] A processor failed, forming new
configuration. # machine powered off manually
Oct 31 16:10:00 [5875] gw1 stonith-ng: info:
stonith_action_create: Initiating action list for agent fence_apc
(target=(null))
Oct 31 16:10:00 [5875] gw1 stonith-ng: info:
internal_stonith_action_execute: Attempt 2 to execute fence_apc
(list). remaining timeout is 120
Oct 31 16:10:01 [5875] gw1 stonith-ng: info:
update_remaining_timeout: Attempted to execute agent fence_apc
(list) the maximum number of times (2) allowed
(...)
Oct 31 16:10:06 [5879] gw1 crmd: notice: te_fence_node:
Executing reboot fencing operation (151) on gw2 (timeout=20000)
Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: handle_request:
Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: merge_duplicates:
Merging stonith action reboot for node gw2 originating from client
crmd.5879.aa3600b2 with identical request from
stonith_admin.27011@gw1.ad19c5b3 (24s)
Oct 31 16:10:06 [5875] gw1 stonith-ng: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
gw2: aa3600b2-7b6d-4243-b525-de0a0a7399a8 (duplicate)
Oct 31 16:10:06 [5875] gw1 stonith-ng: info: stonith_command:
Processed st_fence from crmd.5879: Operation now in progress (-115)
*Oct 31 16:11:30 [5879] gw1 crmd: error:
stonith_async_timeout_handler: Async call 4 timed out after 84000ms*
Oct 31 16:11:30 [5879] gw1 crmd: notice:
tengine_stonith_callback: Stonith operation
4/151:40:0:304a2845-8177-4018-9fb6-7b94d0d1288a: Timer expired (-62)
Oct 31 16:11:30 [5879] gw1 crmd: notice:
tengine_stonith_callback: Stonith operation 4 for gw2 failed (Timer
expired): aborting transition.
Oct 31 16:11:30 [5879] gw1 crmd: info:
abort_transition_graph: tengine_stonith_callback:447 - Triggered
transition abort (complete=0) : Stonith failed
Oct 31 16:11:30 [5879] gw1 crmd: notice: te_fence_node:
Executing reboot fencing operation (151) on gw2 (timeout=20000)
Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: handle_request:
Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: merge_duplicates:
Merging stonith action reboot for node gw2 originating from client
crmd.5879.429b7850 with identical request from
stonith_admin.27011@gw1.ad19c5b3 (24s)
Oct 31 16:11:30 [5875] gw1 stonith-ng: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
gw2: 429b7850-9d23-49f0-abe4-1f18eb8d122a (duplicate)
Oct 31 16:11:30 [5875] gw1 stonith-ng: info: stonith_command:
Processed st_fence from crmd.5879: Operation now in progress (-115)
Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith:
Total remote op timeout set to 240 for fencing of node gw2 for
stonith_admin.27011.ad19c5b3
*Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith:
Requesting that gw1 perform op reboot gw2 with ipmi_gw2 for
stonith_admin.27011 (144s)*
Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_command:
Processed st_fence from gw1: Operation now in progress (-115)
Oct 31 16:12:24 [5875] gw1 stonith-ng: info:
stonith_action_create: Initiating action reboot for agent
fence_ipmilan (target=gw2)
*Oct 31 16:12:26 [5875] gw1 stonith-ng: notice: log_operation:
Operation 'reboot' [27473] (call 0 from stonith_admin.27011) for host
'gw2' with device 'ipmi_gw2' returned: 0 (OK)*
So rebooting gw2 with the fallback stonith device does not take about
25 seconds but more than two minutes, even though the 20-second timeout
values show up in the log. What am I doing wrong?
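
For reference, the delays can be worked out from the log timestamps with
a small Python sketch. The timestamps are taken from the log excerpt
above; the helper seconds_between and the variable names are made up for
this example, and it assumes all events fall on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def seconds_between(start, end):
    """Elapsed whole seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds())


# Timestamps copied from the log excerpt above.
fence_requested = "16:10:06"   # crmd asks stonith-ng to fence gw2
first_timeout = "16:11:30"     # crmd: async call 4 timed out
ipmi_requested = "16:12:24"    # stonith-ng asks gw1 to use ipmi_gw2
ipmi_finished = "16:12:26"     # fence_ipmilan reboot returned OK

crmd_timeout = seconds_between(fence_requested, first_timeout)
wait_before_fallback = seconds_between(fence_requested, ipmi_requested)
ipmi_duration = seconds_between(ipmi_requested, ipmi_finished)
total = seconds_between(fence_requested, ipmi_finished)

print(crmd_timeout, wait_before_fallback, ipmi_duration, total)
```

So the first crmd timer matches the 84000ms from the log, the fallback
device is only tried 138 seconds after the request, and the IPMI reboot
itself takes just 2 seconds.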
Here is the device config:
primitive apc_power_gw2 stonith:fence_apc \
        params ipaddr="192.168.33.64" pcmk_host_list="gw2" \
        pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" \
        login="***" port="1" action="reboot"
primitive ipmi_gw1 stonith:fence_ipmilan \
        params ipaddr="192.168.33.4" pcmk_host_list="gw1" \
        pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" \
        login="***" lanplus="1" privlvl="operator" power_wait="2" timeout="20" \
        stonith-timeout="15s"
primitive ipmi_gw2 stonith:fence_ipmilan \
        params ipaddr="192.168.33.5" pcmk_host_list="gw2" \
        pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" \
        login="***" lanplus="1" privlvl="operator" power_wait="2" timeout="15" \
        stonith-timeout="10s"
property $id="cib-bootstrap-options" \
cluster-infrastructure="cman" \
no-quorum-policy="ignore" \
stonith-enabled="true" \
stonith-timeout="20s" \
Regards,
Jakbo Curdes
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems