Hi, I have a cman-based cluster that uses pcmk-fencing. We have configured an ipmilan fencing device and an apc fencing device with stonith.
I set the fencing order like this:
<fencing-topology> \
<fencing-level devices="ipmi_gw2" id="fencing-gw2-2" index="2" target="gw2"/> \
<fencing-level devices="apc_power_gw2" id="fencing-gw2-1" index="1" target="gw2"/> \
</fencing-topology>

This all works as intended, i.e. the APC is used as the first device and shuts down the server. But when I try to simulate a failure of the APC device by setting an alternate IP (that is not reachable), fencing takes a very long time before it succeeds. The cluster-wide stonith-timeout is 20 seconds, so I would expect it to move on to the second device after 20 seconds. However, what I see is this:

Oct 31 16:09:59 corosync [TOTEM ] A processor failed, forming new configuration. # machine powered off manually
Oct 31 16:10:00 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action list for agent fence_apc (target=(null))
Oct 31 16:10:00 [5875] gw1 stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_apc (list). remaining timeout is 120
Oct 31 16:10:01 [5875] gw1 stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_apc (list) the maximum number of times (2) allowed
(...)
Oct 31 16:10:06 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
Oct 31 16:10:06 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.aa3600b2 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
Oct 31 16:10:06 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: aa3600b2-7b6d-4243-b525-de0a0a7399a8 (duplicate)
Oct 31 16:10:06 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
*Oct 31 16:11:30 [5879] gw1 crmd: error: stonith_async_timeout_handler: Async call 4 timed out after 84000ms*
Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4/151:40:0:304a2845-8177-4018-9fb6-7b94d0d1288a: Timer expired (-62)
Oct 31 16:11:30 [5879] gw1 crmd: notice: tengine_stonith_callback: Stonith operation 4 for gw2 failed (Timer expired): aborting transition.
Oct 31 16:11:30 [5879] gw1 crmd: info: abort_transition_graph: tengine_stonith_callback:447 - Triggered transition abort (complete=0) : Stonith failed
Oct 31 16:11:30 [5879] gw1 crmd: notice: te_fence_node: Executing reboot fencing operation (151) on gw2 (timeout=20000)
Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: handle_request: Client crmd.5879.4fe77863 wants to fence (reboot) 'gw2' with device '(any)'
Oct 31 16:11:30 [5875] gw1 stonith-ng: notice: merge_duplicates: Merging stonith action reboot for node gw2 originating from client crmd.5879.429b7850 with identical request from stonith_admin.27011@gw1.ad19c5b3 (24s)
Oct 31 16:11:30 [5875] gw1 stonith-ng: info: initiate_remote_stonith_op: Initiating remote operation reboot for gw2: 429b7850-9d23-49f0-abe4-1f18eb8d122a (duplicate)
Oct 31 16:11:30 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from crmd.5879: Operation now in progress (-115)
Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Total remote op timeout set to 240 for fencing of node gw2 for stonith_admin.27011.ad19c5b3
*Oct 31 16:12:24 [5875] gw1 stonith-ng: info: call_remote_stonith: Requesting that gw1 perform op reboot gw2 with ipmi_gw2 for stonith_admin.27011 (144s)*
Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_command: Processed st_fence from gw1: Operation now in progress (-115)
Oct 31 16:12:24 [5875] gw1 stonith-ng: info: stonith_action_create: Initiating action reboot for agent fence_ipmilan (target=gw2)
*Oct 31 16:12:26 [5875] gw1 stonith-ng: notice: log_operation: Operation 'reboot' [27473] (call 0 from stonith_admin.27011) for host 'gw2' with device 'ipmi_gw2' returned: 0 (OK)*

So rebooting gw2 with the fallback stonith device does not take about 25 seconds but more than two minutes, even though the 20-second timeout value shows up in the logs. What am I doing wrong?
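
To put numbers on the delay, here is a minimal sketch that computes the elapsed time between timestamps copied from the log excerpt. The helper name `elapsed_seconds` and the constant `ASSUMED_YEAR` are only for illustration; the logs carry no year, so an arbitrary one is assumed, which does not affect the differences.

```python
from datetime import datetime

# The syslog timestamps have no year; any fixed year works for differences.
ASSUMED_YEAR = 2000  # arbitrary assumption, not taken from the logs


def elapsed_seconds(start: str, end: str) -> int:
    """Return the seconds between two 'Mon DD HH:MM:SS' timestamps."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{ASSUMED_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{ASSUMED_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds())


# Timestamps copied from the log excerpt.
fence_requested = "Oct 31 16:10:06"  # crmd asks to fence gw2
crmd_timed_out = "Oct 31 16:11:30"   # "Async call 4 timed out"
ipmi_started = "Oct 31 16:12:24"     # stonith-ng moves on to ipmi_gw2
ipmi_succeeded = "Oct 31 16:12:26"   # ipmi_gw2 returned: 0 (OK)

print(elapsed_seconds(fence_requested, crmd_timed_out))  # 84
print(elapsed_seconds(fence_requested, ipmi_started))    # 138
print(elapsed_seconds(ipmi_started, ipmi_succeeded))     # 2
print(elapsed_seconds(fence_requested, ipmi_succeeded))  # 140
```

So the IPMI reboot itself takes only about 2 seconds; almost all of the 140 seconds passes before stonith-ng falls back to ipmi_gw2.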

Here is the device config:

primitive apc_power_gw2 stonith:fence_apc \
params ipaddr="192.168.33.64" pcmk_host_list="gw2" pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" login="***" port="1" action="reboot"
primitive ipmi_gw1 stonith:fence_ipmilan \
params ipaddr="192.168.33.4" pcmk_host_list="gw1" pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" power_wait="2" timeout="20" stonith-timeout="15s"
primitive ipmi_gw2 stonith:fence_ipmilan \
params ipaddr="192.168.33.5" pcmk_host_list="gw2" pcmk_host_check="static-list" pcmk_host_argument="none" passwd="***" login="***" lanplus="1" privlvl="operator" power_wait="2" timeout="15" stonith-timeout="10s"

property $id="cib-bootstrap-options" \
        cluster-infrastructure="cman" \
        no-quorum-policy="ignore" \
        stonith-enabled="true" \
        stonith-timeout="20s" \


Regards,
Jakbo Curdes
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
