Hi everyone again.

I started experimenting with STONITH.
I wrote a small external STONITH script.
Its key points:
* It sends the "reboot" command over SSH, authenticating with a key.
* The script takes a single argument: the path to the private key.
* Any node can send a reboot to any node (even itself).
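For anyone curious what such a plugin looks like, here is a minimal sketch of an external STONITH plugin along those lines. It is my illustration, not the actual script: the action names are the ones cluster-glue dispatches to external plugins, the `sshbykey` function wrapper and the exact strings are assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of an external STONITH plugin (not the author's actual
# script). cluster-glue invokes external plugins as:  <plugin> <action> [node]
# and passes configured parameters (here: path2key) via environment variables.

sshbykey() {
    action="$1" node="$2"
    case "$action" in
    gethosts)
        # Any node may fence any node, so list all cluster members.
        crm_node -l | awk '{print $2}'
        ;;
    reset|off)
        # Send "reboot" over SSH, authenticating with the configured key.
        ssh -i "$path2key" "root@$node" reboot
        ;;
    on|status)
        # Power-on over SSH is impossible; report success so the CRM proceeds.
        return 0
        ;;
    getconfignames)
        echo "path2key"
        ;;
    getinfo-devname)
        echo "sshbykey"
        ;;
    getinfo-devdescr)
        echo "Reboots a node via ssh with key-based authentication"
        ;;
    getinfo-xml)
        echo "<parameters></parameters>"
        ;;
    *)
        return 1
        ;;
    esac
}

# Example: the metadata actions work without touching any node.
sshbykey getinfo-devname
sshbykey getinfo-devdescr
```

The metadata actions (`getinfo-*`, `getconfignames`) are what stonithd calls first, so they are a safe way to smoke-test a plugin before pointing it at a real node.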

In the crm config it looks like this:
property $id="cib-bootstrap-options" \
        stonith-enabled="true"
primitive st1 stonith:external/sshbykey \
        params path2key="/opt/cluster_tools_2/keys/root@dev-cluster2-master" \
        pcmk_host_check="none"
clone cloneStonith st1

First test - OK: the node was rebooted and the resources started.
# export path2key=/opt/cluster_tools_2/keys/r...@dev-cluster2-master.unix.tensor.ru
# stonith -t external/sshbykey -E dev-cluster2-node1
info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Now boot time 1384850888, send reboot

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: Daration: 1340 sec.

info: external_run_cmd: '/usr/lib64/stonith/plugins/external/sshbykey reset dev-cluster2-node1' output: GOOD NEWS: dev-cluster2-node1 booted in 1384864288

Don't pay attention to the "Duration" value; it is caused by the time jump
before time synchronization between the virtual machine and the server. What
matters here is that the value changed, not the specific number of seconds.
Subsequent reboots take 10 - 20 sec.
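For what it's worth, the "boot time" values in that output look like kernel boot times being compared before and after fencing. A minimal sketch of that idea (my guess at the approach, not the actual script; `get_btime` is my name, and on a real target it would run over SSH):

```shell
#!/bin/sh
# Sketch of the reboot-verification idea: the kernel's boot time is the
# "btime" field in /proc/stat, so a successful reboot shows up as a larger
# btime on the target node.

get_btime() {
    # On a remote node this would be wrapped in: ssh -i "$path2key" root@$node ...
    awk '/^btime/ {print $2}' /proc/stat
}

before=$(get_btime)
echo "Now boot time $before, send reboot"
# ... send the reboot, then poll until get_btime returns a value > $before ...
```

Comparing btime is more robust than pinging, since a node can answer ping before (or without) actually having rebooted.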

But beyond that, there are problems and questions. :)
1. 
Run the next test:
#stonith_admin --reboot=dev-cluster2-node2
The node reboots, but the resources don't start.
In crm_mon the status is: Node dev-cluster2-node2 (172793105): pending.
And it hangs like that.
Then, if I reboot this node from the console, or with stonith, or with
stonith_admin (the same command!), the resources start.

Portions of the logs:
   trace: unpack_status:        Processing node id=172793105, uname=dev-cluster2-node2
   trace: find_xml_node:        Could not find transient_attributes in node_state.
   trace: unpack_instance_attributes:   No instance attributes
   trace: unpack_status:        determining node state
   trace: determine_online_status_fencing:      dev-cluster2-node2: in_cluster=false, is_peer=online, join=down, expected=down, term=0
    info: determine_online_status_fencing:      - Node dev-cluster2-node2 is not ready to run resources
   trace: determine_online_status:      Node dev-cluster2-node2 is offline

   ........
   
   trace: unpack_status:        Processing lrm resource entries on healthy node: dev-cluster2-node2
   trace: find_xml_node:        Could not find lrm in node_state.
   trace: find_xml_node:        Could not find lrm_resources in <NULL>.
   trace: unpack_lrm_resources:         Unpacking resources on dev-cluster2-node2

   ..............
   trace: can_run_resources:    dev-cluster2-node2: online=0, unclean=0, standby=1, maintenance=0
   trace: check_actions:        Skipping param check for dev-cluster2-node2: cant run resources
.......
   trace: native_color:         Pre-allloc: VirtualIP allocation score on dev-cluster2-node2: 0
...........


      <node id="172793105" uname="dev-cluster2-node2">
        <instance_attributes id="nodes-172793105">
          <nvpair id="nodes-172793105-pgsql-data-status" name="pgsql-data-status" value="DISCONNECT"/>
          <nvpair id="nodes-172793105-standby" name="standby" value="false"/>
          <nvpair id="nodes-172793105-thisquorumnode" name="thisquorumnode" value="no"/>
        </instance_attributes>
      </node>

Why does it behave this way?

2. 
There is a slight discrepancy between Pacemaker Explained and stonith_admin --help for
stonith_admin --reboot nodename:
in one case there is an equals sign, in the other there is not.
Not very important, because both forms work.
But when you are just starting out and something goes wrong, you begin to
suspect everything. :)
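Both spellings should indeed parse identically if stonith_admin uses standard GNU long-option handling (which is my assumption here); the convention can be demonstrated with util-linux getopt, independent of stonith_admin itself:

```shell
# GNU getopt_long treats --opt=value and --opt value the same for options
# that take a required argument; util-linux getopt follows the convention.
a=$(getopt -o '' --long reboot: -- --reboot=dev-cluster2-node2)
b=$(getopt -o '' --long reboot: -- --reboot dev-cluster2-node2)
[ "$a" = "$b" ] && echo "both forms parse identically"
```

So the two documents are describing the same thing in different notations, not two different commands.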

3. 
Andrew! You promised a post about STONITH debugging.

4. (to ALL)
Also, please tell me the real arguments against using SSH for STONITH.
I have my own guesses and thoughts, but I would like to hear about your experience.

My environment:
corosync-2.3.2
resource-agents-3.9.5
pacemaker 1.1.11
----
Thanks in advance,
Andrey Groshev

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
