Hi,

I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node CentOS 7 cluster built in VirtualBox. The complete setup is available at https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git, so hopefully with some help I'll be able to make it work.
Question 1: The shared device /dev/sdb1 is a VirtualBox "shareable hard disk" (https://www.virtualbox.org/manual/ch05.html#hdimagewrites). Will SBD fencing work with that type of storage? (A VBoxManage sketch of what I mean is included after Question 5.)

I start the cluster using vagrant 1.8.1 and virtualbox 4.3 with:

$ vagrant up  # takes ~15 minutes

The setup brings up the nodes, installs the necessary packages, and prepares for the configuration of the pcs cluster. You can see which scripts the nodes execute at the bottom of the Vagrantfile.

While sbd itself can be installed on CentOS 7 with 'yum -y install sbd', the fence_sbd agent has not been packaged yet. I therefore rebuilt the Fedora 24 package using the latest https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz plus the fence_sbd update from https://github.com/ClusterLabs/fence-agents/pull/73 (rough build steps are also sketched after Question 5). The configuration is inspired by https://www.novell.com/support/kb/doc.php?id=7009485 and https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html

Question 2: After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I expect that, with just one stonith resource configured, a node will be fenced when I stop pacemaker and corosync with `pcs cluster stop node-1`, or at least with `stonith_admin -F node-1`, but this is not the case. As can be seen from the uptime below, node-1 is not shut down by `pcs cluster stop node-1` executed on itself. I found some discussions on users@clusterlabs.org about whether a node running the SBD resource can fence itself, but the conclusion was not clear to me.

Question 3: Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2, despite the fact that /var/log/messages on node-2 (the node currently running MyStonith) reports:

... notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith' returned: 0 (OK) ...

What is happening here?

Question 4 (for the future): Assuming node-1 had been fenced, what is the way of operating SBD afterwards? The sbd list now shows:

0       node-3  clear
1       node-1  off     node-2
2       node-2  clear

How do I clear the status of node-1? (My guess at the command is below.)

Question 5 (also for the future): While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented at https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html is clearly described, I wonder how 'stonith-timeout' relates to the other timeouts, such as the 'monitor interval=60s' reported by `pcs stonith show MyStonith`. (The arithmetic I used is worked out below.)
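To make Question 1 concrete, the following is roughly the VBoxManage sequence I mean by a "shareable hard disk". This is a hand-written sketch rather than a copy of what the Vagrantfile does; the image name, size and the controller name "SATA Controller" are placeholders.

# shareable images must be fixed-size, not dynamically allocated
VBoxManage createhd --filename sbd.vdi --size 100 --variant Fixed
# mark the image as shareable so several VMs can attach it simultaneously
VBoxManage modifyhd sbd.vdi --type shareable
# attach the same image to all three nodes; inside the guests it shows up
# as a second disk (here /dev/sdb, partitioned as /dev/sdb1)
for vm in node-1 node-2 node-3; do
    VBoxManage storageattach "$vm" --storagectl "SATA Controller" \
        --port 1 --device 0 --type hdd --medium sbd.vdi
done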
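For reference, the fence_sbd rebuild boils down to the v4.0.22 sources with the pull request applied on top. What I actually did was rebuild the Fedora 24 RPM, but the same result can be sketched from source like this (the build dependencies and the --with-agents flag are from memory, so treat it as an outline):

# fetch the 4.0.22 sources and the fence_sbd update from pull request 73
curl -L -o fence-agents-4.0.22.tar.gz \
    https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
curl -L -o fence_sbd-pr73.patch \
    https://github.com/ClusterLabs/fence-agents/pull/73.patch
tar xzf fence-agents-4.0.22.tar.gz
cd fence-agents-4.0.22
patch -p1 < ../fence_sbd-pr73.patch
# build and install only the sbd agent; install whatever build
# dependencies ./configure asks for (autoconf, automake, libtool, gcc, ...)
./autogen.sh
./configure --with-agents=sbd
make && make install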
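For Question 4, my untested guess, based on the sbd(8) man page, is that the slot is reset by writing a "clear" message to it from any node that can see the device:

# overwrite the 'off' message in node-1's slot with 'clear'
sbd -d /dev/sdb1 message node-1 clear
# confirm the slot state
sbd -d /dev/sdb1 list

Is that the intended procedure, or is the slot supposed to be cleared automatically when node-1 comes back?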
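For Question 5, the arithmetic I applied is just the SUSE formula with the msgwait value dumped from the device; what is not clear to me is whether the 60s monitor interval of MyStonith has to be related to these numbers at all:

# Timeout (msgwait) taken from 'sbd -d /dev/sdb1 dump'
msgwait=20
# stonith-timeout = msgwait + 20%  ->  20 + 4 = 24
stonith_timeout=$((msgwait + msgwait / 5))
pcs property set stonith-timeout=${stonith_timeout}s
# stonith-watchdog-timeout (10s) matches Timeout (watchdog) from the same dump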
Here is how I configure the cluster and test it. The run01.sh script is attached.

$ sh -x run01.sh 2>&1 | tee run01.txt

with the result:

$ cat run01.txt

Each block below shows the executed ssh command and the result.

############################
ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1 node-2 node-3'
node-1: Authorized
node-3: Authorized
node-2: Authorized

############################
ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2 node-3'
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop pacemaker.service
Redirecting to /bin/systemctl stop corosync.service
Killing any remaining services...
Removing all cluster configuration files...
node-1: Succeeded
node-2: Succeeded
node-3: Succeeded
Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
node-1: Success
node-3: Success
node-2: Success
Restaring pcsd on the nodes in order to reload the certificates...
node-1: Success
node-3: Success
node-2: Success

############################
ssh node-1 -c sudo su - -c 'pcs cluster start --all'
node-3: Starting Cluster...
node-2: Starting Cluster...
node-1: Starting Cluster...

############################
ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
Printing ring status.
Local node ID 1
RING ID 0
        id      = 192.168.10.11
        status  = ring 0 active with no faults

############################
ssh node-1 -c sudo su - -c 'pcs status corosync'
Membership information
----------------------
    Nodeid      Votes Name
         1          1 node-1 (local)
         2          1 node-2
         3          1 node-3

############################
ssh node-1 -c sudo su - -c 'pcs status'
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Sat Jun 25 15:40:51 2016
Last change: Sat Jun 25 15:40:33 2016 by hacluster via crmd on node-2
Stack: corosync
Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 0 resources configured

Online: [ node-1 node-2 node-3 ]

Full list of resources:

PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

############################
ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
0       node-3  clear
1       node-2  clear
2       node-1  clear

############################
ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
==Dumping header on disk /dev/sdb1
Header version     : 2.1
UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 10
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 20
==Header on disk /dev/sdb1 is dumped

############################
ssh node-1 -c sudo su - -c 'pcs stonith list'
fence_sbd - Fence agent for sbd

############################
ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 power_timeout=21 action=off'

ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'

ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'

ssh node-1 -c sudo su - -c 'pcs property'
Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: mycluster
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: true
 stonith-enabled: true
 stonith-timeout: 24s
 stonith-watchdog-timeout: 10s

############################
ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
 Resource: MyStonith (class=stonith type=fence_sbd)
  Attributes: devices=/dev/sdb1 power_timeout=21 action=off
  Operations: monitor interval=60s (MyStonith-monitor-interval-60s)

############################
ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
node-1: Stopping Cluster (pacemaker)...
node-1: Stopping Cluster (corosync)...
############################
ssh node-2 -c sudo su - -c 'pcs status'
Cluster name: mycluster
Last updated: Sat Jun 25 15:42:29 2016
Last change: Sat Jun 25 15:41:09 2016 by root via cibadmin on node-1
Stack: corosync
Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 1 resource configured

Online: [ node-2 node-3 ]
OFFLINE: [ node-1 ]

Full list of resources:

 MyStonith      (stonith:fence_sbd):    Started node-2

PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

############################
ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 '

############################
ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional logging available in /var/log/cluster/corosync.log
Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting to cluster infrastructure: corosync
Jun 25 15:40:11 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-2[2] - state is now member (was (null))
Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching for stonith topology changes
Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added 'watchdog' to the device list (1 active devices)
Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-3[3] - state is now member (was (null))
Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-1[1] - state is now member (was (null))
Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New watchdog timeout 10s (was 0s)
Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on watchdog integration for fencing
Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added 'MyStonith' to the device list (2 active devices)
Jun 25 15:41:54 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-1[1] - state is now lost (was member)
Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing node-1/1 from the membership list
Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1 peers with id=1 and/or uname=node-1 from the membership cache
Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device '(any)'
Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating remote operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence (off) node-1: static-list
Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith can fence (off) node-1: dynamic-list
Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence (off) node-1: static-list
Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation off of node-1 by node-2 for stonith_admin.3288@node-2.848cd1e9: OK
Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification (3102-3288-12): Broken pipe (32)
Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)

############################
ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
0       node-3  clear
1       node-2  clear
2       node-1  off     node-2

############################
ssh node-1 -c sudo su - -c 'uptime'
 15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11

Cheers,

Marcin