On Tue, Jun 28, 2016 at 5:04 AM, Andrew Beekhof <abeek...@redhat.com> wrote:
> On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <marcin.du...@gmail.com> wrote:
> > Hi,
> >
> > I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> > CentOS7 cluster built in VirtualBox.
> > The complete setup is available at
> > https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> > so hopefully with some help I'll be able to make it work.
> >
> > Question 1:
> > The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
> > https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> > Will SBD fencing work with that type of storage?
>
> unknown
>
> > I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> > $ vagrant up  # takes ~15 minutes
> >
> > The setup brings up the nodes, installs the necessary packages, and
> > prepares for the configuration of the pcs cluster.
> > You can see which scripts the nodes execute at the bottom of the
> > Vagrantfile.
> > While there is 'yum -y install sbd' on CentOS7, the fence_sbd agent has
> > not been packaged yet.
>
> you're not supposed to use it
>
> > Therefore I rebuilt the Fedora 24 package using the latest
> > https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > plus the update to fence_sbd from
> > https://github.com/ClusterLabs/fence-agents/pull/73
> >
> > The configuration is inspired by
> > https://www.novell.com/support/kb/doc.php?id=7009485 and
> > https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> >
> > Question 2:
> > After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> > expect with just one stonith resource configured
>
> there shouldn't be any stonith resources configured

It's a test setup. I found
https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html

crm configure
property stonith-enabled="true"
property stonith-timeout="40s"
primitive stonith_sbd stonith:external/sbd op start interval="0" timeout="15" start-delay="10"
commit
quit

and I'm trying to configure CentOS7 similarly.

> > a node will be fenced when I stop pacemaker and corosync with `pcs cluster
> > stop node-1` or just `stonith_admin -F node-1`, but this is not the case.
> >
> > As can be seen below from uptime, node-1 is not shut down by `pcs cluster
> > stop node-1` executed on itself.
> > I found some discussions on users@clusterlabs.org about whether a node
> > running an SBD resource can fence itself,
> > but the conclusion was not clear to me.
>
> on RHEL and derivatives it can ONLY fence itself. the disk based
> poison pill isn't supported yet

once it's supported on RHEL I'll be ready :)

> > Question 3:
> > Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
> > despite the fact that /var/log/messages on node-2 (the one currently
> > running MyStonith) reports:
> > ...
> > notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> > 'node-1' with device 'MyStonith' returned: 0 (OK)
> > ...
> > What is happening here?
>
> have you tried looking at the sbd logs?
> is the watchdog device functioning correctly?

It turned out (suggested here
http://clusterlabs.org/pipermail/users/2016-June/003355.html) that the reason
for node-1 not being fenced by `stonith_admin -F node-1` executed on node-2
was the previously executed `pcs cluster stop node-1`. In my setup SBD is
integrated with corosync/pacemaker, and the latter command stopped the sbd
service on node-1.
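A quick way to see this coupling on the node (just how I would check it,
assuming the stock CentOS7 sbd unit file, which I haven't inspected in detail):

[root@node-1 ~]# systemctl cat sbd.service
[root@node-1 ~]# systemctl show sbd.service -p PartOf
# I'd expect something like PartOf=corosync.service here, which would explain
# why `pcs cluster stop node-1` also stops sbd on node-1.
[root@node-1 ~]# systemctl is-active sbd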
Killing corosync on node-1 instead of `pcs cluster stop node-1` fences node-1
as expected:

[root@node-1 ~]# killall -15 corosync

Broadcast message from systemd-journald@node-1 (Sat 2016-06-25 21:55:07 EDT):

sbd[4761]:  /dev/sdb1:    emerg: do_exit: Rebooting system: off

I'm left with further questions: how do I set up fence_sbd so that the fenced
node shuts down instead of rebooting? Both action=off and mode=onoff action=off
passed to fence_sbd when creating the MyStonith resource result in a reboot.

[root@node-2 ~]# pcs stonith show MyStonith
 Resource: MyStonith (class=stonith type=fence_sbd)
  Attributes: devices=/dev/sdb1 power_timeout=21 action=off
  Operations: monitor interval=60s (MyStonith-monitor-interval-60s)

[root@node-2 ~]# pcs status
Cluster name: mycluster
Last updated: Tue Jun 28 04:55:43 2016          Last change: Tue Jun 28 04:48:03 2016 by root via cibadmin on node-1
Stack: corosync
Current DC: node-3 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
3 nodes and 1 resource configured

Online: [ node-1 node-2 node-3 ]

Full list of resources:

 MyStonith      (stonith:fence_sbd):    Started node-2

PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Starting from the above cluster state:

[root@node-2 ~]# stonith_admin -F node-1

also results in a reboot of node-1 instead of a shutdown.
/var/log/messages after the last command shows "reboot" on node-2:

...
Jun 28 04:49:39 localhost stonith-ng[3081]:  notice: Client stonith_admin.3179.fbc038ee wants to fence (off) 'node-1' with device '(any)'
Jun 28 04:49:39 localhost stonith-ng[3081]:  notice: Initiating remote operation off for node-1: 8aea4f12-538d-41ab-bf20-0c8b0f72e2a3 (0)
Jun 28 04:49:39 localhost stonith-ng[3081]:  notice: watchdog can not fence (off) node-1: static-list
Jun 28 04:49:40 localhost stonith-ng[3081]:  notice: MyStonith can fence (off) node-1: dynamic-list
Jun 28 04:49:40 localhost stonith-ng[3081]:  notice: watchdog can not fence (off) node-1: static-list
Jun 28 04:49:44 localhost stonith-ng[3081]:  notice: crm_update_peer_proc: Node node-1[1] - state is now lost (was member)
Jun 28 04:49:44 localhost stonith-ng[3081]:  notice: Removing node-1/1 from the membership list
Jun 28 04:49:44 localhost stonith-ng[3081]:  notice: Purged 1 peers with id=1 and/or uname=node-1 from the membership cache
Jun 28 04:49:45 localhost stonith-ng[3081]:  notice: MyStonith can fence (reboot) node-1: dynamic-list
Jun 28 04:49:45 localhost stonith-ng[3081]:  notice: watchdog can not fence (reboot) node-1: static-list
Jun 28 04:49:46 localhost stonith-ng[3081]:  notice: Operation reboot of node-1 by node-3 for crmd.3063@node-3.36859c4e: OK
Jun 28 04:50:00 localhost stonith-ng[3081]:  notice: Operation 'off' [3200] (call 2 from stonith_admin.3179) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
Jun 28 04:50:00 localhost stonith-ng[3081]:  notice: Operation off of node-1 by node-2 for stonith_admin.3179@node-2.8aea4f12: OK
...

Another question (I think it is also valid for a potential SUSE setup):
what is the proper way of operating a cluster with SBD after node-1 was fenced?
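A guess about the reboot-vs-off behaviour above, untested on this setup, so
please treat it as an assumption: the log shows that besides my manual 'off'
request the cluster itself (crmd on node-3) fenced the lost node, and that
operation was a 'reboot'. If so, the per-device action=off only covers requests
that already ask for 'off', and something like

[root@node-2 ~]# pcs property set stonith-action=off
[root@node-2 ~]# pcs stonith update MyStonith pcmk_reboot_action=off

might be needed to turn the cluster-initiated fencing into a shutdown as well
(the property and resource names are the ones from my setup), but I haven't
verified that this is the intended way.

As for operating the cluster after a node was fenced, what I actually did is
shown below; I suspect the slot could also be cleared directly with
`sbd -d /dev/sdb1 message node-1 clear`, but I haven't tried that.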
[root@node-2 ~]# sbd -d /dev/sdb1 list
0       node-3  clear
1       node-2  clear
2       node-1  off     node-2

I found that executing sbd watch on node-1 clears the SBD status:

[root@node-1 ~]# sbd -d /dev/sdb1 watch
[root@node-1 ~]# sbd -d /dev/sdb1 list
0       node-3  clear
1       node-2  clear
2       node-1  clear

Making sure that sbd is not running on node-1 (I can do that because node-1 is
currently not a part of the cluster):

[root@node-1 ~]# killall -15 sbd

I have to kill sbd because it's integrated with corosync, and corosync fails to
start on node-1 with sbd already running. I can now join node-1 to the cluster
from node-2:

[root@node-2 ~]# pcs cluster start node-1

Marcin

> >
> > Question 4 (for the future):
> > Assuming the node-1 was fenced, what is the way of operating SBD?
> > I see the sbd lists now:
> > 0       node-3  clear
> > 1       node-1  off     node-2
> > 2       node-2  clear
> > How to clear the status of node-1?
> >
> > Question 5 (also for the future):
> > While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented at
> > https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> > is clearly described, I wonder about the relation of 'stonith-timeout'
> > to other timeouts like the 'monitor interval=60s' reported by `pcs stonith
> > show MyStonith`.
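(A partial answer to my own Question 5, worked out while re-reading this: the
sbd dump below reports Timeout (msgwait) = 20, so the SUSE rule
'stonith-timeout = msgwait + 20%' gives 20 s * 1.2 = 24 s, which is the value I
set later with `pcs property set stonith-timeout=24s`. My assumption, which I'd
like confirmed, is that the monitor interval=60s of MyStonith is unrelated to
these fencing timeouts and only determines how often the fencing device itself
is re-checked.)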
> >
> > Here is how I configure the cluster and test it. The run.sh script is
> > attached.
> >
> > $ sh -x run01.sh 2>&1 | tee run01.txt
> >
> > with the result:
> >
> > $ cat run01.txt
> >
> > Each block below shows the executed ssh command and the result.
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1 node-2 node-3'
> > node-1: Authorized
> > node-3: Authorized
> > node-2: Authorized
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2 node-3'
> > Shutting down pacemaker/corosync services...
> > Redirecting to /bin/systemctl stop  pacemaker.service
> > Redirecting to /bin/systemctl stop  corosync.service
> > Killing any remaining services...
> > Removing all cluster configuration files...
> > node-1: Succeeded
> > node-2: Succeeded
> > node-3: Succeeded
> > Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> > node-1: Success
> > node-3: Success
> > node-2: Success
> > Restaring pcsd on the nodes in order to reload the certificates...
> > node-1: Success
> > node-3: Success
> > node-2: Success
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> > node-3: Starting Cluster...
> > node-2: Starting Cluster...
> > node-1: Starting Cluster...
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> > Printing ring status.
> > Local node ID 1
> > RING ID 0
> >         id      = 192.168.10.11
> >         status  = ring 0 active with no faults
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs status corosync'
> > Membership information
> > ----------------------
> >     Nodeid      Votes Name
> >          1          1 node-1 (local)
> >          2          1 node-2
> >          3          1 node-3
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs status'
> > Cluster name: mycluster
> > WARNING: no stonith devices and stonith-enabled is not false
> > Last updated: Sat Jun 25 15:40:51 2016          Last change: Sat Jun 25 15:40:33 2016 by hacluster via crmd on node-2
> > Stack: corosync
> > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
> > 3 nodes and 0 resources configured
> >
> > Online: [ node-1 node-2 node-3 ]
> >
> > Full list of resources:
> >
> > PCSD Status:
> >   node-1: Online
> >   node-2: Online
> >   node-3: Online
> >
> > Daemon Status:
> >   corosync: active/disabled
> >   pacemaker: active/disabled
> >   pcsd: active/enabled
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > 0       node-3  clear
> > 1       node-2  clear
> > 2       node-1  clear
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> > ==Dumping header on disk /dev/sdb1
> > Header version     : 2.1
> > UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> > Number of slots    : 255
> > Sector size        : 512
> > Timeout (watchdog) : 10
> > Timeout (allocate) : 2
> > Timeout (loop)     : 1
> > Timeout (msgwait)  : 20
> > ==Header on disk /dev/sdb1 is dumped
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith list'
> > fence_sbd - Fence agent for sbd
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 power_timeout=21 action=off'
> > ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> > ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> > ssh node-1 -c sudo su - -c 'pcs property'
> > Cluster Properties:
> >  cluster-infrastructure: corosync
> >  cluster-name: mycluster
> >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
> >  have-watchdog: true
> >  stonith-enabled: true
> >  stonith-timeout: 24s
> >  stonith-watchdog-timeout: 10s
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
> >  Resource: MyStonith (class=stonith type=fence_sbd)
> >   Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> >   Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'pcs cluster stop node-1 '
> > node-1: Stopping Cluster (pacemaker)...
> > node-1: Stopping Cluster (corosync)...
> > > > > > > > ############################ > > ssh node-2 -c sudo su - -c 'pcs status' > > Cluster name: mycluster > > Last updated: Sat Jun 25 15:42:29 2016 Last change: Sat Jun 25 > > 15:41:09 2016 by root via cibadmin on node-1 > > Stack: corosync > > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with > > quorum > > 3 nodes and 1 resource configured > > Online: [ node-2 node-3 ] > > OFFLINE: [ node-1 ] > > Full list of resources: > > MyStonith (stonith:fence_sbd): Started node-2 > > PCSD Status: > > node-1: Online > > node-2: Online > > node-3: Online > > Daemon Status: > > corosync: active/disabled > > pacemaker: active/disabled > > pcsd: active/enabled > > > > > > > > ############################ > > ssh node-2 -c sudo su - -c 'stonith_admin -F node-1 ' > > > > > > > > ############################ > > ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages' > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional logging > > available in /var/log/cluster/corosync.log > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting to > cluster > > infrastructure: corosync > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: > crm_update_peer_proc: > > Node node-2[2] - state is now member (was (null)) > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching for stonith > > topology changes > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added 'watchdog' to > the > > device list (1 active devices) > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: > crm_update_peer_proc: > > Node node-3[3] - state is now member (was (null)) > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: > crm_update_peer_proc: > > Node node-1[1] - state is now member (was (null)) > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New watchdog timeout > > 10s (was 0s) > > Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on watchdog > > integration for fencing > > Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added 'MyStonith' to > > the device list (2 active devices) > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: > crm_update_peer_proc: > > Node node-1[1] - state is now lost (was member) > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing node-1/1 > from > > the membership list > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1 peers with > > id=1 and/or uname=node-1 from the membership cache > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client > > stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device > > '(any)' > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating remote > > operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0) > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not > fence > > (off) node-1: static-list > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith can fence > > (off) node-1: dynamic-list > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not > fence > > (off) node-1: static-list > > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation 'off' > [3309] > > (call 2 from stonith_admin.3288) for host 'node-1' with device > 'MyStonith' > > returned: 0 (OK) > > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation off of > node-1 > > by node-2 for stonith_admin.3288@node-2.848cd1e9: OK > > Jun 25 15:42:54 localhost stonith-ng[3102]: warning: > new_event_notification > > (3102-3288-12): Broken pipe (32) > > Jun 25 15:42:54 localhost stonith-ng[3102]: warning: 
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > 0       node-3  clear
> > 1       node-2  clear
> > 2       node-1  off     node-2
> >
> > ############################
> > ssh node-1 -c sudo su - -c 'uptime'
> >  15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11
> >
> > Cheers,
> >
> > Marcin
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org