On 06/28/2016 11:24 AM, Marcin Dulak wrote:
>
> On Tue, Jun 28, 2016 at 5:04 AM, Andrew Beekhof <abeek...@redhat.com> wrote:
>
> > On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak <marcin.du...@gmail.com> wrote:
> > > Hi,
> > >
> > > I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> > > CentOS7 cluster built in VirtualBox.
> > > The complete setup is available at
> > > https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> > > so hopefully with some help I'll be able to make it work.
> > >
> > > Question 1:
> > > The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
> > > https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> > > Will SBD fencing work with that type of storage?
> >
> > unknown
> >
> > > I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> > > $ vagrant up  # takes ~15 minutes
> > >
> > > The setup brings up the nodes, installs the necessary packages, and
> > > prepares for the configuration of the pcs cluster.
> > > You can see which scripts the nodes execute at the bottom of the
> > > Vagrantfile.
> > > While there is 'yum -y install sbd' on CentOS7, the fence_sbd agent
> > > has not been packaged yet.
> >
> > you're not supposed to use it
> >
> > > Therefore I rebuilt the Fedora 24 package using the latest
> > > https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > > plus the update to fence_sbd from
> > > https://github.com/ClusterLabs/fence-agents/pull/73
> > >
> > > The configuration is inspired by
> > > https://www.novell.com/support/kb/doc.php?id=7009485 and
> > > https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> > >
> > > Question 2:
> > > After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit
> > > I expect that with just one stonith resource configured
> >
> > there shouldn't be any stonith resources configured
>
> It's a test setup.
> Found https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> crm configure
> property stonith-enabled="true"
> property stonith-timeout="40s"
> primitive stonith_sbd stonith:external/sbd op start interval="0" timeout="15" start-delay="10"
> commit
> quit
For what is supported (self-fencing by watchdog) the stonith resource is
just not needed, because sbd and pacemaker interact via the cib.

> and trying to configure CentOS7 similarly.
>
> > > a node will be fenced when I stop pacemaker and corosync (`pcs cluster
> > > stop node-1`) or just run `stonith_admin -F node-1`, but this is not
> > > the case.
> > >
> > > As can be seen below from uptime, node-1 is not shut down by
> > > `pcs cluster stop node-1` executed on itself.
> > > I found some discussions on users@clusterlabs.org about whether a node
> > > running the SBD resource can fence itself,
> > > but the conclusion was not clear to me.
> >
> > on RHEL and derivatives it can ONLY fence itself. the disk-based poison
> > pill isn't supported yet
>
> once it's supported on RHEL I'll be ready :)

Meaning "not supported" in this case doesn't (just) mean that you will
receive very limited help - if any at all - but that sbd is built with
"--disable-shared-disk". So unless you rebuild the package accordingly
(landing you at the other kind of "not supported" then ;-) ), testing with
a block device won't make much sense, I guess.
I'm already a little surprised that you get what you get ;-)

> > > Question 3:
> > > Neither is node-1 fenced by `stonith_admin -F node-1` executed on
> > > node-2, despite /var/log/messages on node-2 (the one currently
> > > running MyStonith) reporting:
> > > ...
> > > notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for
> > > host 'node-1' with device 'MyStonith' returned: 0 (OK)
> > > ...
> > > What is happening here?
> >
> > have you tried looking at the sbd logs?
> > is the watchdog device functioning correctly?
>
> it turned out (suggested here
> http://clusterlabs.org/pipermail/users/2016-June/003355.html) that the
> reason for node-1 not being fenced by `stonith_admin -F node-1` executed
> on node-2 was the previously executed `pcs cluster stop node-1`. In my
> setup SBD seems integrated with corosync/pacemaker, and the latter
> command stopped the sbd service on node-1.
> Killing corosync on node-1 instead of `pcs cluster stop node-1` fences
> node-1 as expected:
>
> [root@node-1 ~]# killall -15 corosync
> Broadcast message from systemd-journald@node-1 (Sat 2016-06-25 21:55:07 EDT):
> sbd[4761]: /dev/sdb1: emerg: do_exit: Rebooting system: off
>
> I'm left with further questions: how do I set up fence_sbd so that the
> fenced node shuts down instead of rebooting?
> Both action=off and mode=onoff action=off passed to fence_sbd when
> creating the MyStonith resource result in a reboot.
>
> [root@node-2 ~]# pcs stonith show MyStonith
>  Resource: MyStonith (class=stonith type=fence_sbd)
>   Attributes: devices=/dev/sdb1 power_timeout=21 action=off
>   Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
>
> [root@node-2 ~]# pcs status
> Cluster name: mycluster
> Last updated: Tue Jun 28 04:55:43 2016    Last change: Tue Jun 28 04:48:03 2016 by root via cibadmin on node-1
> Stack: corosync
> Current DC: node-3 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
> 3 nodes and 1 resource configured
>
> Online: [ node-1 node-2 node-3 ]
>
> Full list of resources:
>
>  MyStonith      (stonith:fence_sbd):    Started node-2
>
> PCSD Status:
>   node-1: Online
>   node-2: Online
>   node-3: Online
>
> Daemon Status:
>   corosync: active/disabled
>   pacemaker: active/disabled
>   pcsd: active/enabled
>
> Starting from the above cluster state,
> [root@node-2 ~]# stonith_admin -F node-1
> also results in a reboot of node-1 instead of a shutdown.
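
One thing that might be worth a try (a sketch only, untested; pcmk_reboot_action
is a generic fencing attribute interpreted by pacemaker rather than anything
fence_sbd specific): when the cluster itself fences a lost node it requests a
'reboot' by default, independent of the action= you put on the resource, so
mapping reboot requests to 'off' at the device level may get you the shutdown
behaviour you are after:

# hypothetical tweak: have 'reboot' requests against this device executed as 'off'
pcs stonith update MyStonith pcmk_reboot_action=off

Your manual `stonith_admin -F node-1` already asks for 'off'; the reboot you
see seems to come from the DC reacting to the node disappearing, which is what
the log below shows.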
> /var/log/messages after the last command shows "reboot" on node-2:
> ...
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: Client stonith_admin.3179.fbc038ee wants to fence (off) 'node-1' with device '(any)'
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: Initiating remote operation off for node-1: 8aea4f12-538d-41ab-bf20-0c8b0f72e2a3 (0)
> Jun 28 04:49:39 localhost stonith-ng[3081]: notice: watchdog can not fence (off) node-1: static-list
> Jun 28 04:49:40 localhost stonith-ng[3081]: notice: MyStonith can fence (off) node-1: dynamic-list
> Jun 28 04:49:40 localhost stonith-ng[3081]: notice: watchdog can not fence (off) node-1: static-list
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice: crm_update_peer_proc: Node node-1[1] - state is now lost (was member)
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice: Removing node-1/1 from the membership list
> Jun 28 04:49:44 localhost stonith-ng[3081]: notice: Purged 1 peers with id=1 and/or uname=node-1 from the membership cache
> Jun 28 04:49:45 localhost stonith-ng[3081]: notice: MyStonith can fence (reboot) node-1: dynamic-list
> Jun 28 04:49:45 localhost stonith-ng[3081]: notice: watchdog can not fence (reboot) node-1: static-list
> Jun 28 04:49:46 localhost stonith-ng[3081]: notice: Operation reboot of node-1 by node-3 for crmd.3063@node-3.36859c4e: OK
> Jun 28 04:50:00 localhost stonith-ng[3081]: notice: Operation 'off' [3200] (call 2 from stonith_admin.3179) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
> Jun 28 04:50:00 localhost stonith-ng[3081]: notice: Operation off of node-1 by node-2 for stonith_admin.3179@node-2.8aea4f12: OK
> ...
>
> Another question (I think it is valid also for a potential SUSE setup):
> what is the proper way of operating a cluster with SBD after node-1 was
> fenced?
>
> [root@node-2 ~]# sbd -d /dev/sdb1 list
> 0       node-3  clear
> 1       node-2  clear
> 2       node-1  off     node-2
>
> I found that executing sbd watch on node-1 clears the SBD status:
> [root@node-1 ~]# sbd -d /dev/sdb1 watch
> [root@node-1 ~]# sbd -d /dev/sdb1 list
> 0       node-3  clear
> 1       node-2  clear
> 2       node-1  clear
> Making sure that sbd is not running on node-1 (I can do that because
> node-1 is currently not a part of the cluster):
> [root@node-1 ~]# killall -15 sbd
> I have to kill sbd because it's integrated with corosync, and corosync
> fails to start on node-1 with sbd already running.
>
> I can now join node-1 to the cluster from node-2:
> [root@node-2 ~]# pcs cluster start node-1
>
> Marcin
>
> > > Question 4 (for the future):
> > > Assuming node-1 was fenced, what is the way of operating SBD?
> > > I see the sbd list now shows:
> > > 0       node-3  clear
> > > 1       node-1  off     node-2
> > > 2       node-2  clear
> > > How do I clear the status of node-1?
> > >
> > > Question 5 (also for the future):
> > > While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented at
> > > https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
> > > is clearly described, I wonder about the relation of 'stonith-timeout'
> > > to other timeouts, like the 'monitor interval=60s' reported by
> > > `pcs stonith show MyStonith`.
> > >
> > > Here is how I configure the cluster and test it. The run.sh script is
> > > attached.
> > >
> > > $ sh -x run01.sh 2>&1 | tee run01.txt
> > >
> > > with the result:
> > >
> > > $ cat run01.txt
> > >
> > > Each block below shows the executed ssh command and the result.
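
On Question 5: as far as I understand it, the monitor interval=60s is
unrelated to the fencing timeout; it only determines how often pacemaker
re-checks that the MyStonith device itself is still usable. stonith-timeout
only has to cover the disk's msgwait plus the ~20% margin from the SUSE doc,
so with the 'sbd dump' values from your transcript below:

# Timeout (msgwait) = 20s  ->  stonith-timeout >= 20s * 1.2 = 24s
pcs property set stonith-timeout=24s

which is the 24s you already set.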
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password node-1 node-2 node-3'
> > > node-1: Authorized
> > > node-3: Authorized
> > > node-2: Authorized
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1 node-2 node-3'
> > > Shutting down pacemaker/corosync services...
> > > Redirecting to /bin/systemctl stop pacemaker.service
> > > Redirecting to /bin/systemctl stop corosync.service
> > > Killing any remaining services...
> > > Removing all cluster configuration files...
> > > node-1: Succeeded
> > > node-2: Succeeded
> > > node-3: Succeeded
> > > Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
> > > node-1: Success
> > > node-3: Success
> > > node-2: Success
> > > Restaring pcsd on the nodes in order to reload the certificates...
> > > node-1: Success
> > > node-3: Success
> > > node-2: Success
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs cluster start --all'
> > > node-3: Starting Cluster...
> > > node-2: Starting Cluster...
> > > node-1: Starting Cluster...
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
> > > Printing ring status.
> > > Local node ID 1
> > > RING ID 0
> > >         id      = 192.168.10.11
> > >         status  = ring 0 active with no faults
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs status corosync'
> > > Membership information
> > > ----------------------
> > >     Nodeid      Votes Name
> > >          1          1 node-1 (local)
> > >          2          1 node-2
> > >          3          1 node-3
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs status'
> > > Cluster name: mycluster
> > > WARNING: no stonith devices and stonith-enabled is not false
> > > Last updated: Sat Jun 25 15:40:51 2016    Last change: Sat Jun 25 15:40:33 2016 by hacluster via crmd on node-2
> > > Stack: corosync
> > > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
> > > 3 nodes and 0 resources configured
> > > Online: [ node-1 node-2 node-3 ]
> > > Full list of resources:
> > > PCSD Status:
> > >   node-1: Online
> > >   node-2: Online
> > >   node-3: Online
> > > Daemon Status:
> > >   corosync: active/disabled
> > >   pacemaker: active/disabled
> > >   pcsd: active/enabled
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > > 0       node-3  clear
> > > 1       node-2  clear
> > > 2       node-1  clear
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 dump'
> > > ==Dumping header on disk /dev/sdb1
> > > Header version     : 2.1
> > > UUID               : 79f28167-a207-4f2a-a723-aa1c00bf1dee
> > > Number of slots    : 255
> > > Sector size        : 512
> > > Timeout (watchdog) : 10
> > > Timeout (allocate) : 2
> > > Timeout (loop)     : 1
> > > Timeout (msgwait)  : 20
> > > ==Header on disk /dev/sdb1 is dumped
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs stonith list'
> > > fence_sbd - Fence agent for sbd
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 power_timeout=21 action=off'
> > > ssh node-1 -c sudo su - -c 'pcs property set stonith-enabled=true'
> > > ssh node-1 -c sudo su - -c 'pcs property set stonith-timeout=24s'
> > > ssh node-1 -c sudo su - -c 'pcs property'
> > > Cluster Properties:
> > >  cluster-infrastructure: corosync
> > >  cluster-name: mycluster
> > >  dc-version: 1.1.13-10.el7_2.2-44eb2dd
> > >  have-watchdog: true
> > >  stonith-enabled: true
> > >  stonith-timeout: 24s
> > >  stonith-watchdog-timeout: 10s
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs stonith show MyStonith'
> > >  Resource: MyStonith (class=stonith type=fence_sbd)
> > >   Attributes: devices=/dev/sdb1 power_timeout=21 action=off
> > >   Operations: monitor interval=60s (MyStonith-monitor-interval-60s)
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'pcs cluster stop node-1'
> > > node-1: Stopping Cluster (pacemaker)...
> > > node-1: Stopping Cluster (corosync)...
> > >
> > > ############################
> > > ssh node-2 -c sudo su - -c 'pcs status'
> > > Cluster name: mycluster
> > > Last updated: Sat Jun 25 15:42:29 2016    Last change: Sat Jun 25 15:41:09 2016 by root via cibadmin on node-1
> > > Stack: corosync
> > > Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
> > > 3 nodes and 1 resource configured
> > > Online: [ node-2 node-3 ]
> > > OFFLINE: [ node-1 ]
> > > Full list of resources:
> > >  MyStonith      (stonith:fence_sbd):    Started node-2
> > > PCSD Status:
> > >   node-1: Online
> > >   node-2: Online
> > >   node-3: Online
> > > Daemon Status:
> > >   corosync: active/disabled
> > >   pacemaker: active/disabled
> > >   pcsd: active/enabled
> > >
> > > ############################
> > > ssh node-2 -c sudo su - -c 'stonith_admin -F node-1'
> > >
> > > ############################
> > > ssh node-2 -c sudo su - -c 'grep stonith-ng /var/log/messages'
> > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Additional logging available in /var/log/cluster/corosync.log
> > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: Connecting to cluster infrastructure: corosync
> > > Jun 25 15:40:11 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-2[2] - state is now member (was (null))
> > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Watching for stonith topology changes
> > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: Added 'watchdog' to the device list (1 active devices)
> > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-3[3] - state is now member (was (null))
> > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-1[1] - state is now member (was (null))
> > > Jun 25 15:40:12 localhost stonith-ng[3102]: notice: New watchdog timeout 10s (was 0s)
> > > Jun 25 15:41:03 localhost stonith-ng[3102]: notice: Relying on watchdog integration for fencing
> > > Jun 25 15:41:04 localhost stonith-ng[3102]: notice: Added 'MyStonith' to the device list (2 active devices)
> > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: crm_update_peer_proc: Node node-1[1] - state is now lost (was member)
> > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Removing node-1/1 from the membership list
> > > Jun 25 15:41:54 localhost stonith-ng[3102]: notice: Purged 1 peers with id=1 and/or uname=node-1 from the membership cache
> > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Client stonith_admin.3288.eb400ac9 wants to fence (off) 'node-1' with device '(any)'
> > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: Initiating remote operation off for node-1: 848cd1e9-55e4-4abc-8d7a-3762eaaf9ab4 (0)
> > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence (off) node-1: static-list
> > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: MyStonith can fence (off) node-1: dynamic-list
> > > Jun 25 15:42:33 localhost stonith-ng[3102]: notice: watchdog can not fence (off) node-1: static-list
> > > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host 'node-1' with device 'MyStonith' returned: 0 (OK)
> > > Jun 25 15:42:54 localhost stonith-ng[3102]: notice: Operation off of node-1 by node-2 for stonith_admin.3288@node-2.848cd1e9: OK
> > > Jun 25 15:42:54 localhost stonith-ng[3102]: warning: new_event_notification (3102-3288-12): Broken pipe (32)
> > > Jun 25 15:42:54 localhost stonith-ng[3102]: warning: st_notify_fence notification of client stonith_admin.3288.eb400a failed: Broken pipe (-32)
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'sbd -d /dev/sdb1 list'
> > > 0       node-3  clear
> > > 1       node-2  clear
> > > 2       node-1  off     node-2
> > >
> > > ############################
> > > ssh node-1 -c sudo su - -c 'uptime'
> > >  15:43:31 up 21 min,  2 users,  load average: 0.25, 0.18, 0.11
> > >
> > > Cheers,
> > >
> > > Marcin
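
On Question 4: besides starting 'sbd watch' on the node again, as you did,
the slot can - if I remember the sbd tooling correctly - also be cleared
explicitly from any node that still sees the disk, along the lines of:

# sketch: clear node-1's slot on the shared device
sbd -d /dev/sdb1 message node-1 clear

And to repeat the caveat from above: with the sbd build shipped in CentOS 7
the only supported mode is watchdog-only self-fencing. A minimal sketch of
that setup (untested here, assuming the stock sbd package and a
softdog-backed /dev/watchdog) would be roughly:

# on every node: provide a watchdog device (softdog if there is no hardware watchdog)
modprobe softdog
# in /etc/sysconfig/sbd: point sbd at the watchdog and leave SBD_DEVICE unset (no shared disk)
SBD_WATCHDOG_DEV=/dev/watchdog
# enable sbd so it is started together with the cluster
systemctl enable sbd
# cluster-wide; no stonith resource is needed in this mode
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=10s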