Re: [ClusterLabs] problems with a CentOS7 SBD cluster

2016-06-28 Thread Klaus Wenninger
On 06/28/2016 11:24 AM, Marcin Dulak wrote:
>
>
> On Tue, Jun 28, 2016 at 5:04 AM, Andrew Beekhof wrote:
>
> > On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak wrote:
> > Hi,
> >
> > I'm trying to get familiar with STONITH Block Devices (SBD) on a
> 3-node
> > CentOS7 built in VirtualBox.
> > The complete setup is available at
> > https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> > so hopefully with some help I'll be able to make it work.
> >
> > Question 1:
> > The shared device /dev/sbd1 is the VirtualBox's "shareable hard
> disk"
> > https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> > will SBD fencing work with that type of storage?
>
> unknown
>
> >
> > I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> > $ vagrant up  # takes ~15 minutes
> >
> > The setup brings up the nodes, installs the necessary packages,
> and prepares
> > for the configuration of the pcs cluster.
> > You can see which scripts the nodes execute at the bottom of the
> > Vagrantfile.
> > While there is 'yum -y install sbd' on CentOS7 the fence_sbd
> agent has not
> > been packaged yet.
>
> you're not supposed to use it
>
> > Therefore I rebuilt the Fedora 24 package using the latest
> > https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > plus the update to the fence_sbd from
> > https://github.com/ClusterLabs/fence-agents/pull/73
> >
> > The configuration is inspired by
> > https://www.novell.com/support/kb/doc.php?id=7009485 and
> >
> 
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> >
> > Question 2:
> > After reading
> http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> > expect with just one stonith resource configured
>
> there shouldn't be any stonith resources configured
>
>
> It's a test setup.
> Found https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
>
> crm configure
> property stonith-enabled="true"
> property stonith-timeout="40s"
> primitive stonith_sbd stonith:external/sbd op start interval="0"
> timeout="15" start-delay="10"
> commit
> quit

For what is supported (self-fencing via the watchdog) the stonith resource is
simply not needed, because sbd and pacemaker interact through the CIB.
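Roughly, a watchdog-only setup on RHEL/CentOS could look like this (the
watchdog device name and the timeout values below are just illustrative):

# /etc/sysconfig/sbd -- watchdog only, no SBD_DEVICE
SBD_WATCHDOG_DEV=/dev/watchdog
SBD_WATCHDOG_TIMEOUT=5

# enable sbd so it is started together with the cluster stack
systemctl enable sbd

# let pacemaker assume that a lost node has self-fenced once the
# watchdog timeout has expired
pcs property set stonith-enabled=true
pcs property set stonith-watchdog-timeout=10s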

>
>
> and trying to configure CentOS7 similarly.
>  
>
>
> > a node will be fenced when I stop pacemaker and corosync `pcs
> cluster stop
> > node-1` or just `stonith_admin -F node-1`, but this is not the case.
> >
> > As can be seen below from uptime, the node-1 is not shutdown by
> `pcs cluster
> > stop node-1` executed on itself.
> > I found some discussions on users@clusterlabs.org about whether a node
> > running SBD resource can fence itself,
> > but the conclusion was not clear to me.
>
> on RHEL and derivatives it can ONLY fence itself. the disk based
> poison pill isn't supported yet
>
>
> once it's supported on RHEL I'll be ready :)
Meaning "not supported" in this case doesn't (just) mean that you will
receive little or no help, but that sbd is built with "--disable-shared-disk".
So unless you rebuild the package without that flag (making it the other kind
of not supported then ;-) ), testing with a block device won't make much
sense, I guess.
I'm already a little surprised that you get as far as you do ;-)
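If you want to go down that road anyway, a rough sketch of such a rebuild
could be the following (the exact spec-file layout is an assumption, so check
it before building):

# fetch the source rpm and its build dependencies
yumdownloader --source sbd
rpm -ivh sbd-*.src.rpm
yum-builddep -y ~/rpmbuild/SPECS/sbd.spec
# drop the --disable-shared-disk switch from the %configure call
# (assumes the spec passes it that way -- verify before building)
sed -i 's/--disable-shared-disk//' ~/rpmbuild/SPECS/sbd.spec
rpmbuild -ba ~/rpmbuild/SPECS/sbd.spec
yum -y localinstall ~/rpmbuild/RPMS/x86_64/sbd-*.rpm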
>
>
> >
> > Question 3:
> > Neither node-1 is fenced by `stonith_admin -F node-1` executed
> on node-2,
> > despite the fact
> > /var/log/messages on node-2 (the one currently running
> MyStonith) reporting:
> > ...
> > notice: Operation 'off' [3309] (call 2 from stonith_admin.3288)
> for host
> > 'node-1' with device 'MyStonith' returned: 0 (OK)
> > ...
> > What is happening here?
>
> have you tried looking at the sbd logs?
> is the watchdog device functioning correctly?
>
>
> it turned out (suggested here
> http://clusterlabs.org/pipermail/users/2016-June/003355.html) that the
> reason for node-1 not being fenced by `stonith_admin -F node-1`
> executed on node-2
> was the previously executed `pcs cluster stop node-1`. In my setup SBD
> seems integrated with corosync/pacemaker and the latter command
> stopped the sbd service on node-1.
> Killing corosync on node-1 instead of `pcs cluster stop node-1` fences
> node-1 as expected:
>
> [root@node-1 ~]# killall -15 corosync
> Broadcast message from systemd-journald@node-1 (Sat 2016-06-25
> 21:55:07 EDT):
> sbd[4761]:  /dev/sdb1:emerg: do_exit: Rebooting system: off
>
> I'm left with further questions: how to set up fence_sbd for the fenced
> node to shut down instead of rebooting?
> Both action=off or mode=onoff action=off options passed to fence_sbd when
> creating the MyStonith resource result in a reboot.

Re: [ClusterLabs] problems with a CentOS7 SBD cluster

2016-06-28 Thread Marcin Dulak
On Tue, Jun 28, 2016 at 5:04 AM, Andrew Beekhof wrote:

> On Sun, Jun 26, 2016 at 6:05 AM, Marcin Dulak wrote:
> > Hi,
> >
> > I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
> > CentOS7 built in VirtualBox.
> > The complete setup is available at
> > https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
> > so hopefully with some help I'll be able to make it work.
> >
> > Question 1:
> > The shared device /dev/sbd1 is the VirtualBox's "shareable hard disk"
> > https://www.virtualbox.org/manual/ch05.html#hdimagewrites
> > will SBD fencing work with that type of storage?
>
> unknown
>
> >
> > I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
> > $ vagrant up  # takes ~15 minutes
> >
> > The setup brings up the nodes, installs the necessary packages, and
> prepares
> > for the configuration of the pcs cluster.
> > You can see which scripts the nodes execute at the bottom of the
> > Vagrantfile.
> > While there is 'yum -y install sbd' on CentOS7 the fence_sbd agent has
> not
> > been packaged yet.
>
> you're not supposed to use it
>
> > Therefore I rebuilt the Fedora 24 package using the latest
> > https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
> > plus the update to the fence_sbd from
> > https://github.com/ClusterLabs/fence-agents/pull/73
> >
> > The configuration is inspired by
> > https://www.novell.com/support/kb/doc.php?id=7009485 and
> >
> https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html
> >
> > Question 2:
> > After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
> > expect with just one stonith resource configured
>
> there shouldn't be any stonith resources configured
>

It's a test setup. Found
https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html

crm configure
property stonith-enabled="true"
property stonith-timeout="40s"
primitive stonith_sbd stonith:external/sbd op start interval="0"
timeout="15" start-delay="10"
commit
quit

and trying to configure CentOS7 similarly.
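
The rough pcs equivalent I am aiming for looks like this (parameters copied
from my MyStonith resource shown further below, so they may well need tuning):

pcs property set stonith-enabled=true
pcs property set stonith-timeout=40s
pcs stonith create MyStonith fence_sbd devices=/dev/sdb1 power_timeout=21 \
    op monitor interval=60s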


>
> > a node will be fenced when I stop pacemaker and corosync `pcs cluster
> stop
> > node-1` or just `stonith_admin -F node-1`, but this is not the case.
> >
> > As can be seen below from uptime, the node-1 is not shutdown by `pcs
> cluster
> > stop node-1` executed on itself.
> > I found some discussions on users@clusterlabs.org about whether a node
> > running SBD resource can fence itself,
> > but the conclusion was not clear to me.
>
> on RHEL and derivatives it can ONLY fence itself. the disk based
> poison pill isn't supported yet
>

once it's supported on RHEL I'll be ready :)

>
> >
> > Question 3:
> > Neither node-1 is fenced by `stonith_admin -F node-1` executed on node-2,
> > despite the fact
> > /var/log/messages on node-2 (the one currently running MyStonith)
> reporting:
> > ...
> > notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
> > 'node-1' with device 'MyStonith' returned: 0 (OK)
> > ...
> > What is happening here?
>
> have you tried looking at the sbd logs?
> is the watchdog device functioning correctly?
>
>
It turned out (as suggested in
http://clusterlabs.org/pipermail/users/2016-June/003355.html) that the
reason node-1 was not fenced by `stonith_admin -F node-1` executed on node-2
was the previously executed `pcs cluster stop node-1`. In my setup SBD seems
to be integrated with corosync/pacemaker, and the latter command stopped the
sbd service on node-1.
Killing corosync on node-1 instead of `pcs cluster stop node-1` fences
node-1 as expected:

[root@node-1 ~]# killall -15 corosync
Broadcast message from systemd-journald@node-1 (Sat 2016-06-25 21:55:07
EDT):
sbd[4761]:  /dev/sdb1:emerg: do_exit: Rebooting system: off

I'm left with further questions: how to set up fence_sbd so that the fenced
node shuts down instead of rebooting?
Both action=off or mode=onoff action=off options passed to fence_sbd when
creating the MyStonith resource result in a reboot.
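
I also wonder whether the 'off' should rather be requested on the pacemaker
side; something along these lines looks plausible to me, but I have not
verified it:

# cluster-wide: make fencing default to 'off' instead of 'reboot'
pcs property set stonith-action=off

# or per device: map reboot requests to the agent's 'off' action
pcs stonith update MyStonith pcmk_reboot_action=off

# or ask for 'off' explicitly
pcs stonith fence node-1 --off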

[root@node-2 ~]# pcs stonith show MyStonith
 Resource: MyStonith (class=stonith type=fence_sbd)
  Attributes: devices=/dev/sdb1 power_timeout=21 action=off
  Operations: monitor interval=60s (MyStonith-monitor-interval-60s)

[root@node-2 ~]# pcs status
Cluster name: mycluster
Last updated: Tue Jun 28 04:55:43 2016    Last change: Tue Jun 28
04:48:03 2016 by root via cibadmin on node-1
Stack: corosync
Current DC: node-3 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
quorum
3 nodes and 1 resource configured

Online: [ node-1 node-2 node-3 ]

Full list of resources:

 MyStonith  (stonith:fence_sbd):  Started node-2

PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

Starting from the above cluster state:
[root@node-2 ~]# stonith_admin -F node-1
results also in a reboot of node-1 instead of a shutdown.

[ClusterLabs] problems with a CentOS7 SBD cluster

2016-06-25 Thread Marcin Dulak
Hi,

I'm trying to get familiar with STONITH Block Devices (SBD) on a 3-node
CentOS7 cluster built in VirtualBox.
The complete setup is available at
https://github.com/marcindulak/vagrant-sbd-tutorial-centos7.git
so hopefully with some help I'll be able to make it work.

Question 1:
The shared device /dev/sdb1 is a VirtualBox "shareable hard disk"
https://www.virtualbox.org/manual/ch05.html#hdimagewrites
Will SBD fencing work with that type of storage?
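
To rule out basic problems with the shared disk, I initialize it on one node
and read the header back on all of them, roughly:

# on node-1: write the SBD metadata to the shared disk
sbd -d /dev/sdb1 create
# on every node: the header and timeouts should look identical
sbd -d /dev/sdb1 dump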

I start the cluster using vagrant_1.8.1 and virtualbox-4.3 with:
$ vagrant up  # takes ~15 minutes

The setup brings up the nodes, installs the necessary packages, and
prepares for the configuration of the pcs cluster.
You can see which scripts the nodes execute at the bottom of the
Vagrantfile.
While 'yum -y install sbd' works on CentOS7, the fence_sbd agent has not
been packaged yet.
Therefore I rebuilt the Fedora 24 package using the latest
https://github.com/ClusterLabs/fence-agents/archive/v4.0.22.tar.gz
plus the update to the fence_sbd from
https://github.com/ClusterLabs/fence-agents/pull/73

The configuration is inspired by
https://www.novell.com/support/kb/doc.php?id=7009485 and
https://www.suse.com/documentation/sle-ha-12/book_sleha/data/sec_ha_storage_protect_fencing.html

Question 2:
After reading http://blog.clusterlabs.org/blog/2015/sbd-fun-and-profit I
expect that, with just one stonith resource configured, a node will be fenced
when I stop pacemaker and corosync with `pcs cluster stop node-1`, or just
run `stonith_admin -F node-1`, but this is not the case.

As can be seen below from uptime, node-1 is not shut down by `pcs
cluster stop node-1` executed on itself.
I found some discussions on users@clusterlabs.org about whether a node
running an SBD resource can fence itself,
but the conclusion was not clear to me.
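
Independently of that, this is roughly how I check that a watchdog device
exists and that the sbd daemon is running on each node (wdctl may report the
device as busy while sbd holds it open):

ls -l /dev/watchdog
wdctl /dev/watchdog        # from util-linux, shows driver and timeout
systemctl status sbd
journalctl -u sbd --since today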

Question 3:
Neither is node-1 fenced by `stonith_admin -F node-1` executed on node-2,
despite the fact that /var/log/messages on node-2 (the one currently running
MyStonith) reports:
...
notice: Operation 'off' [3309] (call 2 from stonith_admin.3288) for host
'node-1' with device 'MyStonith' returned: 0 (OK)
...
What is happening here?

Question 4 (for the future):
Assuming node-1 was fenced, what is the way of operating SBD afterwards?
I see `sbd list` now shows:
0   node-3  clear
1   node-1  off   node-2
2   node-2  clear
How to clear the status of node-1?
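
I guess something like the following would clear the slot, but I have not
verified it:

# from any node that can access the shared device
sbd -d /dev/sdb1 message node-1 clear
# check the slots again
sbd -d /dev/sdb1 list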

Question 5 (also for the future):
While the relation 'stonith-timeout = Timeout (msgwait) + 20%' presented
at
https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_storage_protect_fencing.html
is clearly described, I wonder how 'stonith-timeout' relates
to other timeouts, like the 'monitor interval=60s' reported by `pcs stonith
show MyStonith`.
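
For reference, this is how I read msgwait from the device and derive
stonith-timeout from it (the 40s/48s numbers are only an example):

# Timeout (msgwait) is stored in the SBD header on the shared device
sbd -d /dev/sdb1 dump | grep msgwait
# stonith-timeout = msgwait + 20%, e.g. msgwait=40 gives 48s
pcs property set stonith-timeout=48s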

Here is how I configure the cluster and test it. The run.sh script is
attached.

$ sh -x run01.sh 2>&1 | tee run01.txt

with the result:

$ cat run01.txt

Each block below shows the executed ssh command and the result.


ssh node-1 -c sudo su - -c 'pcs cluster auth -u hacluster -p password
node-1 node-2 node-3'
node-1: Authorized
node-3: Authorized
node-2: Authorized




ssh node-1 -c sudo su - -c 'pcs cluster setup --name mycluster node-1
node-2 node-3'
Shutting down pacemaker/corosync services...
Redirecting to /bin/systemctl stop  pacemaker.service
Redirecting to /bin/systemctl stop  corosync.service
Killing any remaining services...
Removing all cluster configuration files...
node-1: Succeeded
node-2: Succeeded
node-3: Succeeded
Synchronizing pcsd certificates on nodes node-1, node-2, node-3...
node-1: Success
node-3: Success
node-2: Success
Restaring pcsd on the nodes in order to reload the certificates...
node-1: Success
node-3: Success
node-2: Success




ssh node-1 -c sudo su - -c 'pcs cluster start --all'
node-3: Starting Cluster...
node-2: Starting Cluster...
node-1: Starting Cluster...




ssh node-1 -c sudo su - -c 'corosync-cfgtool -s'
Printing ring status.
Local node ID 1
RING ID 0
id     = 192.168.10.11
status = ring 0 active with no faults



ssh node-1 -c sudo su - -c 'pcs status corosync'
Membership information
----------------------
Nodeid  Votes Name
 1  1 node-1 (local)
 2  1 node-2
 3  1 node-3




ssh node-1 -c sudo su - -c 'pcs status'
Cluster name: mycluster
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Sat Jun 25 15:40:51 2016    Last change: Sat Jun 25
15:40:33 2016 by hacluster via crmd on node-2
Stack: corosync
Current DC: node-2 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with
quorum
3 nodes and 0 resources configured
Online: [ node-1 node-2 node-3 ]
Full list of resources:
PCSD Status:
  node-1: Online
  node-2: Online
  node-3: Online
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled