Re: [ClusterLabs] Installed Galera, now HAProxy won't start

2016-03-20 Thread Matthew Mucker
To close the loop, this was the root cause of the problem. There are several 
configuration files that MySQL reads at startup, and later files in the chain 
overwrite settings from files earlier in the chain. I had to edit two or three 
config files on each node to get MySQL to stop binding to 0.0.0.0 (netstat -tl 
is your friend!), but once I did, the clustered HAProxy resource came back online.
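
For anyone who lands here later, the change that finally stuck looked roughly 
like this (the file path and address below are only examples; mysqld --help 
--verbose lists which option files your installation actually reads, and in 
what order):

    # /etc/my.cnf.d/server.cnf (example path -- the last file read wins)
    [mysqld]
    bind-address = 192.168.0.11    # this node's own address, not 0.0.0.0

    # then confirm nothing is still listening on the wildcard address:
    netstat -tln | grep 3306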


Hopefully there's enough information in this thread to help those coming across 
this problem after me.


From: Ian 
Sent: Wednesday, March 16, 2016 8:27 PM
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Installed Galera, now HAProxy won't start

> configure the MariaDB server to bind to all available interfaces 
> (http://docs.openstack.org/ha-guide/controller-ha-galera-config.html, scroll 
> to "Database Configuration," and note that bind-address is 0.0.0.0). If MariaDB 
> binds to the virtual IP address, then HAProxy can't bind to that address and 
> therefore won't start. Right?

That is correct as far as my understanding goes. By binding to port 3306 on 
all IPs (0.0.0.0), you are effectively preventing HAProxy from being able to 
use port 3306 on its own IP, and vice versa.

Try setting specific bind addresses for your Galera nodes; I would be surprised 
and interested if it didn't work.
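
For illustration, something along these lines is what I mean (the addresses 
and names are placeholders only):

    # on each Galera node, bind mysqld to the node's own IP, not the VIP:
    [mysqld]
    bind-address = 192.168.0.11

    # haproxy.cfg, listening only on the virtual IP the cluster manages:
    listen galera
        bind 192.168.0.10:3306
        mode tcp
        balance source
        server galera1 192.168.0.11:3306 check
        server galera2 192.168.0.12:3306 check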




Re: [ClusterLabs] Cluster goes to unusable state if fencing resource is down

2016-03-20 Thread Ken Gaillot
On 03/18/2016 02:58 AM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2-node cluster with this config on CentOS 6.6, where I
> have a multi-state resource foo running in master/slave mode and a
> bunch of floating IP addresses configured. Additionally, I have a
> colocation constraint for the IP addresses to be colocated with the
> master.
> 
> When I configure fencing using fence_ilo4 agents, things work fine.
> However, during testing I tried a case where the iLO cable is
> unplugged. In this case the entire cluster is brought down.
> 
> I understand that this seems to be the safer solution to ensure
> correctness and consistency of the systems. However, my requirement was

Exactly. Without working fencing, the cluster can't know whether the
node is really down, or just malfunctioning and possibly still accessing
shared resources.

> to still keep it operational, since the application and the floating IP
> are still up. Is there a way to achieve this?

If fencing fails, and the node is really down, you'd be fine ignoring
the failure. But if the node is actually up, ignoring the failure means
both nodes will activate the floating IP, which will not be operational
(packets will sometimes go to one node, sometimes the other, disrupting
any reliable communication).

> Also, consider a case where there is a multi-node cluster (more
> than 10 nodes) and one of the machines just goes down along with the
> iLO for that node. Does it really make sense to bring the
> services down even when the rest of the nodes are up?

It makes sense if data integrity is your highest priority. Imagine a
cluster used by a bank for customers' account balances -- it's far
better to lock up the entire cluster than risk corrupting that data.

The best solution that pacemaker offers in this situation is fencing
topology. You can have multiple fence devices, and if one fails,
pacemaker will try the next.

One common deployment is IPMI as the first level (as you have now), with
an intelligent power switch as the second (backup) level. If IPMI
doesn't respond, the cluster will cut power to the host. Another
possibility is to use an intelligent network switch to cut off network
access to the failed node (if that is sufficient to prevent the node
from accessing any shared resources). If the services being offered are
important enough to require high availability, the relatively small cost
of an intelligent power switch should be easily justified, serving as a
type of insurance.
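
A rough sketch of that with pcs (the agent names, addresses and credentials
here are invented, and the exact options vary by fence agent version):

    # create the devices first:
    pcs stonith create fence_ipmi_node1 fence_ipmilan \
        pcmk_host_list="node1" ipaddr=10.0.0.101 login=admin passwd=secret lanplus=1
    pcs stonith create fence_pdu_node1 fence_apc \
        pcmk_host_list="node1" ipaddr=10.0.0.200 login=apc passwd=apc port=1

    # then register the levels; level 2 is only tried if level 1 fails:
    pcs stonith level add 1 node1 fence_ipmi_node1
    pcs stonith level add 2 node1 fence_pdu_node1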

Not having fencing has such a high chance of making a huge mess that no
company I know of that supports clusters will support a cluster without it.

That said, if you are supporting your own clusters, understand the
risks, and are willing to deal with the worst-case scenario manually,
pacemaker does offer the option to disable stonith. There is no built-in
option to try stonith but ignore any failures. However, it is possible
to configure a fencing topology that does the same thing, if the second
level simply pretends that the fencing succeeded. I'm not going to
encourage that by describing how ;)
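
(For reference, the "disable stonith" option mentioned above is just a cluster
property, e.g. with pcs:

    pcs property set stonith-enabled=false

but again, only if you have understood and accepted the risks described above.)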



[ClusterLabs] booth release v1.0

2016-03-20 Thread Dejan Muhamedagic
Hello everybody,

I'm happy to announce that the booth repository was yesterday
tagged as v1.0:

https://github.com/ClusterLabs/booth/releases/tag/v1.0

There have been very few patches since v1.0-rc1. The complete
list of changes is available in the ChangeLog:

https://github.com/ClusterLabs/booth/blob/v1.0/ChangeLog

Binaries are provided for some Linux distributions; currently there
are packages for CentOS 7 and various openSUSE versions:

http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/

If you don't know what booth is and what it is good for, please
check the README at the bottom of the git repository home page:

https://github.com/ClusterLabs/booth

Cheers,

Dejan



Re: [ClusterLabs] DRBD fencing issue on failover causes resource failure

2016-03-20 Thread Digimer
On 16/03/16 01:17 PM, Tim Walberg wrote:
> Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
> (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
> resources consist of a cluster address, a DRBD device mirroring between
> the two cluster nodes, the file system, and the nfs-server resource. The
> resources all behave properly until an extended failover or outage.
> 
> I have tested failover in several ways ("pcs cluster standby", "pcs
> cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.)
> and the symptoms are that, until the killed node is brought back into
> the cluster, failover never seems to complete. The DRBD device appears
> on the remaining node to be in a "Secondary/Unknown" state, and the
> resources end up looking like:
> 
> # pcs status
> Cluster name: nfscluster
> Last updated: Wed Mar 16 12:05:33 2016  Last change: Wed Mar 16
> 12:04:46 2016 by root via cibadmin on nfsnode01
> Stack: corosync
> Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
> with quorum
> 2 nodes and 5 resources configured
> 
> Online: [ nfsnode01 ]
> OFFLINE: [ nfsnode02 ]
> 
> Full list of resources:
> 
>  nfsVIP  (ocf::heartbeat:IPaddr2):   Started nfsnode01
>  nfs-server (systemd:nfs-server):   Stopped
>  Master/Slave Set: drbd_master [drbd_dev]
>  Slaves: [ nfsnode01 ]
>  Stopped: [ nfsnode02 ]
>  drbd_fs   (ocf::heartbeat:Filesystem):Stopped
> 
> PCSD Status:
>   nfsnode01: Online
>   nfsnode02: Online
> 
> Daemon Status:
>   corosync: active/enabled
>   pacemaker: active/enabled
>   pcsd: active/enabled
> 
> As soon as I bring the second node back online, the failover completes.
> But this is obviously not a good state, as an extended outage for any
> reason on one node essentially kills the cluster services. There's
> obviously something I've missed in configuring the resources, but I
> haven't been able to pinpoint it yet.
> 
> Perusing the logs, it appears that, upon the initial failure, DRBD does
> in fact promote the drbd_master resource, but immediately after that
> pengine calls for it to be demoted, for reasons I haven't been able to
> determine yet but which seem to be tied to the fencing configuration. I can
> see that the crm-fence-peer.sh script is called, but it almost seems
> like it's fencing the wrong node... Indeed, I do see that it adds a
> -INFINITY location constraint for the surviving node, which would
> explain the decision to demote the DRBD master.
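
(For anyone chasing the same symptom: the constraint the handler adds shows up,
with its id, in

    # pcs constraint --full

and typically carries an id along the lines of
drbd-fence-by-handler-drbd0-drbd_master; crm-unfence-peer.sh is what removes it
again after the peer has resynced.)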
> 
> My DRBD resource looks like this:
> 
> # cat /etc/drbd.d/drbd0.res
> resource drbd0 {
> 
> protocol C;
> startup { wfc-timeout 0; degr-wfc-timeout 120; }
> 
> disk {
> on-io-error detach;
> fencing resource-only;

This should be 'resource-and-stonith;', but that alone won't do anything
until Pacemaker's stonith is working.

> }
> 
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
> 
> on nfsnode01 {
> device /dev/drbd0;
> disk /dev/vg_nfs/lv_drbd0;
> meta-disk internal;
> address 10.0.0.2:7788;
> }
> 
> on nfsnode02 {
> device /dev/drbd0;
> disk /dev/vg_nfs/lv_drbd0;
> meta-disk internal;
> address 10.0.0.3:7788;
> }
> }
> 
> If I comment out the three lines having to do with fencing, the failover
> works properly. But I'd prefer to have the fencing there on the off
> chance that we end up with a split brain instead of just a node outage...
> 
> And, here's "pcs config --full":
> 
> # pcs config --full
> Cluster Name: nfscluster
> Corosync Nodes:
>  nfsnode01 nfsnode02
> Pacemaker Nodes:
>  nfsnode01 nfsnode02
> 
> Resources:
>  Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.0.0.1 cidr_netmask=24
>   Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
>   stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
>   monitor interval=15s (nfsVIP-monitor-interval-15s)
>  Resource: nfs-server (class=systemd type=nfs-server)
>   Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
>  Master: drbd_master
>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1 notify=true
>   Resource: drbd_dev (class=ocf provider=linbit type=drbd)
>Attributes: drbd_resource=drbd0
>Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
>promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
>demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
>stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
>monitor interval=29s role=Master
> (drbd_dev-monitor-interval-29s)
>