Re: [ClusterLabs] pcs cluster auth returns authentication error

2016-09-05 Thread Jan Pokorný
On 26/08/16 02:14 +, Jason A Ramsey wrote:
> Well, I got around the problem, but I don’t understand the solution…
> 
> I edited /etc/pam.d/password-auth and commented out the following line:
> 
auth        required      pam_tally2.so onerr=fail audit silent deny=5 unlock_time=900
> 
> Anyone have any idea why this was interfering?

No clear idea, but...

> On 08/25/2016 03:04 PM, Jason A Ramsey wrote:
>> type=USER_AUTH msg=audit(1472154922.415:69): user pid=1138 uid=0
>> auid=4294967295 ses=4294967295 subj=system_u:system_r:initrc_t:s0
>> msg='op=PAM:authentication acct="hacluster" exe="/usr/bin/ruby"
>> hostname=? addr=? terminal=? res=failed'

First, this definitely has nothing to do with SELinux (as opposed to
"AVC" type of audit record).

As a wild guess, if you want to continue using pam_tally2 module
(seems like a good idea), I'd suggest giving magic_root option
a try (and perhaps evaluate if that would be an acceptable compromise).
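
For illustration only, that would mean re-enabling the commented line with the
extra option, roughly as below (untested; per pam_tally2(8), magic_root skips
incrementing the counter when the module is invoked by a uid=0 process, which
is the case for pcsd's ruby daemon as shown by uid=0 in the audit record):

    auth        required      pam_tally2.so onerr=fail audit silent deny=5 unlock_time=900 magic_root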

-- 
Jan (Poki)




Re: [ClusterLabs] What cib_stats line means in logfile

2016-09-05 Thread Jan Pokorný
On 05/09/16 21:26 +0200, Jan Pokorný wrote:
> On 25/08/16 17:55 +0200, Sébastien Emeriau wrote:
>> When i check my corosync.log i see this line :
>> 
>> info: cib_stats: Processed 1 operations (1.00us average, 0%
>> utilization) in the last 10min
>> 
>> What does it mean (cpu load or just information) ?
> 
> These are just periodically (10 minutes by default, if any
> operations observed at all) emitted diagnostic summaries that
> were once considered useful, which was later reconsidered
> leading to their complete removal:
> 
> https://github.com/ClusterLabs/pacemaker/commit/73e8c89#diff-37b681fa792dfc09ec67bb0d64eb55feL306
> 
> Honestly, using a Pacemaker as old as 1.1.8 (released 4 years ago)

actually, it must have been even older than that (I'm afraid to ask).

> would be a bigger concern for me.  Plenty of important fixes
> (as well as enhancements) have been added since then...

P.S. Checked my mailbox, aggregating plentiful sources such as this
list and various GitHub notifications, and found one other trace of
such an outdated version within this year, plus two more last year(!).

-- 
Jan (Poki)




Re: [ClusterLabs] What cib_stats line means in logfile

2016-09-05 Thread Jan Pokorný
On 25/08/16 17:55 +0200, Sébastien Emeriau wrote:
> When i check my corosync.log i see this line :
> 
> info: cib_stats: Processed 1 operations (1.00us average, 0%
> utilization) in the last 10min
> 
> What does it mean (cpu load or just information) ?

These are just periodically (10 minutes by default, if any
operations observed at all) emitted diagnostic summaries that
were once considered useful, which was later reconsidered
leading to their complete removal:

https://github.com/ClusterLabs/pacemaker/commit/73e8c89#diff-37b681fa792dfc09ec67bb0d64eb55feL306

Honestly, using a Pacemaker as old as 1.1.8 (released 4 years ago)
would be a bigger concern for me.  Plenty of important fixes
(as well as enhancements) have been added since then...

-- 
Jan (Poki)




Re: [ClusterLabs] fence_apc delay?

2016-09-05 Thread Marek Grac
Hi,

On Mon, Sep 5, 2016 at 3:46 PM, Dan Swartzendruber 
wrote:

> ...
> Marek, thanks.  I have tested repeatedly (8 or so times with disk writes
> in progress) with 5-7 seconds and have had no corruption.  My only issue
> with using power_wait here (possibly I am misunderstanding this) is that
> the default action is 'reboot' which I *think* is 'power off, then power
> on'.  e.g. two operations to the fencing device.  The only place I need a
> delay though, is after the power off operation - doing so after power on is
> just wasted time that the resource is offline before the other node takes
> it over.  Am I misunderstanding this?  Thanks!
>

You are right. The default sequence for reboot is:

get status, power off, delay (power-wait), get status [repeat until OFF],
power on, delay (power-wait), get status [repeat until ON].

The power-wait option was introduced because some devices respond with strange
values when they are queried too soon after a power change. It was not intended
to be used in the way you propose. Possible solutions:

*) Configure the fence device to use OFF and ON instead of reboot.
This is the same as the situation where there are multiple power circuits: you
have to switch them all OFF and afterwards turn them back ON.

*) Add a new option power-wait-off that will be used only in the OFF case (and
will override power-wait). It should be quite easy to do; just send us a PR.
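
As a rough pcs sketch of the first option (the address, credentials, host list
and power_wait value below are placeholders; check `man fence_apc` for the
parameters of your agent version):

    pcs stonith create apc-fence fence_apc \
        ipaddr=apc.example.com login=apc passwd=secret \
        pcmk_host_list="node1 node2" \
        method=onoff power_wait=5

With method=onoff the agent should issue a separate OFF and ON rather than the
device's own reboot/cycle action, and power_wait then applies after each of them.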

m,


Re: [ClusterLabs] Failover IP with Monitoring but not controling the colocated services.

2016-09-05 Thread Stefan Schörghofer
On Mon, 5 Sep 2016 15:04:34 +0200
Klaus Wenninger  wrote:

> On 09/05/2016 01:38 PM, Stefan Schörghofer wrote:
> > Hi List,
> >
> > I am currently trying to setup the following situation in my lab:
> >
> > |--Cluster IP--|
> > | HAProxy instances |HAProxy instances |
> > | Node 1|   Node 2 |
> >
> >
> >
> > Now I've successfully added the Cluster IP resource to pacemaker and
> > tested the failover which worked perfectly.
> >
> > After that I wanted to ensure that all HAProxy instances are
> > running on the node I want to fail over to, so I wrote an
> > OCF-compatible script that supports all necessary options and added
> > it as a resource. The script works and monitors the HAProxy instances
> > just fine; it also restarts them if some go down, but it also stops
> > the haproxy instances on the standby node.
> >
> > What I want to have is multiple nodes where all HAProxy instances
> > are permanently running, but if something happens on the primary
> > node the Cluster IP should fail over to the other (only if the
> > HAProxy instances are running fine there).
> >
> > Is this possible with Pacemaker/Corosync? I've read a lot in the
> > documentation and found something about resource_clones and also a
> > resource-type that uses Nagios Plugins. Maybe there is a way with
> > using this?
> >
> > I tried to set up the clone resource, and the HAProxy instances were
> > running on all nodes.
> > I created a colocation rule for the clone resource and the Cluster
> > IP, and after a manual migration all services were stopped.
> 
> Sounds as if the colocation could have the wrong order.
> Basically the IP-primitive collocated with the HAProxy clones
> should be fine.
> Maybe you could paste the config...

That was the push in the right direction. Thanks for that.
I've checked the order and indeed...it was wrong during my tests.

The config now looks like:

node 167916820: sbg-vm-lb-tcp01
node 167916821: sbg-vm-lb-tcp02
primitive haproxy-rsc ocf:custom:haproxy \
meta migration-threshold=3 \
op monitor interval=60 timeout=30 on-fail=restart
primitive sbg-vm-lb-tcpC1 IPaddr2 \
params ip=10.2.53.22 nic=eth0 cidr_netmask=24 \
meta migration-threshold=2 target-role=Started \
op monitor interval=20 timeout=60 on-fail=restart
clone cl_haproxy-rsc haproxy-rsc \
params globally-unique=false
location cli-ban-sbg-vm-lb-tcpC1-on-sbg-vm-lb-tcp02 sbg-vm-lb-tcpC1 role=Started -inf: sbg-vm-lb-tcp02
colocation loc-sbg-vm-lb-tcpC1 inf: sbg-vm-lb-tcpC1 cl_haproxy-rsc
order ord-sbg-vm-lb-tcpC1 inf: cl_haproxy-rsc sbg-vm-lb-tcpC1
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=debian \
stonith-enabled=no \
no-quorum-policy=ignore \
default-resource-stickiness=100 \
last-lrm-refresh=1473069316


And everything works as expected.
Thank you!

regards,
Stefan


> 
> Regards,
> Klaus
> 
> >
> > Can I maybe work with multiple HAProxy resources (one per node) and
> > set up the colocation rule with "ClusterIP coloc with (haproxy_node1
> > or haproxy_node2)". Does something like this work?
> >
> >
> > Thanks for your input.
> > Best regards,
> > Stefan Schörghofer
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started:
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
> > http://bugs.clusterlabs.org  
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
> http://bugs.clusterlabs.org




Re: [ClusterLabs] Antw: Re: fence_apc delay?

2016-09-05 Thread Dan Swartzendruber

On 2016-09-05 03:04, Ulrich Windl wrote:
Marek Grac wrote on 03.09.2016 at 14:41:

Hi,

There are two problems mentioned in the email.

1) power-wait

Power-wait is a quite advanced option and there are only a few fence
devices/agents where it makes sense, and only because the HW/firmware on the
device is somewhat broken. Basically, when we execute a power ON/OFF
operation, we wait for power-wait seconds before we send the next command. I
don't remember any issue of this kind with APC.


2) the only theory I could come up with was that maybe the fencing
operation was considered complete too quickly?

That is virtually not possible. Even when power ON/OFF is asynchronous, we
test the status of the device and the fence agent waits until the status of
the plug/VM/... matches what the user wants.


I can imagine that a powerful power supply can deliver up to one
second of power even if the mains is disconnected. If the cluster is
very quick after fencing, it might be a problem. I'd suggest a 5 to 10
second delay between fencing action and cluster reaction.


Ulrich, please see the response I just posted to Marek.  Thanks!



Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Klaus Wenninger
On 09/05/2016 03:02 PM, Gabriele Bulfon wrote:
> I read docs, looks like sbd fencing is more about iscsi/fc exposed
> storage resources.
> Here I have real shared disks (seen from solaris with the format
> utility as normal sas disks, but on both nodes).
> They are all jbod disks, that ZFS organizes in raidz/mirror pools, so
> I have 5 disks on one pool in one node, and the other 5 disks on
> another pool in one node.
> How can sbd work in this situation? Has it already been used/tested on
> a Solaris env with ZFS ?

You wouldn't have to have disks at all with sbd. You can just use it for
pacemaker to be monitored by a hardware watchdog.
But if you want to add disks it shouldn't really matter how they are accessed,
as long as you can concurrently read/write the block-devices. Configuration of
caching in the controllers might be an issue as well.
I'm e.g. currently testing with a simple kvm setup using the following
virsh config for the shared block-device:
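The XML itself was stripped by the list archive; for illustration only, a
shareable raw block-device stanza in a libvirt domain definition typically
looks roughly like this (the source path and target name are placeholders):

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-id/virtio-shared0'/>
      <target dev='vdb' bus='virtio'/>
      <shareable/>
    </disk>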


Don't know about test-coverage for sbd on Solaris. Actually it should be
independent of which file-system you are using, as you would anyway use a
partition without a filesystem for sbd.
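
If a disk is added, initializing it for sbd is just a matter of pointing the
sbd tool at the raw partition, e.g. (the device path is a placeholder, and this
assumes the sbd userland builds and runs on your platform):

    sbd -d /dev/rdsk/<shared-partition> create   # write the sbd metadata/slots
    sbd -d /dev/rdsk/<shared-partition> list     # verify the node slots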

>
> BTW, is there any other possibility other than sbd.
>

Probably - see Ken's suggestions.
Excuse me for thinking a little one-dimensionally at the moment,
as I'm working on some sbd issue ;-)
And when you don't have a proper fencing device, a watchdog is the last resort
to have something working reliably. And Pacemaker's way to do watchdog is sbd...

> Last but not least, is there any way to let ssh-fencing be considered
> good?
> At the moment, with ssh-fencing, if I shut down the second node, I get
> all of the second node's resources in UNCLEAN state, not taken over by the
> first one. If I reboot the second, I only get the node online again, but its
> resources remain stopped.

Strange... What do the logs say about the fencing-action being
successful or not?

>
> I remember my tests with heartbeat reacting differently (a halt would move
> everything to node1 and bring everything back on restart)
>
> Gabriele
>
> 
> *Sonicle S.r.l. *: http://www.sonicle.com 
> *Music: *http://www.gabrielebulfon.com 
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
> --
>
> From: Klaus Wenninger
> To: users@clusterlabs.org
> Date: 5 September 2016 12.21.25 CEST
> Subject: Re: [ClusterLabs] ip clustering strange behaviour
>
> On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
> > The dual machine is equipped with a syncro controller LSI 3008
> MPT SAS3.
> > Both nodes can see the same jbod disks (10 at the moment, up to 24).
> > Systems are XStreamOS / illumos, with ZFS.
> > Each system has one ZFS pool of 5 disks, with different pool names
> > (data1, data2).
> > When in active / active, the two machines run different zones and
> > services on their pools, on their networks.
> > I have custom resource agents (tested on pacemaker/heartbeat, now
> > porting to pacemaker/corosync) for ZFS pools and zones migration.
> > When I was testing pacemaker/heartbeat, when ssh-fencing discovered
> > the other node to be down (cleanly or abrupt halt), it was
> > automatically using IPaddr and our ZFS agents to take control of
> > everything, mounting the other pool and running any configured
> zone in it.
> > I would like to do the same with pacemaker/corosync.
> > The two nodes of the dual machine have an internal LAN connecting them,
> > a 100Mb ethernet: maybe this is reliable enough to trust ssh-fencing?
> > Or is there anything I can do to ensure at the controller level that
> > the pool is not in use on the other node?
>
> It is not just about the reliability of the networking-connection why
> ssh-fencing might be
> suboptimal. Something with the IP-Stack config (dynamic due to moving
> resources)
> might have gone wrong. And resources might be somehow hanging so that
> the node
> can't be brought down gracefully. Thus my suggestion to add a watchdog
> (so available)
> via sbd.
>
> >
> > Gabriele
> >
> >
> 
> 
> > *Sonicle S.r.l. *: http://www.sonicle.com 
> > *Music: *http://www.gabrielebulfon.com
> 
> > *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> >
> >
> >
> >
> 
> --
> >
> > From: Ken Gaillot
> > To: gbul...@sonicle.com Cluster Labs - All topics related to
> > open-source clustering welcomed

Re: [ClusterLabs] pacemaker doesn't failover when httpd killed

2016-09-05 Thread Nurit Vilosny
Perfect! I did miss it. Thanks for the help!!

-Original Message-
From: Kristoffer Grönlund [mailto:kgronl...@suse.com] 
Sent: Monday, September 05, 2016 3:27 PM
To: Nurit Vilosny ; users@clusterlabs.org
Subject: RE: [ClusterLabs] pacemaker doesn't failover when httpd killed

Nurit Vilosny  writes:

> Here is the configuration for the httpd:
>
> # pcs resource show cluster_virtualIP
> Resource: cluster_virtualIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.215.53.99
>   Operations: monitor interval=20s (cluster_virtualIP-monitor-interval-20s)
>   start interval=0s timeout=20s 
> (cluster_virtualIP-start-interval-0s)
>   stop interval=0s timeout=20s on-fail=restart 
> (cluster_virtualIP-stop-interval-0s)
>
> (yes - I have monitoring configured and yes I used the ocf)
>

Hi Nurit,

That's just the cluster resource for managing a virtual IP, not the resource 
for managing the httpd daemon itself.

If you've only got this resource, then there is nothing that monitors the web 
server. You need a cluster resource for the web server as well 
(ocf:heartbeat:apache, usually).

You are missing both that resource and the constraints that ensure that the 
virtual IP is active on the same node as the web server. The Clusters from 
Scratch document on the clusterlabs.org website shows you how to configure this.
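
A rough sketch along the lines of Clusters from Scratch (the resource name
WebSite, the config file path and the status URL below are placeholders to
adapt to your setup):

    pcs resource create WebSite ocf:heartbeat:apache \
        configfile=/etc/httpd/conf/httpd.conf \
        statusurl="http://localhost/server-status" \
        op monitor interval=30s
    pcs constraint colocation add cluster_virtualIP with WebSite INFINITY
    pcs constraint order WebSite then cluster_virtualIP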

Cheers,
Kristoffer

--
// Kristoffer Grönlund
// kgronl...@suse.com


Re: [ClusterLabs] When the DC crmd is frozen, cluster decisions are delayed infinitely

2016-09-05 Thread Klaus Wenninger
On 09/03/2016 08:42 PM, Shermal Fernando wrote:
>
> Hi,
>
>  
>
> Currently our system has 99.96% uptime. But our goal is to increase
> it beyond 99.999%. Now we are studying the
> reliability/performance/features of pacemaker to replace the existing
> clustering solution.
>
>  
>
> While testing pacemaker, I have encountered a problem. If the DC (crm
> daemon) is frozen by sending the SIGSTOP signal, the crmds on other
> machines never start an election to elect a new DC. Therefore fail-overs,
> resource restarts and other cluster decisions will be delayed until
> the DC is unfrozen.
>
> Is this the default behavior of pacemaker or is it due to a
> misconfiguration? Is there any way to avoid this single point of failure?
>
>  
>
> For the testing, we use Pacemaker 1.1.12 with Corosync 2.3.3 on the SLES
> 12 SP1 operating system.
>

Guess I can reproduce that with pacemaker 1.1.15 & corosync 2.3.6.
I have sbd with the pacemaker watcher running on the nodes as well.
As the node health is not updated and the cib can be read, sbd is
happy - as is to be expected.
Maybe we could at least add something to the sbd pacemaker watcher
to detect the issue ... thinking ...

Regards,
Klaus

>  
>
>  
>
> Regards,
>
> Shermal Fernando
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Gabriele Bulfon
I read docs, looks like sbd fencing is more about iscsi/fc exposed storage 
resources.
Here I have real shared disks (seen from solaris with the format utility as 
normal sas disks, but on both nodes).
They are all jbod disks, that ZFS organizes in raidz/mirror pools, so I have 5 
disks on one pool in one node, and the other 5 disks on another pool in one 
node.
How can sbd work in this situation? Has it already been used/tested on a 
Solaris env with ZFS ?
BTW, is there any other possibility other than sbd.
Last but not least, is there any way to let ssh-fencing be considered good?
At the moment, with ssh-fencing, if I shut down the second node, I get all of
the second node's resources in UNCLEAN state, not taken over by the first one.
If I reboot the second, I only get the node online again, but its resources
remain stopped.
I remember my tests with heartbeat reacting differently (a halt would move
everything to node1 and bring everything back on restart)
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 5 September 2016 12.21.25 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
The dual machine is equipped with a syncro controller LSI 3008 MPT SAS3.
Both nodes can see the same jbod disks (10 at the moment, up to 24).
Systems are XStreamOS / illumos, with ZFS.
Each system has one ZFS pool of 5 disks, with different pool names
(data1, data2).
When in active / active, the two machines run different zones and
services on their pools, on their networks.
I have custom resource agents (tested on pacemaker/heartbeat, now
porting to pacemaker/corosync) for ZFS pools and zones migration.
When I was testing pacemaker/heartbeat, when ssh-fencing discovered
the other node to be down (cleanly or abrupt halt), it was
automatically using IPaddr and our ZFS agents to take control of
everything, mounting the other pool and running any configured zone in it.
I would like to do the same with pacemaker/corosync.
The two nodes of the dual machine have an internal LAN connecting them,
a 100Mb ethernet: maybe this is reliable enough to trust ssh-fencing?
Or is there anything I can do to ensure at the controller level that
the pool is not in use on the other node?
It is not just about the reliability of the networking-connection why
ssh-fencing might be
suboptimal. Something with the IP-Stack config (dynamic due to moving
resources)
might have gone wrong. And resources might be somehow hanging so that
the node
can't be brought down gracefully. Thus my suggestion to add a watchdog
(so available)
via sbd.
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to
open-source clustering welcomed
Date: 1 September 2016 15.49.04 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested else
where, or
"no-quorum-policy=stop"?
I'd prefer "two_node: 1" and letting pacemaker's options default. But
see the votequorum(5) man page for what two_node implies -- most
importantly, both nodes have to be available when the cluster starts
before it will start any resources. Node failure is handled fine once
the cluster has started, but at start time, both nodes must be up.
About fencing, the machine I'm going to implement the 2-nodes
cluster is
a dual machine with shared disks backend.
Each node has two 10Gb ethernets dedicated to the public ip and the
admin console.
Then there is a third 100Mb ethernet connecting the two machines
internally.
I was going to use this last one as fencing via ssh, but looks
like this
way I'm not gonna have ip/pool/zone movements if one of the nodes
freezes or halts without shutting down pacemaker clean.
What should I use instead?
I'm guessing as a dual machine, they share a power supply, so that
rules
out a power switch. If the box has IPMI that can individually power
cycle each host, you can use fence_ipmilan. If the disks are
shared via
iSCSI, you could use fence_scsi. If the box has a hardware watchdog
device that can individually target the hosts, you could use sbd. If
none of those is an option, probably the best you could do is run the
cluster nodes as VMs on each host, and use fence_xvm.
Thanks for your help,
Gabriele


Re: [ClusterLabs] Failover IP with Monitoring but not controling the colocated services.

2016-09-05 Thread Klaus Wenninger
On 09/05/2016 01:38 PM, Stefan Schörghofer wrote:
> Hi List,
>
> I am currently trying to setup the following situation in my lab:
>
> |--Cluster IP--|
> | HAProxy instances |HAProxy instances |
> | Node 1|   Node 2 |
>
>
>
> Now I've successfully added the Cluster IP resource to pacemaker and
> tested the failover which worked perfectly.
>
> After that I wanted to ensure that all HAProxy instances are running on
> the node I want to fail over to, so I wrote an OCF-compatible script that
> supports all necessary options and added it as a resource.
> The script works and monitors the HAProxy instances just fine; it also
> restarts them if some go down, but it also stops the haproxy instances
> on the standby node.
>
> What I want to have is multiple nodes where all HAProxy instances are
> permanently running, but if something happens on the primary node the
> Cluster IP should fail over to the other (only if the HAProxy instances
> are running fine there).
>
> Is this possible with Pacemaker/Corosync? I've read a lot in the
> documentation and found something about resource_clones and also a
> resource-type that uses Nagios Plugins. Maybe there is a way with using
> this?
>
> I tried to set up the clone resource, and the HAProxy instances were
> running on all nodes.
> I created a colocation rule for the clone resource and the Cluster IP,
> and after a manual migration all services were stopped.

Sounds as if the colocation could have the wrong order.
Basically the IP-primitive collocated with the HAProxy clones
should be fine.
Maybe you could paste the config...
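
For illustration, in crm shell syntax the intended relation would be along
these lines (the resource names here are placeholders; adapt to your own IDs):

    colocation ip-with-haproxy inf: ClusterIP cl_haproxy-rsc
    order haproxy-before-ip inf: cl_haproxy-rsc ClusterIP

i.e. the IP follows the HAProxy clone, and the clone is started first.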

Regards,
Klaus

>
> Can I maybe work with multiple HAProxy resources (one per node) and
> set up the colocation rule with "ClusterIP coloc with (haproxy_node1 or
> haproxy_node2)". Does something like this work?
>
>
> Thanks for your input.
> Best regards,
> Stefan Schörghofer
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




Re: [ClusterLabs] pacemaker doesn't failover when httpd killed

2016-09-05 Thread Kristoffer Grönlund
Nurit Vilosny  writes:

> Here is the configuration for the httpd:
>
> # pcs resource show cluster_virtualIP
> Resource: cluster_virtualIP (class=ocf provider=heartbeat type=IPaddr2)
>   Attributes: ip=10.215.53.99
>   Operations: monitor interval=20s (cluster_virtualIP-monitor-interval-20s)
>   start interval=0s timeout=20s 
> (cluster_virtualIP-start-interval-0s)
>   stop interval=0s timeout=20s on-fail=restart 
> (cluster_virtualIP-stop-interval-0s)
>
> (yes - I have monitoring configured and yes I used the ocf)
>

Hi Nurit,

That's just the cluster resource for managing a virtual IP, not the
resource for managing the httpd daemon itself.

If you've only got this resource, then there is nothing that monitors
the web server. You need a cluster resource for the web server as well
(ocf:heartbeat:apache, usually).

You are missing both that resource and the constraints that ensure that
the virtual IP is active on the same node as the web server. The
Clusters from Scratch document on the clusterlabs.org website shows you
how to configure this.

Cheers,
Kristoffer

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] pacemaker doesn't failover when httpd killed

2016-09-05 Thread Digimer
Depends on your OS, but generally /var/log/messages. Also, please share
your full pacemaker config. Please only obfuscate passwords.

digimer

On 05/09/16 07:53 PM, Nurit Vilosny wrote:
> Hi Kristoffer,
> Thanks for the prompt answer.
> Result of kill -9 is a dead process. Restart is not being performed.
> Can you tell me what logs to attach, so I can add them?
> 
> -Original Message-
> From: Kristoffer Grönlund [mailto:kgronl...@suse.com] 
> Sent: Monday, September 05, 2016 9:35 AM
> To: Nurit Vilosny ; users@clusterlabs.org
> Subject: Re: [ClusterLabs] pacemaker doesn't failover when httpd killed
> 
> Nurit Vilosny  writes:
> 
>> Hi everyone,
>> I tried the IRC for that, but I get disconnected and cannot see the reply...
>> So I try again:
>> I have a cluster with 3 nodes and 2 services: apache and application service 
>> - grouped together.
>> Debugging the cluster I used kill -9 to kill the httpd process, assuming the 
>> services will migrate to another node, but they didn't.
>> The log didn't show anything, and I remember reading that pacemaker checks httpd
>> status somewhere other than with "service httpd status" - but I couldn't find where.
>> Any idea what I can do?
> 
> Without any attached logs or before/after status information, it's difficult 
> to know what exactly happened in your case. But by default, Pacemaker tries 
> to restart the service on the same node before migrating to another node. So 
> running kill -9 on httpd should result in a restart on the same node, not a 
> migration.
> 
> Cheers,
> Kristoffer
> 
>>
>> Thanks.
>> Nurit
>>
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org Getting started: 
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> --
> // Kristoffer Grönlund
> // kgronl...@suse.com
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] pacemaker doesn't failover when httpd killed

2016-09-05 Thread Nurit Vilosny
Here is the configuration for the httpd:

# pcs resource show cluster_virtualIP
Resource: cluster_virtualIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.215.53.99
  Operations: monitor interval=20s (cluster_virtualIP-monitor-interval-20s)
  start interval=0s timeout=20s 
(cluster_virtualIP-start-interval-0s)
  stop interval=0s timeout=20s on-fail=restart 
(cluster_virtualIP-stop-interval-0s)

(yes - I have monitoring configured and yes I used the ocf)

Regards,
Nurit

-Original Message-
From: Kristoffer Grönlund [mailto:kgronl...@suse.com] 
Sent: Monday, September 05, 2016 2:01 PM
To: Nurit Vilosny ; users@clusterlabs.org
Subject: RE: [ClusterLabs] pacemaker doesn't failover when httpd killed

Nurit Vilosny  writes:

> Hi Kristoffer,
> Thanks for the prompt answer.
> Result of kill -9 is a dead process. Restart is not being performed.
> Can you tell me what logs to attach, so I can add them?

Hi Nurit,

Start by attaching your configuration. Do you have a monitoring operation 
configured for your apache resource? Did you use the OCF resource agent?

Cheers,
Kristoffer

--
// Kristoffer Grönlund
// kgronl...@suse.com


[ClusterLabs] Failover IP with Monitoring but not controling the colocated services.

2016-09-05 Thread Stefan Schörghofer
Hi List,

I am currently trying to setup the following situation in my lab:

|--Cluster IP--|
| HAProxy instances |HAProxy instances |
| Node 1|   Node 2 |



Now I've successfully added the Cluster IP resource to pacemaker and
tested the failover which worked perfectly.

After that I wanted to ensure that all HAProxy instances are running on
the node I want to fail over to, so I wrote an OCF-compatible script that
supports all necessary options and added it as a resource.
The script works and monitors the HAProxy instances just fine; it also
restarts them if some go down, but it also stops the haproxy instances
on the standby node.

What I want to have is multiple nodes where all HAProxy instances are
permanently running, but if something happens on the primary node the
Cluster IP should fail over to the other (only if the HAProxy instances
are running fine there).

Is this possible with Pacemaker/Corosync? I've read a lot in the
documentation and found something about resource_clones and also a
resource-type that uses Nagios Plugins. Maybe there is a way with using
this?

I tried to set up the clone resource, and the HAProxy instances were
running on all nodes.
I created a colocation rule for the clone resource and the Cluster IP,
and after a manual migration all services were stopped.

Can I maybe work with multiple HAProxy resources (one per node) and
set up the colocation rule with "ClusterIP coloc with (haproxy_node1 or
haproxy_node2)". Does something like this work?


Thanks for your input.
Best regards,
Stefan Schörghofer



Re: [ClusterLabs] Service pacemaker start kills my cluster and other NFS HA issues

2016-09-05 Thread Pablo Pines Leon
Hello,

I implemented the suggested change in corosync, and I realized that "service
pacemaker stop" on the master node works provided that I run crm_resource -P
from another terminal right after it. The same goes for the "failback" case of
bringing the failed node back into the cluster, which causes the IP resource
and then the NFS exports to fail: if I run crm_resource -P twice after running
"service pacemaker start" to get it back in, it will work.

However, I see no reason why this is happening: if the failover works fine, why
should there be any problem getting a node back into the cluster?

Thanks and kind regards

Pablo

From: Pablo Pines Leon [pablo.pines.l...@cern.ch]
Sent: 01 September 2016 09:49
To: kgail...@redhat.com; Cluster Labs - All topics  related to open-source 
clustering welcomed
Subject: Re: [ClusterLabs] Service pacemaker start kills my cluster and other 
NFS HA issues

Dear Ken,

Thanks for your reply. That configuration in Ubuntu works perfectly fine, the 
problem is that in CentOS 7 for some reason I am not even able to do a "service 
pacemaker stop" of the node that is running as master (with the slave off too) 
because it will have some failed actions that don't make any sense:

Migration Summary:
* Node nfsha1:
   res_exportfs_root: migration-threshold=100 fail-count=1 last-failure='Thu Sep  1 09:42:43 2016'
   res_exportfs_export1: migration-threshold=100 fail-count=100 last-failure='Thu Sep  1 09:42:38 2016'

Failed Actions:
* res_exportfs_root_monitor_3 on nfsha1 'not running' (7): call=79, status=complete, exitreason='none',
    last-rc-change='Thu Sep  1 09:42:43 2016', queued=0ms, exec=0ms
* res_exportfs_export1_stop_0 on nfsha1 'unknown error' (1): call=88, status=Timed Out, exitreason='none',
    last-rc-change='Thu Sep  1 09:42:18 2016', queued=0ms, exec=20001ms

So I am wondering what is different between both OSes that will cause this 
different outcome.

Kind regards


From: Ken Gaillot [kgail...@redhat.com]
Sent: 31 August 2016 17:31
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Service pacemaker start kills my cluster and other 
NFS HA issues

On 08/30/2016 10:49 AM, Pablo Pines Leon wrote:
> Hello,
>
> I have set up a DRBD-Corosync-Pacemaker cluster following the
> instructions from https://wiki.ubuntu.com/ClusterStack/Natty adapting
> them to CentOS 7 (e.g: using systemd). After testing it in Virtual

There is a similar how-to specifically for CentOS 7:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Clusters_from_Scratch/index.html

I think if you compare your configs to that, you'll probably find the
cause. I'm guessing the most important missing pieces are "two_node: 1"
in corosync.conf, and fencing.
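
For reference, the relevant fragment is just the quorum section of
corosync.conf on both nodes (corosync 2.x; merge into your existing file):

    quorum {
        provider: corosync_votequorum
        two_node: 1
    }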


> Machines it seemed to be working fine, so it is now implemented in
> physical machines, and I have noticed that the failover works fine as
> long as I kill the master by pulling the AC cable, but not if I issue
> the halt, reboot or shutdown commands, that makes the cluster get in a
> situation like this:
>
> Last updated: Tue Aug 30 16:55:58 2016  Last change: Tue Aug 23
> 11:49:43 2016 by hacluster via crmd on nfsha2
> Stack: corosync
> Current DC: nfsha2 (version 1.1.13-10.el7_2.4-44eb2dd) - partition with
> quorum
> 2 nodes and 9 resources configured
>
> Online: [ nfsha1 nfsha2 ]
>
>  Master/Slave Set: ms_drbd_export [res_drbd_export]
>      Masters: [ nfsha2 ]
>      Slaves: [ nfsha1 ]
>  Resource Group: rg_export
>      res_fs                (ocf::heartbeat:Filesystem):  Started nfsha2
>      res_exportfs_export1  (ocf::heartbeat:exportfs):    FAILED nfsha2 (unmanaged)
>      res_ip                (ocf::heartbeat:IPaddr2):     Stopped
>  Clone Set: cl_nfsserver [res_nfsserver]
>      Started: [ nfsha1 ]
>  Clone Set: cl_exportfs_root [res_exportfs_root]
>      res_exportfs_root     (ocf::heartbeat:exportfs):    FAILED nfsha2
>      Started: [ nfsha1 ]
>
> Migration Summary:
> * Node 2:
>    res_exportfs_export1: migration-threshold=100 fail-count=100 last-failure='Tue Aug 30 16:55:50 2016'
>    res_exportfs_root: migration-threshold=100 fail-count=1 last-failure='Tue Aug 30 16:55:48 2016'
> * Node 1:
>
> Failed Actions:
> * res_exportfs_export1_stop_0 on nfsha2 'unknown error' (1): call=134,
>     status=Timed Out, exitreason='none',
>     last-rc-change='Tue Aug 30 16:55:30 2016', queued=0ms, exec=20001ms
> * res_exportfs_root_monitor_3 on nfsha2 'not running' (7): call=126,
>     status=complete, exitreason='none',
>     last-rc-change='Tue Aug 30 16:55:48 2016', queued=0ms, exec=0ms
>
> This of course blocks it, because the IP and the NFS exports are down.
> It doesn't even recognize that the other node is down. I am then forced
> to do "crm_resource -P" to get it back to a working state.
>
> Even when unplugging the master, and booting it up again, trying to get
> it back in the cluster executing "service 

Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Klaus Wenninger
On 09/05/2016 11:20 AM, Gabriele Bulfon wrote:
> The dual machine is equipped with a syncro controller LSI 3008 MPT SAS3.
> Both nodes can see the same jbod disks (10 at the moment, up to 24).
> Systems are XStreamOS / illumos, with ZFS.
> Each system has one ZFS pool of 5 disks, with different pool names
> (data1, data2).
> When in active / active, the two machines run different zones and
> services on their pools, on their networks.
> I have custom resource agents (tested on pacemaker/heartbeat, now
> porting to pacemaker/corosync) for ZFS pools and zones migration.
> When I was testing pacemaker/heartbeat, when ssh-fencing discovered
> the other node to be down (cleanly or abrupt halt), it was
> automatically using IPaddr and our ZFS agents to take control of
> everything, mounting the other pool and running any configured zone in it.
> I would like to do the same with pacemaker/corosync.
> The two nodes of the dual machine have an internal LAN connecting them,
> a 100Mb ethernet: maybe this is reliable enough to trust ssh-fencing?
> Or is there anything I can do to ensure at the controller level that
> the pool is not in use on the other node?

It is not just about the reliability of the networking-connection why
ssh-fencing might be
suboptimal. Something with the IP-Stack config (dynamic due to moving
resources)
might have gone wrong. And resources might be somehow hanging so that
the node
can't be brought down gracefully. Thus my suggestion to add a watchdog
(so available)
via sbd.

>
> Gabriele
>
> 
> *Sonicle S.r.l. *: http://www.sonicle.com 
> *Music: *http://www.gabrielebulfon.com 
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
>
>
>
> --
>
> From: Ken Gaillot
> To: gbul...@sonicle.com Cluster Labs - All topics related to
> open-source clustering welcomed
> Date: 1 September 2016 15.49.04 CEST
> Subject: Re: [ClusterLabs] ip clustering strange behaviour
>
> On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
> > Thanks, got it.
> > So, is it better to use "two_node: 1" or, as suggested else
> where, or
> > "no-quorum-policy=stop"?
>
> I'd prefer "two_node: 1" and letting pacemaker's options default. But
> see the votequorum(5) man page for what two_node implies -- most
> importantly, both nodes have to be available when the cluster starts
> before it will start any resources. Node failure is handled fine once
> the cluster has started, but at start time, both nodes must be up.
>
> > About fencing, the machine I'm going to implement the 2-nodes
> cluster is
> > a dual machine with shared disks backend.
> > Each node has two 10Gb ethernets dedicated to the public ip and the
> > admin console.
> > Then there is a third 100Mb ethernet connecting the two machines
> > internally.
> > I was going to use this last one as fencing via ssh, but looks
> like this
> > way I'm not gonna have ip/pool/zone movements if one of the nodes
> > freezes or halts without shutting down pacemaker clean.
> > What should I use instead?
>
> I'm guessing as a dual machine, they share a power supply, so that
> rules
> out a power switch. If the box has IPMI that can individually power
> cycle each host, you can use fence_ipmilan. If the disks are
> shared via
> iSCSI, you could use fence_scsi. If the box has a hardware watchdog
> device that can individually target the hosts, you could use sbd. If
> none of those is an option, probably the best you could do is run the
> cluster nodes as VMs on each host, and use fence_xvm.
>
> > Thanks for your help,
> > Gabriele
> >
> >
> 
> 
> > *Sonicle S.r.l. *: http://www.sonicle.com 
> > *Music: *http://www.gabrielebulfon.com
> 
> > *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> >
> >
> >
> >
> 
> --
> >
> > From: Ken Gaillot
> > To: users@clusterlabs.org
> > Date: 31 August 2016 17.25.05 CEST
> > Subject: Re: [ClusterLabs] ip clustering strange behaviour
> >
> > On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
> > > Sorry for reiterating, but my main question was:
> > >
> > > why does node 1 remove its own IP if I shut down node 2 abruptly?
> > > I understand that it does not take the node 2 IP (because the
> > > ssh-fencing has no clue about what happened on the 2nd node),
> but I
> > > wouldn't expect it to shut down its own 

Re: [ClusterLabs] ip clustering strange behaviour

2016-09-05 Thread Gabriele Bulfon
The dual machine is equipped with a syncro controller LSI 3008 MPT SAS3.
Both nodes can see the same jbod disks (10 at the moment, up to 24).
Systems are XStreamOS / illumos, with ZFS.
Each system has one ZFS pool of 5 disks, with different pool names (data1, 
data2).
When in active / active, the two machines run different zones and services on 
their pools, on their networks.
I have custom resource agents (tested on pacemaker/heartbeat, now porting to 
pacemaker/corosync) for ZFS pools and zones migration.
When I was testing pacemaker/heartbeat, when ssh-fencing discovered the other 
node to be down (cleanly or abrupt halt), it was automatically using IPaddr and 
our ZFS agents to take control of everything, mounting the other pool and 
running any configured zone in it.
I would like to do the same with pacemaker/corosync.
The two nodes of the dual machine have an internal LAN connecting them, a 100Mb
ethernet: maybe this is reliable enough to trust ssh-fencing? Or is there
anything I can do to ensure at the controller level that the pool is not in use 
on the other node?
Gabriele

Sonicle S.r.l.
:
http://www.sonicle.com
Music:
http://www.gabrielebulfon.com
Quantum Mechanics :
http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: gbul...@sonicle.com Cluster Labs - All topics related to open-source
clustering welcomed
Date: 1 September 2016 15.49.04 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/31/2016 11:50 PM, Gabriele Bulfon wrote:
Thanks, got it.
So, is it better to use "two_node: 1" or, as suggested else where, or
"no-quorum-policy=stop"?
I'd prefer "two_node: 1" and letting pacemaker's options default. But
see the votequorum(5) man page for what two_node implies -- most
importantly, both nodes have to be available when the cluster starts
before it will start any resources. Node failure is handled fine once
the cluster has started, but at start time, both nodes must be up.
About fencing, the machine I'm going to implement the 2-nodes cluster is
a dual machine with shared disks backend.
Each node has two 10Gb ethernets dedicated to the public ip and the
admin console.
Then there is a third 100Mb ethernet connecting the two machines internally.
I was going to use this last one as fencing via ssh, but looks like this
way I'm not gonna have ip/pool/zone movements if one of the nodes
freezes or halts without shutting down pacemaker clean.
What should I use instead?
I'm guessing as a dual machine, they share a power supply, so that rules
out a power switch. If the box has IPMI that can individually power
cycle each host, you can use fence_ipmilan. If the disks are shared via
iSCSI, you could use fence_scsi. If the box has a hardware watchdog
device that can individually target the hosts, you could use sbd. If
none of those is an option, probably the best you could do is run the
cluster nodes as VMs on each host, and use fence_xvm.
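As a rough illustration of the IPMI route with pcs (the device address,
credentials and host name below are placeholders; see `man fence_ipmilan`
for the parameters of your agent version):

    pcs stonith create fence-node1 fence_ipmilan \
        pcmk_host_list="node1" ipaddr=10.0.0.11 \
        login=admin passwd=secret lanplus=1 \
        op monitor interval=60s

One such stonith resource would be created per host to be fenced, each
pointing at that host's IPMI controller.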
Thanks for your help,
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
--
From: Ken Gaillot
To: users@clusterlabs.org
Date: 31 August 2016 17.25.05 CEST
Subject: Re: [ClusterLabs] ip clustering strange behaviour
On 08/30/2016 01:52 AM, Gabriele Bulfon wrote:
Sorry for reiterating, but my main question was:
why does node 1 remove its own IP if I shut down node 2 abruptly?
I understand that it does not take the node 2 IP (because the
ssh-fencing has no clue about what happened on the 2nd node), but I
wouldn't expect it to shut down its own IP...this would kill any
service
on both nodes...what am I wrong?
Assuming you're using corosync 2, be sure you have "two_node: 1" in
corosync.conf. That will tell corosync to pretend there is always
quorum, so pacemaker doesn't need any special quorum settings. See the
votequorum(5) man page for details. Of course, you need fencing in this
setup, to handle when communication between the nodes is broken but both
are still up.

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon

*From:* Gabriele Bulfon
*To:* kwenn...@redhat.com Cluster Labs - All topics related to
open-source clustering welcomed
*Date:* 29 August 2016 17.37.36 CEST
*Subject:* Re: [ClusterLabs] ip clustering strange behaviour
Ok, got it, I hadn't gracefully shut pacemaker on node2.
Now I restarted, everything was up, stopped pacemaker service on
host2 and I got host1 with 

[ClusterLabs] Antw: Re: fence_apc delay?

2016-09-05 Thread Ulrich Windl
>>> Marek Grac wrote on 03.09.2016 at 14:41:
> Hi,
> 
> There are two problems mentioned in the email.
> 
> 1) power-wait
> 
> Power-wait is a quite advanced option and there are only a few fence
> devices/agents where it makes sense, and only because the HW/firmware on the
> device is somewhat broken. Basically, when we execute a power ON/OFF
> operation, we wait for power-wait seconds before we send the next command. I
> don't remember any issue of this kind with APC.
> 
> 
> 2) the only theory I could come up with was that maybe the fencing
> operation was considered complete too quickly?
> 
> That is virtually not possible. Even when power ON/OFF is asynchronous, we
> test the status of the device and the fence agent waits until the status of
> the plug/VM/... matches what the user wants.

I can imagine that a powerful power supply can deliver up to one second of 
power even if the mains is disconnected. If the cluster is very quick after 
fencing, it might be a problem. I'd suggest a 5 to 10 second delay between 
fencing action and cluster reaction.

> 
> 
> m,
> 
> 
> On Fri, Sep 2, 2016 at 3:14 PM, Dan Swartzendruber 
> wrote:
> 
>>
>> So, I was testing my ZFS dual-head JBOD 2-node cluster.  Manual failovers
>> worked just fine.  I then went to try an acid-test by logging in to node A
>> and doing 'systemctl stop network'.  Sure enough, pacemaker told the APC
>> fencing agent to power-cycle node A.  The ZFS pool moved to node B as
>> expected.  As soon as node A was back up, I migrated the pool/IP back to
>> node A.  I *thought* all was okay, until a bit later, I did 'zpool status',
>> and saw checksum errors on both sides of several of the vdevs.  After much
>> digging and poking, the only theory I could come up with was that maybe the
>> fencing operation was considered complete too quickly?  I googled for
>> examples using this, and the best tutorial I found showed using a
>> power-wait=5, whereas the default seems to be power-wait=0?  (this is
>> CentOS 7, btw...)  I changed it to use 5 instead of 0, and did a several
>> fencing operations while a guest VM (vsphere via NFS) was writing to the
>> pool.  So far, no evidence of corruption.  BTW, the way I was creating and
>> managing the cluster was with the lcmc java gui.  Possibly the power-wait
>> default of 0 comes from there, I can't really tell.  Any thoughts or ideas
>> appreciated :)
>>
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> http://clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
>>







Re: [ClusterLabs] pacemaker doesn't failover when httpd killed

2016-09-05 Thread Kristoffer Grönlund
Nurit Vilosny  writes:

> Hi everyone,
> I tried the IRC for that, but I get disconnected and cannot see the reply...
> So I try again:
> I have a cluster with 3 nodes and 2 services: apache and application service 
> - grouped together.
> Debugging the cluster I used kill -9 to kill the httpd process, assuming the 
> services will migrate to another node, but they didn't.
> The log didn't show anything, and I remember reading that pacemaker checks httpd
> status somewhere other than with "service httpd status" - but I couldn't find where.
> Any idea what I can do?

Without any attached logs or before/after status information, it's
difficult to know what exactly happened in your case. But by default,
Pacemaker tries to restart the service on the same node before migrating
to another node. So running kill -9 on httpd should result in a restart
on the same node, not a migration.
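
If you would rather have the first failure push the resource to another node
instead of a local restart, you can lower the failure threshold, e.g. with pcs
(WebSite is a placeholder resource name; migration-threshold defaults to
INFINITY):

    pcs resource meta WebSite migration-threshold=1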

Cheers,
Kristoffer

>
> Thanks.
> Nurit
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- 
// Kristoffer Grönlund
// kgronl...@suse.com
