Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-14 Thread Digimer

  
  
Quorum doesn't prevent split-brains, stonith (fencing) does.

https://www.alteeve.com/w/The_2-Node_Myth

There is no way to use quorum alone to avoid a potential split-brain. You might
be able to make it less likely with enough effort, but never prevent it.
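
If fencing isn't configured yet, a rough sketch of the minimum with pcs (the
fence agent, addresses and credentials below are placeholders; use whatever
matches your hardware):

  pcs stonith create fence_node1 fence_ipmilan pcmk_host_list="node1" \
      ipaddr="192.0.2.11" login="admin" passwd="secret" op monitor interval=60s
  pcs stonith create fence_node2 fence_ipmilan pcmk_host_list="node2" \
      ipaddr="192.0.2.12" login="admin" passwd="secret" op monitor interval=60s
  pcs property set stonith-enabled=true

With that in place, a node that stops responding gets fenced before the peer
takes over its resources.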
  
  digimer
  
On 2017-11-14 10:45 PM, Garima wrote:
> Hello All,
>
> A split-brain situation occurs when there is a drop in quorum and status
> information is no longer exchanged between the two nodes of the cluster.
> This can be avoided if quorum is communicated between both nodes.
>
> I have checked the code. In my opinion, these files need to be updated
> (quorum.py/stonith.py) to avoid the split-brain situation and maintain the
> Active-Passive configuration.
>
> Regards,
> Garima

Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

2017-11-14 Thread Garima
Hello All,

A split-brain situation occurs when there is a drop in quorum and status
information is no longer exchanged between the two nodes of the cluster.
This can be avoided if quorum is communicated between both nodes.
I have checked the code. In my opinion, these files need to be updated
(quorum.py/stonith.py) to avoid the split-brain situation and maintain the
Active-Passive configuration.

Regards,
Garima

From: Derek Wuelfrath [mailto:dwuelfr...@inverse.ca]
Sent: 13 November 2017 20:55
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource

Hello Ken !

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

It is not. I made sure of this in the first place :)
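
(For the record, what I checked was along these lines — the unit name here is
just a stand-in for the actual service:)

  systemctl is-enabled myservice.service     # reports "disabled"
  systemctl disable --now myservice.service  # would be the fix otherwise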

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).

Well, that’s the thing. There are not many log messages telling what is
actually happening. The ’systemd’ resource is not even trying to start (nothing
in either log for that resource). Here are the logs from my last attempt:
Scenario:
- Services were running on ‘pancakeFence2’. DRBD was synced and connected
- I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’
- After ‘pancakeFence2’ comes back, services are running just fine on 
‘pancakeFence1’ but DRBD is in Standalone due to split-brain
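
(The StandAlone state is visible in /proc/drbd; as a rough sketch of the usual
manual recovery afterwards, with "r0" standing in for the real resource name:)

  cat /proc/drbd                        # shows cs:StandAlone after the split
  # usual manual recovery: on the node whose changes get thrown away
  drbdadm secondary r0
  drbdadm connect --discard-my-data r0
  # and on the surviving node, if it also went StandAlone
  drbdadm connect r0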

Logs for pancakeFence1: https://pastebin.com/dVSGPP78
Logs for pancakeFence2: https://pastebin.com/at8qPkHE

It really looks like the status check mechanism of Corosync/Pacemaker for a
systemd resource forces the resource to “start” and therefore starts the ones
above that resource in the group (DRBD in this instance).
This does not happen for a regular OCF resource (IPaddr2, for example)

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) 
:: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), 
PacketFence (www.packetfence.org) and Fingerbank 
(www.fingerbank.org)


On Nov 10, 2017, at 11:39, Ken Gaillot <kgail...@redhat.com> wrote:

On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:

Hello there,

First post here, but I have been following for a while!

Welcome!



Here’s my issue:
we have been setting up and running this type of cluster for a
while and have never really encountered this kind of problem.

I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
along with various other resources. Some of these resources are
systemd resources… this is the part where things are “breaking”.

A two-server cluster running only DRBD, or DRBD with an OCF
IPaddr2 resource (cluster IP in this instance), works just fine. I can
easily move from one node to the other without any issue.
As soon as I add a systemd resource to the resource group, things
break. Moving from one node to the other using standby mode works
just fine, but as soon as a Corosync / Pacemaker restart involves
polling of a systemd resource, it seems to try to start
the whole resource group and therefore creates a split-brain of the
DRBD resource.
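
(To make the shape of that setup concrete, here is a rough sketch of this kind
of configuration in pcs terms — the resource names, DRBD resource, IP and
systemd unit are placeholders rather than my exact config, which is in the
‘pcs config’ output linked below:)

  pcs resource create drbd-data ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s
  pcs resource master drbd-data-master drbd-data master-max=1 master-node-max=1 \
      clone-max=2 clone-node-max=1 notify=true
  pcs resource create data-fs ocf:heartbeat:Filesystem device=/dev/drbd0 \
      directory=/data fstype=ext4
  pcs resource create cluster-ip ocf:heartbeat:IPaddr2 ip=192.0.2.100 cidr_netmask=24
  pcs resource create my-service systemd:myservice
  pcs resource group add my-group data-fs cluster-ip my-service
  pcs constraint colocation add my-group with master drbd-data-master INFINITY
  pcs constraint order promote drbd-data-master then start my-group

The point is that the systemd resource sits at the end of the group, after
DRBD has been promoted and the filesystem mounted.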

My first two suggestions would be:

Make sure that the systemd service is not enabled. If pacemaker is
managing a service, systemd can't also be trying to start and stop it.

Fencing is the only way pacemaker can resolve split-brains and certain
other situations, so that will help in the recovery.

Beyond that, the question is what log messages are there from around
the time of the issue (on both nodes).




It is the best explanation / description of the situation that I can
give. If it needs any clarification or examples, I am more than happy
to share them.

Any guidance would be appreciated :)

Here’s the output of a ‘pcs config’

https://pastebin.com/1TUvZ4X9

Cheers!
-dw

--
Derek Wuelfrath
dwuelfr...@inverse.ca :: +1.514.447.4918 (x110) 
:: +1.866.353.6153
(x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), 
PacketFence
(www.packetfence.org) and Fingerbank 
(www.fingerbank.org)
--
Ken Gaillot <kgail...@redhat.com>

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

Re: [ClusterLabs] Pacemaker 1.1.18 released

2017-11-14 Thread Digimer
On 2017-11-14 05:23 PM, Ken Gaillot wrote:
> ClusterLabs announces the release of version 1.1.18 of the Pacemaker
> cluster resource manager. The source code is available at:
> 
> https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.18
> 
> This is expected to be the last actively developed release in the 1.1
> line. Development will now begin on Pacemaker 2.0.0.
> 
> The most significant new features in 1.1.18 are:
> 
> * Warnings will be logged when legacy configuration syntax planned to
> be removed in 2.0.0 is used.
> 
> * Bundles are no longer considered experimental. They support all
> constraint types, and they now support rkt as well as Docker
> containers. Many bug fixes and enhancements have been made.
>  
> * Alerts may now be filtered so that alert agents are called only for
> desired alert types, and (as an experimental feature) it is now
> possible to receive alerts for transient node attribute changes.
> 
> * Status output (from crm_mon, pengine logs, and a new crm_resource --
> why option) now has more details about why resources are in a certain
> state.
> 
> As usual, to support the new features, the CRM feature set has been
> incremented. This means that mixed-version clusters are supported only
> during a rolling upgrade -- nodes with an older version will not be
> allowed to rejoin once they shut down.
> 
> For a more detailed list of bug fixes and other changes, see the change
> log:
> 
> https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog
> 
> Everyone is encouraged to download, compile and test the new release.
> Your feedback is important and appreciated.
> 
> Many thanks to all contributors of source code to this release,
> including Andrew Beekhof, Aravind Kumar, Artur Novik, Bin Liu, Ferenc
> Wágner, Helmut Grohne, Hideo Yamauchi, Igor Tsiglyar, Jan
> Pokorný, Kazunori INOUE, Keisuke MORI, Ken Gaillot, Klaus Wenninger,
> Nye Liu, Tomer Azran, Valentin Vidic, and Yan Gao.

Awesome! Thanks to all involved!


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Pacemaker 1.1.18 released

2017-11-14 Thread Ken Gaillot
ClusterLabs announces the release of version 1.1.18 of the Pacemaker
cluster resource manager. The source code is available at:

https://github.com/ClusterLabs/pacemaker/releases/tag/Pacemaker-1.1.18

This is expected to be the last actively developed release in the 1.1
line. Development will now begin on Pacemaker 2.0.0.

The most significant new features in 1.1.18 are:

* Warnings will be logged when legacy configuration syntax planned to
be removed in 2.0.0 is used.

* Bundles are no longer considered experimental. They support all
constraint types, and they now support rkt as well as Docker
containers. Many bug fixes and enhancements have been made.
 
* Alerts may now be filtered so that alert agents are called only for
desired alert types, and (as an experimental feature) it is now
possible to receive alerts for transient node attribute changes.

* Status output (from crm_mon, pengine logs, and a new crm_resource --
why option) now has more details about why resources are in a certain
state.
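
For example, a quick (hypothetical) way to ask why a particular resource is not
running where you expect it — the resource name here is made up:

  crm_resource --resource my-db --why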

As usual, to support the new features, the CRM feature set has been
incremented. This means that mixed-version clusters are supported only
during a rolling upgrade -- nodes with an older version will not be
allowed to rejoin once they shut down.

For a more detailed list of bug fixes and other changes, see the change
log:

https://github.com/ClusterLabs/pacemaker/blob/1.1/ChangeLog

Everyone is encouraged to download, compile and test the new release.
Your feedback is important and appreciated.

Many thanks to all contributors of source code to this release,
including Andrew Beekhof, Aravind Kumar, Artur Novik, Bin Liu, Ferenc
Wágner, Helmut Grohne, Hideo Yamauchi, Igor Tsiglyar, Jan
Pokorný, Kazunori INOUE, Keisuke MORI, Ken Gaillot, Klaus Wenninger,
Nye Liu, Tomer Azran, Valentin Vidic, and Yan Gao.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] systemd's TasksMax and pacemaker

2017-11-14 Thread Ken Gaillot
Systemd version 227 introduced a new unit file option, TasksMax, to
limit the number of processes that a service can spawn at one time.
Depending on the version, the default is either 512 or 4,915.

It is conceivable in a large cluster that Pacemaker could exceed this
limit, so we are now recommending that users set TasksMax=infinity in
the Pacemaker unit file if building from scratch, or in a local
override if already deployed, to disable the limit.
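
For an already-deployed system, the local override is an ordinary systemd
drop-in; a minimal sketch:

  # /etc/systemd/system/pacemaker.service.d/taskmax.conf
  [Service]
  TasksMax=infinity

  # then: systemctl daemon-reload (and restart pacemaker when convenient)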

We are not setting TasksMax=infinity in the shipped unit file in the
soon-to-be-released version 1.1.18 because older versions of systemd
will log a warning about an "Unknown lvalue". However, we will set it
in the 2.0.0 release, when we'll be making a number of behavioral
changes.

Particular OS distributions may have backported the TasksMax feature to
an older version of systemd, and/or changed its default value. For
example, in RHEL, TasksMax was backported as of RHEL 7.3, but the
default was changed to infinity.
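
To see what limit is actually in effect on a given system (on systemd versions
without the feature, the property simply won't be reported):

  systemctl show pacemaker.service -p TasksMax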
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] corosync race condition when node leaves immediately after joining

2017-11-14 Thread Jonathan Davies


On 13/11/17 17:06, Jan Friesse wrote:

Jonathan,
I've finished (I hope) a proper fix for the problem you've seen, so can you
please try to test it:

https://github.com/corosync/corosync/pull/280

Thanks,
   Honza


Hi Honza,

Thanks very much for putting this fix together.

I'm happy to confirm that I do not see the problem with this fix.

In my repro environment that normally triggers the problem once in every 
2 attempts, I didn't see the problem at all after over 1000 attempts 
with these patches.


Thanks!
Jonathan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] resource-agents v4.1.0 rc1

2017-11-14 Thread Oyvind Albrigtsen

ClusterLabs is happy to announce resource-agents v4.1.0 rc1.
Source code is available at:
https://github.com/ClusterLabs/resource-agents/releases/tag/v4.1.0rc1

The most significant enhancements in this release are:
- new resource agents:
 - aws-vpc-route53
 - LVM-activate
 - minio
 - NodeUtilization
 - oraasm
 - ovsmonitor
 - rkt
 - ZFS

- bugfixes and enhancements:
 - aws*: fixes and improvements
 - CTDB: fixes for newer versions
 - CTDB: fix for --logfile being replaced with --logging
 - DB2: fix HADR support for DB2 V98+
 - docker: add docker-native healthcheck
 - galera: fix for MariaDB 10.1.21+
 - mysql: set correct master score after maintenance mode
 - ocf-shellfuncs: improve locking (ocf_take_lock())
 - pgsql: add support for PostgreSQL 10
 - pgsql: allow dynamic membership
 - rabbitmq-cluster: fix to work on Pacemaker remote nodes

The full list of changes for resource-agents is available at:
https://github.com/ClusterLabs/resource-agents/blob/v4.1.0rc1/ChangeLog

Everyone is encouraged to download and test the new release candidate.
We do many regression tests and simulations, but we can't cover all
possible use cases, so your feedback is important and appreciated.

Many thanks to all the contributors to this release.


Best,
The resource-agents maintainers

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org