Re: [ClusterLabs] [Pacemaker] large cluster - failure recovery

Cédric Dufour - Idiap Research Institute Thu, 19 Nov 2015 00:46:03 -0800

[coming over from the old mailing list pacema...@oss.clusterlabs.org; sorry for 
any thread discrepancy]

Hello,

We've also set up a fairly large cluster - 24 nodes / 348 resources (pacemaker 
1.1.12, corosync 1.4.7) - and pacemaker 1.1.12 is definitely the minimum 
version you'll want, thanks to changes in how the CIB is handled.

If you're also going to handle a large number (several hundred) of resources, 
you may need to concern yourself with the CIB size as well.
You may want to have a look at pp.17-18 of the document I wrote to describe our 
setup: http://cedric.dufour.name/cv/download/idiap_havc2.pdf
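
For a rough idea of how large a CIB is, something along the following lines 
may help. It is only a minimal sketch, not taken from the document above: the 
function name cib_summary and the sample XML are made up for illustration, and 
it assumes the input is a CIB XML dump such as the one "cibadmin --query" 
prints.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed CIB used only to exercise the function below.
SAMPLE_CIB = """<cib>
  <configuration>
    <nodes>
      <node id="1" uname="node1"/>
      <node id="2" uname="node2"/>
    </nodes>
    <resources>
      <primitive id="vm1" class="ocf" provider="heartbeat" type="VirtualDomain"/>
      <group id="g1">
        <primitive id="ip1" class="ocf" provider="heartbeat" type="IPaddr2"/>
        <primitive id="fs1" class="ocf" provider="heartbeat" type="Filesystem"/>
      </group>
    </resources>
  </configuration>
  <status/>
</cib>"""


def cib_summary(cib_xml: str) -> dict:
    """Return rough size figures for a CIB XML dump (illustrative helper)."""
    root = ET.fromstring(cib_xml)
    return {
        # Size of the XML text once encoded as UTF-8
        "bytes": len(cib_xml.encode("utf-8")),
        # Nodes declared in the configuration section
        "nodes": len(root.findall("./configuration/nodes/node")),
        # iter() also finds primitives nested inside groups and clones
        "primitives": sum(1 for _ in root.iter("primitive")),
    }


print(cib_summary(SAMPLE_CIB))
```

On a live cluster, the output of "cibadmin --query" could be saved to a file 
and its contents passed to cib_summary() in place of SAMPLE_CIB.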

Currently, I would consider that with 24 nodes / 348 resources, we are close to 
the limit of what our cluster can handle, the bottleneck being CPU(core) power 
for CIB/CRM handling. Our "worst performing nodes" (out of the 24 in the 
cluster) are Xeon E7-2830 @ 2.13GHz.
The main issue we currently face is when a DC is taken out and a new one must 
be elected: CPU goes to 100% for several tens of seconds (even minutes), during 
which the cluster is totally unresponsive. Fortunately, resources themselves 
just sit tight and remain available (I can't say about those that would need to 
be migrated because they are colocated with the DC; we manually avoid that 
situation when performing maintenance that may affect the DC).

I'm looking forward to migrating to corosync 2+ (there are some backports 
available for Debian/Jessie) and seeing if this would allow us to push the limit 
further. Unfortunately, I can't say for sure as I have only a limited 
understanding of how Pacemaker/Corosync work and where CPU is bound to become a 
bottleneck.

[UPDATE] Thanks Ken for the Pacemaker Remote pointer; I'm heading off to have a 
look at that.

Hope this helps,

Cédric

On 04/11/15 23:26, Radoslaw Garbacz wrote:
> Thank you, will give it a try.
>
> On Wed, Nov 4, 2015 at 12:50 PM, Trevor Hemsley <thems...@voiceflex.com> wrote:
>
>     On 04/11/15 18:41, Radoslaw Garbacz wrote:
>     > Details:
>     > OS: CentOS 6
>     > Pacemaker: Pacemaker 1.1.9-1512.el6
>     > Corosync: Corosync Cluster Engine, version '2.3.2'
>
>     yum update
>
>     Pacemaker is currently 1.1.12 and corosync 1.4.7 on CentOS 6. There were
>     major improvements in speed with later versions of pacemaker.
>
>     Trevor
>
>     _______________________________________________
>     Pacemaker mailing list: pacema...@oss.clusterlabs.org
>     http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
>     Project Home: http://www.clusterlabs.org
>     Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>     Bugs: http://bugs.clusterlabs.org
>
>
>
>
> -- 
> Best Regards,
>
> Radoslaw Garbacz
> XtremeData Incorporation
>

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
