Re: [Pacemaker] Split-site cluster in two locations

2011-01-11 Thread Andrew Beekhof
On Tue, Dec 28, 2010 at 10:21 PM, Anton Altaparmakov  wrote:
> Hi,
>
> On 28 Dec 2010, at 20:32, Michael Schwartzkopff wrote:
>> Hi,
>>
>> I have four nodes in a split site scenario located in two computing centers.
>> STONITH is enabled.
>>
>> Is there any best practice for dealing with this setup? Does it make sense to
>> set expected-quorum-votes to "3" so that the whole setup keeps running with only
>> one data center online? Is this possible at all?
>>
>> Is quorum needed with STONITH enabled?
>>
>> Is there a quorum server available already?
>
> I couldn't see a quorum server in Pacemaker, so I have installed a third dummy
> node which is not allowed to run any resources (using location constraints
> and setting the cluster to not be symmetric) and which just acts as a third vote.
> I am hoping this effectively acts as a quorum server: a node that loses
> connectivity will lose quorum and shut down its services, whilst the other
> real node will retain connectivity, and thus quorum, due to the dummy node
> still being present.
>
> Obviously this is quite wasteful of servers, as you can only run a single
> Pacemaker instance on a server (as far as I know), so that is a lot of dummy
> servers when you run multiple Pacemaker clusters... The solution for us is to
> use virtualization: one physical server with VMs, and each VM is a dummy node
> for a cluster...
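The constraint setup described above amounts to an opt-in cluster plus explicit
location scores for the real nodes only; a rough crm shell sketch with made-up
node and resource names:

  # Sketch only; "real1", "real2" and "web-ip" are placeholders.
  crm configure property symmetric-cluster=false   # nothing runs anywhere by default
  crm configure location loc-web-real1 web-ip 100: real1
  crm configure location loc-web-real2 web-ip 100: real2
  # The dummy node is never listed, so it only ever contributes its vote.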

With recent 1.1.x builds it should be possible to run just the
corosync piece (no pacemaker).
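In that setup the third machine would carry the same totem/membership
configuration as the real nodes but simply never start the pacemaker layer.
A rough corosync 1.x sketch; the addresses are placeholders and the exact file
layout varies by distribution:

  # /etc/corosync/corosync.conf on the vote-only node (sketch)
  totem {
      version: 2
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.1.0     # placeholder cluster network
          mcastaddr: 226.94.1.1        # placeholder multicast address
          mcastport: 5405
      }
  }
  # Deliberately no "service { name: pacemaker ... }" stanza here, so the
  # node joins the membership and contributes a vote but never manages
  # resources.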

>
> Best regards,
>
>        Anton
>
>>
>> Thanks for any hints,
>>
>> Greetings!
>>
>> --
>> Dr. Michael Schwartzkopff
>> Guardinistr. 63
>> 81375 München
>>
>> Tel: (0163) 172 50 98
>
> Best regards,
>
>        Anton
> --
> Anton Altaparmakov  (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer, http://www.linux-ntfs.org/
>
>
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Split-site cluster in two locations

2011-01-11 Thread Christoph Herrmann
-Original Message-
From: Andrew Beekhof 
Sent: Tue 11.01.2011 09:01
To: The Pacemaker cluster resource manager ; 
CC: Michael Schwartzkopff ; 
Subject: Re: [Pacemaker] Split-site cluster in two locations

> On Tue, Dec 28, 2010 at 10:21 PM, Anton Altaparmakov  wrote:
> > Hi,
> >
> > On 28 Dec 2010, at 20:32, Michael Schwartzkopff wrote:
> >> Hi,
> >>
> >> I have four nodes in a split site scenario located in two computing centers.
> >> STONITH is enabled.
> >>
> >> Is there any best practice for dealing with this setup? Does it make sense to
> >> set expected-quorum-votes to "3" so that the whole setup keeps running with
> >> only one data center online? Is this possible at all?
> >>
> >> Is quorum needed with STONITH enabled?
> >>
> >> Is there a quorum server available already?
> >
> > I couldn't see a quorum server in Pacemaker, so I have installed a third dummy
> > node which is not allowed to run any resources (using location constraints
> > and setting the cluster to not be symmetric) and which just acts as a third
> > vote. I am hoping this effectively acts as a quorum server: a node that loses
> > connectivity will lose quorum and shut down its services, whilst the other
> > real node will retain connectivity, and thus quorum, due to the dummy node
> > still being present.
> >
> > Obviously this is quite wasteful of servers, as you can only run a single
> > Pacemaker instance on a server (as far as I know), so that is a lot of dummy
> > servers when you run multiple Pacemaker clusters... The solution for us is to
> > use virtualization: one physical server with VMs, and each VM is a dummy node
> > for a cluster...
> 
> With recent 1.1.x builds it should be possible to run just the
> corosync piece (no pacemaker).
> 

As long as you have only two computing centers it doesn't matter whether you run a
corosync-only piece or whatever else on a physical or a virtual machine. The question
is: how do you configure a four-node (or six-node, any even number bigger than two)
corosync/pacemaker cluster to continue services if you have a blackout in one
computing center (you will always lose at least one half of your nodes), but to shut
down everything if you have fewer than half of the nodes available? Are there any
best practices on how to deal with clusters in two computing centers? Anything like
an external quorum node or a quorum partition? I'd like to set expected-quorum-votes
to "3", but this is not possible (with corosync-1.2.6 and pacemaker-1.1.2 on SLES11
SP1). Does anybody know why?
Currently, the only way I can figure out is to run the cluster with
no-quorum-policy="ignore". But I don't like that. Any suggestions?


Best regards

  Christoph
-- 
Vorstand/Board of Management:
Dr. Bernd Finkbeiner, Dr. Roland Niemeier, 
Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/
Chairman of the Supervisory Board:
Michel Lepert
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] pingd process dies for no reason

2011-01-11 Thread Patrik . Rapposch
We already made changes to the interval and timeout
(<op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>).

How big should dampen be set?

Please correct me if I am wrong; I calculate it as follows:
assuming the last check was OK and the failure happens in the next second,
there would be 29s until the next check starts, plus another 10 seconds of
timeout, plus 5 seconds of dampen. That would be 44 seconds; isn't
that enough?

thx,

kr patrik




Mit freundlichen Grüßen / Best Regards

Patrik Rapposch, BSc
System Administration

KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria 
Phone: +43 3842 805-915
Fax: +43 3842 82930-500
patrik.rappo...@knapp.com 
www.KNAPP.com 

Commercial register number: FN 138870x
Commercial register court: Leoben

The information in this e-mail (including any attachment) is confidential 
and intended to be for the use of the addressee(s) only. If you have 
received the e-mail by mistake, any disclosure, copy, distribution or use 
of the contents of the e-mail is prohibited, and you must delete the 
e-mail from your system. As e-mail can be changed electronically KNAPP 
assumes no responsibility for any alteration to this e-mail or its 
attachments. KNAPP has taken every reasonable precaution to ensure that 
any attachment to this e-mail has been swept for virus. However, KNAPP 
does not accept any liability for damage sustained as a result of such 
attachment being virus infected and strongly recommend that you carry out 
your own virus check before opening any attachment.



Dejan Muhamedagic
10.01.2011 17:48
Please reply to
The Pacemaker cluster resource manager

To
The Pacemaker cluster resource manager
Cc

Subject
Re: [Pacemaker] Antwort: Re: pingd process dies for no reason






Hi again,

On Fri, Jan 07, 2011 at 04:35:40PM +0100, patrik.rappo...@knapp.com wrote:
[...]
> > the resource is configured in the following way:
> >
> > <clone ...>
> >   <meta_attributes ...>
> >     <nvpair ... name="globally-unique" value="false"/>
> >   </meta_attributes>
> >   <primitive ... type="ping">
> >     <instance_attributes ...>
> >       <nvpair ... name="host_list" value="xxx.xxx.xxx.xxx"/>
> >       <nvpair ... name="multiplier" value="100"/>
> >       <nvpair ... name="dampen" value="5s"/>
> >     </instance_attributes>
> >     <operations>
> >       <op ... name="monitor" timeout="5s"/>

5s is way too short.

Thanks,

Dejan

> >     </operations>
> >   </primitive>
> > </clone>
> >
> > thx for your help in advance.
> > 
> > Mit freundlichen Grüßen / Best Regards
> > 
> > Patrik Rapposch, BSc
> 
> Please use the "ping" resource agent instead of "pingd".
> 
> Greetings,
> 
> -- 
> Dr. Michael Schwartzkopff
> Guardinistr. 63
> 81375 München
> 
> Tel: (0163) 172 50 98



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Split-site cluster in two locations

2011-01-11 Thread Robert van Leeuwen
-Original message-
To: The Pacemaker cluster resource manager ; 
From:   Christoph Herrmann 
Sent:   Tue 11-01-2011 10:24
Subject:Re: [Pacemaker] Split-site cluster in two locations
 
> As long as you have only two computing centers it doesn't matter whether you run a
> corosync-only piece or whatever else on a physical or a virtual machine. The
> question is: how do you configure a four-node (or six-node, any even number bigger
> than two) corosync/pacemaker cluster to continue services if you have a blackout in
> one computing center (you will always lose at least one half of your nodes), but to
> shut down everything if you have fewer than half of the nodes available? Are there
> any best practices on how to deal with clusters in two computing centers? Anything
> like an external quorum node or a quorum partition? I'd like to set
> expected-quorum-votes to "3", but this is not possible (with corosync-1.2.6 and
> pacemaker-1.1.2 on SLES11 SP1). Does anybody know why?
> Currently, the only way I can figure out is to run the cluster with
> no-quorum-policy="ignore". But I don't like that. Any suggestions?


Apart from the number of nodes in the datacenter, with two datacenters you have
another issue: how do you know which DC is reachable (from your clients' point of
view) when the communication between the DCs fails?
The best fix for this would be a node at a third DC, but you still run into problems
with the fencing devices.
I doubt you can remotely power off the non-responding DC :-)
So a split-brain situation is likely to happen sometime.

So for 100% data integrity I think it is best to let the cluster freeze itself...
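In Pacemaker terms that corresponds to the no-quorum-policy cluster property; a
one-line crm shell sketch:

  # Stop starting or moving resources when quorum is lost, but leave the ones
  # already running untouched (other values: stop, ignore, suicide).
  crm configure property no-quorum-policy=freeze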

Best Regards,
Robert van Leeuwen

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Split-site cluster in two locations

2011-01-11 Thread Holger Teutsch
On Tue, 2011-01-11 at 10:21 +0100, Christoph Herrmann wrote:
> -Original Message-
> From: Andrew Beekhof 
> Sent: Tue 11.01.2011 09:01
> To: The Pacemaker cluster resource manager ; 
> CC: Michael Schwartzkopff ; 
> Subject: Re: [Pacemaker] Split-site cluster in two locations
> 
> > On Tue, Dec 28, 2010 at 10:21 PM, Anton Altaparmakov  wrote:
> > > Hi,
> > >
> > > On 28 Dec 2010, at 20:32, Michael Schwartzkopff wrote:
> > >> Hi,
> > >>
> > >> I have four nodes in a split site scenario located in two computing centers.
> > >> STONITH is enabled.
> > >>
> > >> Is there any best practice for dealing with this setup? Does it make sense to
> > >> set expected-quorum-votes to "3" so that the whole setup keeps running with
> > >> only one data center online? Is this possible at all?
> > >>
> > >> Is quorum needed with STONITH enabled?
> > >>
> > >> Is there a quorum server available already?
> > >
> > > I couldn't see a quorum server in Pacemaker, so I have installed a third dummy
> > > node which is not allowed to run any resources (using location constraints
> > > and setting the cluster to not be symmetric) and which just acts as a third
> > > vote. I am hoping this effectively acts as a quorum server: a node that loses
> > > connectivity will lose quorum and shut down its services, whilst the other
> > > real node will retain connectivity, and thus quorum, due to the dummy node
> > > still being present.
> > >
> > > Obviously this is quite wasteful of servers, as you can only run a single
> > > Pacemaker instance on a server (as far as I know), so that is a lot of dummy
> > > servers when you run multiple Pacemaker clusters... The solution for us is to
> > > use virtualization: one physical server with VMs, and each VM is a dummy node
> > > for a cluster...
> > 
> > With recent 1.1.x builds it should be possible to run just the
> > corosync piece (no pacemaker).
> > 
> 
> As long as you have only two computing centers it doesn't matter whether you run a
> corosync-only piece or whatever else on a physical or a virtual machine. The
> question is: how do you configure a four-node (or six-node, any even number bigger
> than two) corosync/pacemaker cluster to continue services if you have a blackout
> in one computing center (you will always lose at least one half of your nodes),
> but to shut down everything if you have fewer than half of the nodes available?
> Are there any best practices on how to deal with clusters in two computing
> centers? Anything like an external quorum node or a quorum partition? I'd like to
> set expected-quorum-votes to "3", but this is not possible (with corosync-1.2.6
> and pacemaker-1.1.2 on SLES11 SP1). Does anybody know why?
> Currently, the only way I can figure out is to run the cluster with
> no-quorum-policy="ignore". But I don't like that. Any suggestions?
> 
> 
> Best regards
> 
>   Christoph

Hi,
I assume the only solution is to work with manual intervention, i.e. the
stonith meatware module.
Whenever a site goes down, a human being has to confirm that it is lost and
pull the power cords or the inter-site links so it will not come back
unintentionally.

Then confirm with meatclient on the healthy site that the no-longer-reachable
site can be considered gone.

In theory this can be configured with an additional meatware stonith
resource with lower priority. The intention is to let your regular
stonith resources do the work, with meatware as a last resort.
However, I was not able to get this running with the versions packaged with
SLES11 SP1: the priority was not honored, and a lot of zombie meatware
processes were left over.
I found some patches in the upstream repositories that seem to address
these problems, but I didn't follow up.
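A rough sketch of that layout in crm shell syntax, with placeholder host names
and parameters, and with the caveat above that the priority handling may not
behave as intended on those package versions:

  # Regular, automatic fencing device (tried first, assuming higher
  # priority wins):
  crm configure primitive st-ipmi stonith:external/ipmi \
          params hostname=nodea ipaddr=10.0.0.1 userid=admin passwd=secret \
          meta priority=100
  # Manual-confirmation fallback, intended as the last resort:
  crm configure primitive st-meat stonith:meatware \
          params hostlist="nodea nodeb" \
          meta priority=10
  # Once a human has verified that the remote site is really dead:
  meatclient -c nodea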

Regards
Holger


 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pingd process dies for no reason

2011-01-11 Thread Lars Ellenberg
On Tue, Jan 11, 2011 at 11:24:35AM +0100, patrik.rappo...@knapp.com wrote:
> We already made changes to the interval and timeout
> (<op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>).
> 
> How big should dampen be set?
> 
> Please correct me if I am wrong; I calculate it as follows:
> assuming the last check was OK and the failure happens in the next second,
> there would be 29s until the next check starts, plus another 10 seconds of
> timeout, plus 5 seconds of dampen. That would be 44 seconds; isn't
> that enough?

I think "dampen" needs to be larger than the monitoring interval.
And the timeout on the operation should be large enough that
ping, even if the remote is unreachable for the first time,
will time out by itself (and not be killed prematurely by the lrmd because
the operation timeout elapsed).

try with interval 15s, dampen 20,
  instance parameter timeout: something explicit, if you want to.
  instance parameter attempts: something explicit, if you want to.
 monitor operation timeout=60s 
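Expressed as a crm shell snippet (resource names and the host_list value are
placeholders):

  crm configure primitive p-ping ocf:pacemaker:ping \
          params host_list="192.168.1.254" multiplier="100" dampen="20s" \
          op monitor interval="15s" timeout="60s"
  crm configure clone c-ping p-ping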

BTW, someone should really implement the fping based ping RA ...
Or did I miss it?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] pingd process dies for no reason

2011-01-11 Thread Patrik . Rapposch
Hi,

Thanks, I have configured these values now. I hope that we won't face this problem
again; otherwise, like I said, I have turned on the debug mode of the ping RA,
and if I get the next maintenance window, I'll turn on cluster debug mode as well,
so we'd have more log information to find the reason for this problem.

thx again.

kr patrik


Mit freundlichen Grüßen / Best Regards

Patrik Rapposch, BSc
System Administration

KNAPP Systemintegration GmbH
Waltenbachstraße 9
8700 Leoben, Austria 
Phone: +43 3842 805-915
Fax: +43 3842 82930-500
patrik.rappo...@knapp.com 
www.KNAPP.com 

Commercial register number: FN 138870x
Commercial register court: Leoben

The information in this e-mail (including any attachment) is confidential 
and intended to be for the use of the addressee(s) only. If you have 
received the e-mail by mistake, any disclosure, copy, distribution or use 
of the contents of the e-mail is prohibited, and you must delete the 
e-mail from your system. As e-mail can be changed electronically KNAPP 
assumes no responsibility for any alteration to this e-mail or its 
attachments. KNAPP has taken every reasonable precaution to ensure that 
any attachment to this e-mail has been swept for virus. However, KNAPP 
does not accept any liability for damage sustained as a result of such 
attachment being virus infected and strongly recommend that you carry out 
your own virus check before opening any attachment.



Lars Ellenberg
11.01.2011 14:47
Please reply to
The Pacemaker cluster resource manager

To
pacemaker@oss.clusterlabs.org
Cc

Subject
Re: [Pacemaker] pingd process dies for no reason






On Tue, Jan 11, 2011 at 11:24:35AM +0100, patrik.rappo...@knapp.com wrote:
> We already made changes to the interval and timeout
> (<op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>).
> 
> How big should dampen be set?
> 
> Please correct me if I am wrong; I calculate it as follows:
> assuming the last check was OK and the failure happens in the next second,
> there would be 29s until the next check starts, plus another 10 seconds of
> timeout, plus 5 seconds of dampen. That would be 44 seconds; isn't
> that enough?

I think "dampen" needs to be larger than the monitoring interval.
And the timeout on the operation should be large enough that
ping, even if the remote is unreachable for the first time,
will time out by itself (and not be killed prematurely by the lrmd because
the operation timeout elapsed).

try with interval 15s, dampen 20,
  instance parameter timeout: something explicit, if you want to.
  instance parameter attempts: something explicit, if you want to.
 monitor operation timeout=60s 

BTW, someone should really implement the fping based ping RA ...
Or did I miss it?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Best stonith method to avoid split brain on a drbd cluster

2011-01-11 Thread Dejan Muhamedagic
Hi,

On Wed, Jan 05, 2011 at 05:18:36PM -0700, Devin Reade wrote:
> Johannes Freygner  wrote:
> 
> > *) Yes, and I found the wrong setting:
> 
> Excellent.
> 
> > But if I pull the power cable without a regular shutting down,
> > the powerless node gets status "UNCLEAN (offline)" and the 
> > resources remains stopped.
> 
> I would contend that would be correct behavior as (again assuming that
> you have redundant power sources), the only way that should happen is 
> with multiple failures (against which HA in general does not protect).
> 
> But, it's *your* cluster :)
> 
> > I found and tested a workaround: I use "meatware" as a second fencing device.
> [snip]
> 
> I would be wary of meatware; I believe that it's intended for testing only.

No, it's not for testing.

> I tried using it in production on one site where the nodes of the HA cluster
> were virtual machines under the free version of VMWare ESXi.  (That version
> has the APIs disabled that would be required to do a proper stonith
> mechanism.) I found that meatware was flakey at best.

If it was flakey, then you ran into a bug. I think that there
were some issues with pacemaker 1.1, but can't recall details
anymore.

Thanks,

Dejan

> As an aside, I will likely never again try to deploy HA nodes as ESXi
> VMs unless the site is running the commercial version.
> 
> Devin
> -- 
> It is far, far better to have a bastard in the 
> family than an unemployed son-in-law. - Robert Heinlein
> 
> 

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pingd process dies for no reason

2011-01-11 Thread Andrew Beekhof
On Tue, Jan 11, 2011 at 2:45 PM, Lars Ellenberg
 wrote:
> On Tue, Jan 11, 2011 at 11:24:35AM +0100, patrik.rappo...@knapp.com wrote:
>> We already made changes to the interval and timeout
>> (<op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>).
>>
>> How big should dampen be set?
>>
>> Please correct me if I am wrong; I calculate it as follows:
>> assuming the last check was OK and the failure happens in the next second,
>> there would be 29s until the next check starts, plus another 10 seconds of
>> timeout, plus 5 seconds of dampen. That would be 44 seconds; isn't
>> that enough?
>
> I think "dampen" needs to be larger than the monitoring interval.
> And the timeout on the operation should be large enough that
> ping, even if the remote is unreachable for the first time,
> will time out by itself (and not be killed prematurely by the lrmd because
> the operation timeout elapsed).
>
> try with interval 15s, dampen 20,
>  instance parameter timeout: something explicit, if you want to.
>  instance parameter attempts: something explicit, if you want to.
>  monitor operation timeout=60s
>
> BTW, someone should really implement the fping based ping RA ...

Thankyou for volunteering :-)

> Or did I miss it?
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
>

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] pingd process dies for no reason

2011-01-11 Thread Lars Ellenberg
On Tue, Jan 11, 2011 at 03:53:29PM +0100, Andrew Beekhof wrote:
> On Tue, Jan 11, 2011 at 2:45 PM, Lars Ellenberg
>  wrote:
> > On Tue, Jan 11, 2011 at 11:24:35AM +0100, patrik.rappo...@knapp.com wrote:
> >> We already made changes to the interval and timeout
> >> (<op id="pingd-op-monitor-30s" interval="30s" name="monitor" timeout="10s"/>).
> >>
> >> How big should dampen be set?
> >>
> >> Please correct me if I am wrong; I calculate it as follows:
> >> assuming the last check was OK and the failure happens in the next second,
> >> there would be 29s until the next check starts, plus another 10 seconds of
> >> timeout, plus 5 seconds of dampen. That would be 44 seconds; isn't
> >> that enough?
> >
> > I think "dampen" needs to be larger than the monitoring interval.
> > And the timeout on the operation should be large enough that
> > ping, even if the remote is unreachable for the first time,
> > will time out by itself (and not be killed prematurely by the lrmd because
> > the operation timeout elapsed).
> >
> > try with interval 15s, dampen 20,
> >  instance parameter timeout: something explicit, if you want to.
> >  instance parameter attempts: something explicit, if you want to.
> >  monitor operation timeout=60s
> >
> > BTW, someone should really implement the fping based ping RA ...
> 
> Thankyou for volunteering :-)

  :-P

 Date: Fri, 3 Sep 2010 12:12:58 +0200
 From: Bernd Schubert
 Subject: Re: [Pacemaker] pingd

On Friday, September 03, 2010, Lars Ellenberg wrote:
> > > how about an fping RA ?
> > > active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)
> > > 
> > > terminates in about 3 seconds for a hostlist of 100 (on the LAN, 29 of
> > > which are alive).
> > 
> > Happy to add if someone writes it :-)
> 
> I thought so ;-)
> Additional note to whomever is going to:
> 
> With fping you can get fancy about "better connectivity",
> you are not limited to the measure "number of nodes responding".

I think for the beginning, just the basic feature should be sufficient.
Actually I thought about adding an option to the existing ping RA to let the
user choose between ping and fping; it would default to ping. I will do that
in the middle of next week.

...

Bernd?
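For illustration, a fragment of what such a monitor function might do, using the
fping one-liner quoted above and publishing the result through attrd the way the
ping RA does; this is a hypothetical sketch, not a shipped agent:

  #!/bin/sh
  # Hypothetical monitor fragment; host list, multiplier and dampen values
  # are placeholders.
  host_list="10.0.0.1 10.0.0.2 10.0.0.3"
  multiplier=100
  dampen=20s

  # Count how many of the listed hosts answer (fping options as in the quote above).
  active=$(fping -a -i 5 -t 250 -B1 -r1 $host_list 2>/dev/null | wc -l)

  # Publish the connectivity score as a transient node attribute.
  attrd_updater -n pingd -v $(( active * multiplier )) -d "$dampen"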



-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


[Pacemaker] Speed up resource failover?

2011-01-11 Thread Patrick H.
As it is right now, Pacemaker seems to take a long time (in computer
terms) to fail resources over from one node to the other. Right now, I
have 477 IPaddr2 resources evenly distributed across 2 nodes. When I put
one node in standby, it takes approximately 5 minutes to move half
of those from one node to the other. And before you ask, they're there because
of SSL HTTP virtual hosting. I have no order rules, colocations or
anything on those resources, so it should be able to migrate the entire
list simultaneously, but it seems to do them sequentially. Is there any
way to make it migrate the resources in parallel? Or at the very least
speed it up?
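One knob worth checking, offered as a guess rather than a confirmed fix: the
batch-limit cluster property caps how many actions the transition engine runs
in parallel, so a low value can make a large transition look sequential. A crm
shell sketch:

  # Allow more actions per transition to run in parallel (sketch; pick a
  # value appropriate for the hardware).
  crm configure property batch-limit=100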


-Patrick
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker