Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

Andrew Martin Thu, 23 Aug 2012 16:39:38 -0700

Hi Florian,


Thanks for the suggestion. I gave it a try, but even with a dampen value
greater than twice the monitor interval, the same behavior occurred (Pacemaker
restarted the resources on the same node). Here are my current
ocf:pacemaker:ping settings:

primitive p_ping ocf:pacemaker:ping \
    params name="p_ping" host_list="192.168.0.128 192.168.0.129" dampen="25s" \
        multiplier="1000" attempts="8" debug="true" \
    op start interval="0" timeout="60" \
    op monitor interval="10s" timeout="60"


Any other ideas on what is causing this behavior? My understanding is that the
above config tells the cluster to attempt 8 pings to each of the IPs and to
treat an IP as down only if none of the 8 come back. Thus, an IP would have to
be down for more than 8 seconds to be considered down. The dampen parameter
tells the cluster to wait before making any decision, so that if the IP comes
back online within the dampen period, no action is taken. Is this correct?
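
To spell out the arithmetic I am assuming, here is a rough Python sketch. It is
not the agent's code: I am assuming the p_ping attribute is simply the number
of reachable hosts times multiplier, and the helper names ping_attribute and
best_nodes are made up for illustration.

```python
# Illustrative model only; not the resource agent's actual code.
# Assumption: published attribute = reachable hosts * multiplier.

def ping_attribute(reachable_hosts: int, multiplier: int = 1000) -> int:
    """Value a node would publish for p_ping under the stated assumption."""
    return reachable_hosts * multiplier


def best_nodes(scores: dict) -> list:
    """Nodes holding the highest score (a tie keeps every tied node)."""
    top = max(scores.values())
    return sorted(node for node, value in scores.items() if value == top)


# Both nodes can reach 192.168.0.128 and 192.168.0.129.
before = {"node1": ping_attribute(2), "node2": ping_attribute(2)}
# 192.168.0.128 is unplugged, so each node reaches only one host.
during = {"node1": ping_attribute(1), "node2": ping_attribute(1)}

print(before, best_nodes(before))
print(during, best_nodes(during))
```

If this model is right, both nodes drop by the same amount and neither becomes
better connected than the other, which is why the restart surprises me.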


Thanks, 


Andrew 


----- Original Message -----

From: "Florian Crouzat" <gen...@floriancrouzat.net> 
To: pacemaker@oss.clusterlabs.org 
Sent: Thursday, August 23, 2012 3:57:02 AM 
Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to 
restart? 

On 22/08/2012 18:23, Andrew Martin wrote:
> Hello, 
> 
> 
> I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and 1 quorum 
> node that cannot run resources) running on Ubuntu 12.04 Server amd64. This 
> cluster has a DRBD resource that it mounts and then runs a KVM virtual 
> machine from. I have configured the cluster to use ocf:pacemaker:ping with 
> two other devices on the network (192.168.0.128, 192.168.0.129), and set 
> constraints to move the resources to the most well-connected node (whichever 
> node can see more of these two devices): 
> 
> primitive p_ping ocf:pacemaker:ping \
>     params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
>         multiplier="1000" attempts="8" debug="true" \
>     op start interval="0" timeout="60" \
>     op monitor interval="10s" timeout="60"
> ... 
> 
> clone cl_ping p_ping \ 
> meta interleave="true" 
> 
> ... 
> location loc_run_on_most_connected g_vm \ 
> rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping 
> 
> 
> Today, 192.168.0.128's network cable was unplugged for a few seconds and then 
> plugged back in. During this time, pacemaker recognized that it could not 
> ping 192.168.0.128 and restarted all of the resources, but left them on the 
> same node. My understanding was that since neither node could ping 
> 192.168.0.128 during this period, pacemaker would do nothing with the 
> resources (leave them running). It would only migrate or restart the 
> resources if for example node2 could ping 192.168.0.128 but node1 could not 
> (move the resources to where things are better-connected). Is this 
> understanding incorrect? If so, is there a way I can change my configuration 
> so that it will only restart/migrate resources if one node is found to be 
> better connected? 
> 
> Can you tell me why these resources were restarted? I have attached the 
> syslog as well as my full CIB configuration. 
> 
> Thanks, 
> 
> Andrew Martin 
> 

This is an interesting question, and I'm also interested in the answers.

I have observed the same thing. There is also the case where the monitor()
operations aren't synchronized across nodes: node1 issues a monitor() on the
ping resource and finds the ping node dead; node2 hasn't pinged yet, so node1
moves things to node2; then node2 issues its own monitor() and also finds the
ping node dead.

The only solution I found was to set the dampen parameter to at least
2*monitor().interval, so that I can be *sure* that all nodes have issued
a monitor() and have all decreased their scores; then, when a decision is
made, nothing moves.
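
As a quick sanity check, that rule of thumb can be written as a tiny Python
sketch. It only encodes my own heuristic, nothing Pacemaker enforces, and the
function name dampen_covers_all_monitors is made up for illustration.

```python
def dampen_covers_all_monitors(dampen_s: float,
                               monitor_interval_s: float,
                               factor: float = 2.0) -> bool:
    """Heuristic: dampen should be at least `factor` monitor intervals long,
    so every node has had time to run a monitor() before attributes settle."""
    return dampen_s >= factor * monitor_interval_s


# dampen="35s" with a 15s monitor interval satisfies the heuristic.
print(dampen_covers_all_monitors(35, 15))
# The default dampen of 5s with the same interval does not.
print(dampen_covers_all_monitors(5, 15))
```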

I haven't tested this in a long time (my cluster is very, very stable),
so I guess I should retry to validate that it's still a working trick.

==== 

dampen (integer, [5s]): Dampening interval 
The time to wait (dampening) further changes occur 

Eg: 

primitive ping-nq-sw-swsec ocf:pacemaker:ping \
    params host_list="192.168.10.1 192.168.2.11 192.168.2.12" \
        dampen="35s" attempts="2" timeout="2" multiplier="100" \
    op monitor interval="15s"




-- 
Cheers, 
Florian Crouzat 

_______________________________________________ 
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
http://oss.clusterlabs.org/mailman/listinfo/pacemaker 

Project Home: http://www.clusterlabs.org 
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
Bugs: http://bugs.clusterlabs.org 
