On 22/08/2012 18:23, Andrew Martin wrote:
Hello,


I have a 3-node Pacemaker + Heartbeat cluster (two real nodes and one quorum node that cannot run resources) running on Ubuntu 12.04 Server amd64. This cluster has a DRBD resource that it mounts and then runs a KVM virtual machine from. I have configured the cluster to use ocf:pacemaker:ping with two other devices on the network (192.168.0.128, 192.168.0.129), and set constraints to move the resources to the best-connected node (whichever node can see more of these two devices):

primitive p_ping ocf:pacemaker:ping \
        params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
               multiplier="1000" attempts="8" debug="true" \
        op start interval="0" timeout="60" \
        op monitor interval="10s" timeout="60"
...

clone cl_ping p_ping \
meta interleave="true"

...
location loc_run_on_most_connected g_vm \
rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping


Today, 192.168.0.128's network cable was unplugged for a few seconds and then plugged back in. During this time, Pacemaker recognized that it could not ping 192.168.0.128 and restarted all of the resources, but left them on the same node. My understanding was that since neither node could ping 192.168.0.128 during this period, Pacemaker would do nothing with the resources (leave them running). It would only migrate or restart the resources if, for example, node2 could ping 192.168.0.128 but node1 could not (move the resources to where things are better connected). Is this understanding incorrect? If so, is there a way I can change my configuration so that it will only restart/migrate resources if one node is found to be better connected?

Can you tell me why these resources were restarted? I have attached the syslog 
as well as my full CIB configuration.

Thanks,

Andrew Martin


This is an interesting question and I'm also interested in answers.

I had the same observations, and there is also the case where the monitor() operations aren't synced across all nodes: node1 issues a monitor() on the ping resource and finds the ping node dead; node2 hasn't pinged yet, so node1 moves things to node2; but node2 then issues its own monitor() and also finds the ping node dead.

The only solution I found was to adjust the dampen parameter to at least 2 * the monitor interval, so that I can be *sure* all nodes have issued a monitor() and have all decreased their scores, so that when a decision is made, nothing moves.

It's been a long time since I tested this; my cluster is very, very stable, so I guess I should retry it to validate that this trick still works.

====

dampen (integer, [5s]): Dampening interval
    The time to wait (dampening) for further changes to occur

Eg:

primitive ping-nq-sw-swsec ocf:pacemaker:ping \
        params host_list="192.168.10.1 192.168.2.11 192.168.2.12" \
               dampen="35s" attempts="2" timeout="2" multiplier="100" \
        op monitor interval="15s"




--
Cheers,
Florian Crouzat
