Re: [Pacemaker] resource stickiness and preventing stonith on failback

Bernd Schubert Wed, 24 Aug 2011 08:38:44 -0700

Hello Brian,

On 08/23/2011 10:56 PM, Brian J. Murrell wrote:

Hi All,

I am trying to configure pacemaker (1.0.10) to make a single filesystem
highly available by two nodes (please don't be distracted by the dangers
of multiply mounted filesystems and clustering filesystems, etc., as I
am absolutely clear about that -- consider that I am using a filesystem
resource as just an example if you wish). Here is my filesystem
resource description:

node foo1
node foo2 \
        attributes standby="off"
primitive OST1 ocf:heartbeat:Filesystem \
        meta target-role="Started" \
        operations $id="BAR1-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params device="/dev/disk/by-uuid/8c500092-5de6-43d7-b59a-ef91fa9667b9" \
        directory="/mnt/bar1" fstype="ext3"
primitive st-pm stonith:external/powerman \
        params serverhost="192.168.122.1:10101" poweroff="0"
clone fencing st-pm
property $id="cib-bootstrap-options" \
        dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="1" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1306783242" \
        default-resource-stickiness="1000"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

The two problems I have run into are:

1. preventing the resource from failing back to the node it was
previously on after it has failed over and the previous node has
been restored. Basically what's documented at

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/ch05s03s02.html

2. preventing the active node from being STONITHed when the resource
is moved back to its failed-and-restored node after a failover.
IOW: BAR1 is available on foo1, which fails and the resource is moved
to foo2. foo1 returns and the resource is failed back to foo1, but
in doing that foo2 is STONITHed.

For #1, as you can see, I tried setting the default resource stickiness
to 100. That didn't seem to work. When I stopped corosync on the
active node, the service failed over but it promptly failed back when I
started corosync again, contrary to the example on the referenced URL.

Subsequently I (think I) tried adding the specific resource stickiness
of 1000. That didn't seem to help either.


I had the same question some time ago; see here for it and Andrew's response:

http://www.gossamer-threads.com/lists/linuxha/pacemaker/59471

So basically you should check the current score and then increase the stickiness above that. Though I'm surprised that 1000 does not seem to help.
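
To make the "raise the stickiness above the current score" advice concrete, here is a minimal sketch in Python. It is a toy model, not Pacemaker's policy engine: it assumes a resource stays where it is only if the score of its current node plus the stickiness is strictly higher than the score of the other node. The function names and the example scores are hypothetical.

```python
def stays_put(current_node_score: int, other_node_score: int, stickiness: int) -> bool:
    """Toy model (assumption): the resource stays on its current node only
    if that node's score plus stickiness strictly beats the other node."""
    return current_node_score + stickiness > other_node_score


def minimum_stickiness(current_node_score: int, other_node_score: int) -> int:
    """Smallest non-negative stickiness that keeps the resource in place
    under the toy model above."""
    gap = other_node_score - current_node_score
    return max(0, gap + 1)


# Hypothetical scores: the restored node is preferred by 500 points.
print(stays_put(0, 500, 100))      # False: 100 is not enough to prevent failback
print(stays_put(0, 500, 1000))     # True: 1000 outweighs the preference
print(minimum_stickiness(0, 500))  # 501
```

In this simplified picture, a stickiness of 100 loses against any preference for the other node that is 100 or higher, which is why the actual scores need to be checked first.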


As for #2, the issue with STONITHing foo2 when failing back to foo1 is
that foo1 and foo2 are an active/active pair of servers. STONITHing
foo2 just to restore foo1's services puts foo2's services out of service.

I do want a node that is believed to be dead to be STONITHed before its
resource(s) are failed over though.

Any hints on what I am doing wrong?

Basically, a STONITH will only happen if the stop action fails, as Pacemaker then does not know whether the resource really stopped. To bring the system back into a known state, it simply kills the node that fails to stop a resource. There was also a stop bug in several Pacemaker releases; I don't remember any more whether it was already fixed in 1.0.10 or whether I simply backported the patches (right now I'm not doing anything with Pacemaker anymore...).

You will need to check your logs to find out why it does so.

As you might have noticed, Pacemaker logs often also contain quite a few debug messages (reminds me of Lustre ;) ), and for DDN systems I therefore set up rather complex syslog-ng filter rules. One of the last things I did for DDN was to send several ha-logd patches upstream, which then also got integrated, so log filtering should be easier now. Additionally, the Lustre server RA that I then wrote, based on a stripped-down Filesystem RA, also logs a lot more about what actually fails. So if you use that one (an older version is in lustre-2.0, I think), update the type to ext3 and remove the lustre_health check, you should get a better idea of what is actually going on. Somewhere on my desktop system at home I should also still have the syslog-ng rules.
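
Until the syslog-ng rules turn up, a rough stand-in for that kind of filtering could look like the Python sketch below. It is not the syslog-ng setup mentioned above. It assumes that debug lines can be recognised by a marker substring such as "debug" and that the interesting lines mention the resource name or STONITH; both are assumptions about the log format, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Sequence


def interesting_lines(
    lines: Iterable[str],
    terms: Sequence[str],
    drop_markers: Sequence[str] = ("debug",),
) -> Iterator[str]:
    """Yield log lines that mention any of `terms` (case-insensitive),
    skipping lines that contain any of `drop_markers`.

    The marker and term matching is an assumption about the log format,
    not something taken from Pacemaker itself.
    """
    for line in lines:
        lowered = line.lower()
        # Drop noisy lines first (assumed to carry a marker such as "debug").
        if any(marker.lower() in lowered for marker in drop_markers):
            continue
        # Keep lines about the resource or about fencing.
        if any(term.lower() in lowered for term in terms):
            yield line.rstrip("\n")
```

For example, passing an open log file together with terms=("OST1", "stonith") would leave only the non-debug lines about the filesystem resource and about fencing; the location of the log file depends on the local syslog configuration.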



Cheers,
Bernd

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
