Thanks for the reply. After further unsuccessful testing of the
automatic recovery I read this article:
http://clusterlabs.org/doc/crm_fencing.html
It recommends monitoring the fencing device only once every few hours.
I am happy with that, so I configured the monitor interval at
9600 seconds (about 2.5 hours).
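For reference, the resulting primitive would look roughly like this (a sketch of the same resource as in the quoted mail below, just with the longer interval; host names and URI are from my test setup):

```shell
# Fence device monitored only every 9600 seconds, per the crm_fencing
# recommendation to check the fencing device infrequently.
crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    op monitor interval="9600s"
```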
Cheers
Michael
On 08.01.2016 16:30, Ken Gaillot wrote:
On 01/08/2016 08:56 AM, m...@inwx.de wrote:
Hello List,
I have a test environment here for evaluating Pacemaker. Sometimes our
KVM hosts with libvirt have trouble responding to the stonith/libvirt
resource, so I would like to configure the resource to be treated as
failed only after three failed monitor attempts. I searched for a
suitable configuration here:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html
But after hours of trying I had no success.
That's the configuration line for stonith/libvirt:
crm configure primitive p_fence_ha3 stonith:external/libvirt params
hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" op monitor
interval="60"
Every 60 seconds Pacemaker runs something like this:
stonith -t external/libvirt hostlist="ha3"
hypervisor_uri="qemu+tls://debian1/system" -S
ok
To simulate unavailability of the KVM host, I remove the certificate
from /etc/libvirt/libvirtd.conf and restart libvirtd. Within 60 seconds
I can see the error in "crm status". On the KVM host I then add the
certificate back to /etc/libvirt/libvirtd.conf and restart libvirtd.
Although libvirt is available again, the stonith resource does not
start again.
I altered the configuration line for stonith/libvirt with the following variants:
op monitor interval="60" pcmk_status_retries="3"
op monitor interval="60" pcmk_monitor_retries="3"
op monitor interval="60" start-delay=180
meta migration-threshold="200" failure-timeout="120"
But in every case, after the first monitor failure (within 60 seconds
or so), Pacemaker did not resume the stonith/libvirt resource once
libvirt was available again.
Is there enough time left in the timeout for the cluster to retry? (The
interval is not the same as the timeout.) Check your pacemaker.log for
messages like "Attempted to execute agent ... the maximum number of
times (...) allowed". That will tell you whether it is retrying.
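Note also that pcmk_monitor_retries (default 2) is a special instance
attribute of fence devices, not an operation attribute, so it belongs
under params rather than inside the monitor op. A sketch of what I mean,
reusing your resource definition (the timeout value is just an example):

```shell
# pcmk_monitor_retries goes in params; give the monitor op enough
# timeout to cover the retries.
crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
           pcmk_monitor_retries="3" \
    op monitor interval="60s" timeout="120s"
```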
You definitely don't want start-delay, and migration-threshold doesn't
really mean much for fence devices.
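Also keep in mind that without a failure-timeout, a failed resource
generally has to be cleaned up manually before the cluster will try to
start it again, e.g.:

```shell
# Clear the fail count and failed-operation history for the fence
# resource so the cluster re-evaluates starting it (crm shell syntax).
crm resource cleanup p_fence_ha3
```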
Of course, you also want to fix the underlying problem of libvirt not
being responsive. That doesn't sound like something that should
routinely happen.
BTW I haven't used stonith/external agents (which rely on the
cluster-glue package) myself. I use the fence_virtd daemon on the host
with fence_xvm as the configured fence agent.
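For comparison, a fence_xvm device on the cluster side typically looks
something like this (a sketch; it assumes fence_virtd is already
configured and listening on the host, and that "ha3" is the libvirt
domain name):

```shell
# fence_xvm talks to fence_virtd on the host; "port" is the guest's
# libvirt domain name.
crm configure primitive p_fence_ha3_xvm stonith:fence_xvm \
    params port="ha3" pcmk_host_list="ha3" \
    op monitor interval="60s"
```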
Here is the "crm status" output on Debian 8 (Jessie):
root@ha4:~# crm status
Last updated: Tue Jan 5 10:04:18 2016
Last change: Mon Jan 4 18:18:12 2016
Stack: corosync
Current DC: ha3 (167772400) - partition with quorum
Version: 1.1.12-561c4cf
2 Nodes configured
2 Resources configured
Online: [ ha3 ha4 ]
Service-IP (ocf::heartbeat:IPaddr2): Started ha3
haproxy (lsb:haproxy): Started ha3
p_fence_ha3 (stonith:external/libvirt): Started ha4
Kind regards
Michael R.
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org