Thanks for the reply. After further unsuccessful testing of the automatic recovery, I read this article:

 http://clusterlabs.org/doc/crm_fencing.html

It recommends monitoring the fencing device only once every few hours.

I am happy with that, so I configured the monitoring interval to 9600 seconds (roughly 2 hours 40 minutes).
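
For the record, the primitive definition is unchanged apart from the longer monitor interval, i.e. roughly:

  crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
    op monitor interval="9600"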

Cheers

Michael

On 08.01.2016 16:30, Ken Gaillot wrote:
On 01/08/2016 08:56 AM, m...@inwx.de wrote:
Hello List,

I have a test environment here for evaluating Pacemaker. Sometimes our
KVM hosts with libvirt have trouble responding to the stonith/libvirt
resource, so I would like to configure the resource so that it is only
considered failed after three failed monitoring attempts. I searched
for such a configuration here:


http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/index.html


But even after hours of trying I did not find one.

That's the configuration line for stonith/libvirt:

crm configure primitive p_fence_ha3 stonith:external/libvirt \
  params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
  op monitor interval="60"

Every 60 seconds, Pacemaker runs something like this:

  stonith -t external/libvirt hostlist="ha3" \
    hypervisor_uri="qemu+tls://debian1/system" -S
  ok

To simulate unavailability of the KVM host, I remove the certificate
entry in /etc/libvirt/libvirtd.conf and restart libvirtd. After 60
seconds or less I can see the error with "crm status". Then I add the
certificate entry back to /etc/libvirt/libvirtd.conf on the KVM host and
restart libvirtd again. Although libvirt is available again, the stonith
resource does not start again.
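
Presumably I could bring it back by hand with something like

  crm resource cleanup p_fence_ha3

but what I am after is an automatic recovery.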

I altered the configuration line for stonith/libvirt with the following parts:

  op monitor interval="60" pcmk_status_retries="3"
  op monitor interval="60" pcmk_monitor_retries="3"
  op monitor interval="60" start-delay=180
  meta migration-threshold="200" failure-timeout="120"

But in every case, after the first failed monitor check (within 60
seconds or less), Pacemaker did not resume the stonith/libvirt resource
once libvirt was available again.
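
For illustration, a combined version of those attempts would look
roughly like this (untested; pcmk_monitor_retries moved into params,
since it seems to be a device parameter rather than an op attribute):

  crm configure primitive p_fence_ha3 stonith:external/libvirt \
    params hostlist="ha3" hypervisor_uri="qemu+tls://debian1/system" \
      pcmk_monitor_retries="3" \
    op monitor interval="60" \
    meta failure-timeout="120"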

Is there enough time left in the timeout for the cluster to retry? (The
interval is not the same as the timeout.) Check your pacemaker.log for
messages like "Attempted to execute agent ... the maximum number of
times (...) allowed". That will tell you whether it is retrying.
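
For example, giving the monitor op an explicit timeout (the exact value
here is just a guess) would leave room for the retries:

  op monitor interval="60" timeout="180"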

You definitely don't want start-delay, and migration-threshold doesn't
really mean much for fence devices.

Of course, you also want to fix the underlying problem of libvirt not
being responsive. That doesn't sound like something that should
routinely happen.

BTW I haven't used stonith/external agents (which rely on the
cluster-glue package) myself. I use the fence_virtd daemon on the host
with fence_xvm as the configured fence agent.
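
If you want to try that route, the guest-side primitive would look
something like this (fence_virtd on the host has to be set up
separately, and the exact parameters depend on your fence_virtd
configuration):

  crm configure primitive p_fence_ha3 stonith:fence_xvm \
    params port="ha3" pcmk_host_list="ha3" \
    op monitor interval="60"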

Here is the "crm status" output on Debian 8 (Jessie):

  root@ha4:~# crm status
  Last updated: Tue Jan  5 10:04:18 2016
  Last change: Mon Jan  4 18:18:12 2016
  Stack: corosync
  Current DC: ha3 (167772400) - partition with quorum
  Version: 1.1.12-561c4cf
  2 Nodes configured
  2 Resources configured
  Online: [ ha3 ha4 ]
  Service-IP     (ocf::heartbeat:IPaddr2):       Started ha3
  haproxy        (lsb:haproxy):  Started ha3
  p_fence_ha3    (stonith:external/libvirt):     Started ha4

Kind regards

Michael R.


_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org

