Hi all,
I'm starting a new thread because I've got more debug details for analyzing my
situation, and starting from the beginning might be better.

My environment consists of two machines, connected both to a network and
directly to each other. The cluster runs a lot of virtual machines, each one
based upon a dual-primary drbd. The two systems are Debian Squeeze with
backports:

kernel 2.6.39-3
drbd 8.3.10-1
corosync 1.3.0-3
pacemaker 1.0.11-1
libvirt-bin 0.9.2-7

The (dual-primary) drbd resources are declared in this way:

primitive vm-1_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="20s" role="Master" timeout="20s" \
        op monitor interval="30s" role="Slave" timeout="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"

ms vm-1_ms-r0 vm-1_r0 \
        meta notify="true" master-max="2" clone-max="2" interleave="true"
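
For context, r0 is of course configured for dual-primary on the drbd side. The
relevant part of the drbd config looks more or less like this (a trimmed
sketch; the after-sb policies shown are the usual ones and may not match mine
exactly):

resource r0 {
        net {
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
        }
        ...
}

allow-two-primaries is what permits both nodes to hold the device Primary at
the same time, which live migration needs.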

and the virtual machines are defined like this:

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
        migration_transport="ssh" force_stop="true" \
        meta allow-migrate="true" \
        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
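
Note that I define no migrate_to/migrate_from operations, so if I understand
correctly their timeout falls back to the cluster-wide default-action-timeout
(20 seconds unless overridden) rather than the 120s I give to start and stop;
this becomes relevant below. Making them explicit would look something like
this (just a sketch, not what I currently run):

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        params ... \
        meta allow-migrate="true" \
        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
        op migrate_to interval="0" timeout="120s" \
        op migrate_from interval="0" timeout="120s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"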

There are colocation and order constraints for each vm:

colocation vm-1_ON_vm-1_ms-r0 inf: vm-1_virtualdomain vm-1_ms-r0:Master
order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1_virtualdomain:start

And there is a location constraint based on connectivity:

location vm-1_ON_CONNECTED_NODE vm-1_virtualdomain \
        rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping lte 0
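
The ping attribute tested by the rule comes from the usual connectivity clone,
something along these lines (the host_list value here is a placeholder; what
matters is that name matches the attribute used in the rule):

primitive ping ocf:pacemaker:ping \
        params name="ping" host_list="192.168.1.254" multiplier="100" \
        op monitor interval="15s" timeout="20s"

clone ping_clone ping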

The problem is that every night I have a scheduled live migration of a vm, and
when it fails the node gets fenced, even though the on-fail parameter of the
vm is set to "restart".
Everything starts at 23:00:

Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource -M -r vm-1
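
That invocation comes from my nightly cron entry which, stripped to its
essence, does something like the following (the path and schedule shown are
illustrative):

0 23 * * * root /usr/sbin/crm_resource -M -r vm-1

Keep in mind that crm_resource -M leaves a location constraint behind until
crm_resource -U -r vm-1 is issued.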

One second later, the first problem appears:

Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148, its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0] CRM_meta_timeout=[30000] force_stop=[true] config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] migration_transport=[ssh] CRM_meta_interval=[10000]  cancelled

Why is this operation marked as cancelled? Anyway, after 20 seconds, the
operation fails with "Timed Out":

Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=20000ms)

A forced shutdown is invoked:

Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced shutdown (destroy) request for domain vm-1.

and even though the vm appears to be destroyed (the kernel messages confirm
that the vmnet devices were destroyed), the RA seems to ignore it:

Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Requested operation is not valid: domain is not running
Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445, confirmed=true) unknown error
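
If it helps, those stderr lines look like plain virsh output: destroying a
domain that is already gone should produce exactly this complaint. I am
inferring the command the RA runs here, but by hand it would be:

# on a node where vm-1 is not running
virsh --connect qemu:///system destroy vm-1
error: Failed to destroy domain vm-1
error: Requested operation is not valid: domain is not running

So the domain really is down; the stop seems to be reported as failed (rc=1)
only because there was nothing left to destroy.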

In the meantime, on the other node, the errors are detected:

Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Migrating vm-1_virtualdomain from node-2 to node-1
Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (1000000)
...
...
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2 will be fenced to recover from resource failure(s)

a STONITH is invoked...

Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid: 2314] requests a STONITH operation RESET on node node-2

...with success:

Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the node node-2: optype=RESET. whodoit: node-1

My conclusions are:

1 - the fence has nothing to do with drbd (there is no mention of it until
the reset is done);

2 - for some reason, live migrating the vms SOMETIMES fails, even though once
the system has recovered I can do a crm resource move vm-1 without ANY problem;

3 - even when the vm fails to stop, the cluster does not try to restart it,
but simply fences the node, and this is not what the on-fail parameter is
meant to do.
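
Or does on-fail have to be set on each operation individually? Mine is only
on the monitor op, and if I remember correctly a failed stop escalates to
fencing by default whenever STONITH is enabled, so to change that behaviour I
would presumably need something like:

        op stop interval="0" timeout="120s" on-fail="block"

(though "block" would leave the resource unmanaged, which is not great either;
at least the node would not be reset).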

Does anyone have suggestions on how to debug this problem further?
Please help!

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
