On Tue, Sep 20, 2011 at 5:59 PM, RaSca <ra...@miamammausalinux.org> wrote:
> Hi all,
> I start a new thread because I've got more debug details to analyze my
> situation, and starting from the beginning might be better.
>
> My environment is composed of two machines connected to a network and to
> each other. The cluster runs a lot of virtual machines, each one
> based upon a dual-primary DRBD resource. The two systems are Debian Squeeze
> with backports:
>
> kernel 2.6.39-3
> drbd 8.3.10-1
> corosync 1.3.0-3
> pacemaker 1.0.11-1
> libvirt-bin 0.9.2-7
>
> The (dual-primary) drbd resources are declared in this way:
>
> primitive vm-1_r0 ocf:linbit:drbd \
>        params drbd_resource="r0" \
>        op monitor interval="20s" role="Master" timeout="20s" \
>        op monitor interval="30s" role="Slave" timeout="20s" \
>        op start interval="0" timeout="240s" \
>        op stop interval="0" timeout="100s"
>
> ms vm-1_ms-r0 vm-1_r0 \
>        meta notify="true" master-max="2" clone-max="2" interleave="true"
>
> and the virtual machines are declared like this:
>
> primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
>        params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" migration_transport="ssh" force_stop="true" \
>        meta allow-migrate="true" \
>        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
>        op start interval="0" timeout="120s" \
>        op stop interval="0" timeout="120s"
>
> There are colocation and order for each vm:
>
> colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
> order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start
>
> And there is a location constraint for the connectivity:
>
> location vm-1_ON_CONNECTED_NODE vm-1 \
>        rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping lte 0
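
As an aside, a rule like this assumes a ping clone is defined elsewhere that
maintains a node attribute named "ping". A minimal sketch of such a clone
(the primitive/clone names and the host_list address below are only
placeholders; name="ping" is what makes the RA write the attribute the rule
tests) would be:

primitive p_ping ocf:pacemaker:ping \
        params name="ping" host_list="192.168.1.1" multiplier="1000" \
        op monitor interval="15s" timeout="60s"
clone cl_ping p_ping \
        meta globally-unique="false"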
>
> The problem is that every night a live migration of a vm is scheduled,
> but if it fails, the node gets fenced, even though the on-fail
> parameter of the vm is set to "restart".
> Everything starts at 23:00:
>
> Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource
> -M -r vm-1
>
> Two seconds later the first problem:
>
> Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation
> monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148,
> its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0]
> CRM_meta_timeout=[30000] force_stop=[true]
> config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0]
> crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor]
> migration_transport=[ssh] CRM_meta_interval=[10000]  cancelled
>
> Why is this operation marked as cancelled?

Hard to tell from just one log message.
My guess though, since it's a recurring operation, is that we're about
to run stop or migrate_from for the resource - before which we cancel
all recurring monitor ops.

> Anyway, after 22 seconds, the
> operation fails with "Timed Out":
>
> Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM
> operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=20000ms)

No, this is a completely independent operation from the one being cancelled.
Is 20s enough time to migrate the VM to another machine?
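
If it is not, one option (an untested sketch based on the configuration
quoted above; the 240s value is only a placeholder to be tuned to how long a
migration really takes) is to declare explicit migrate_to/migrate_from
operations with a larger timeout, so the migration no longer runs with the
20000ms timeout seen in the log:

primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" migration_transport="ssh" force_stop="true" \
        meta allow-migrate="true" \
        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
        op migrate_to interval="0" timeout="240s" \
        op migrate_from interval="0" timeout="240s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"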

> Force shutdown is invoked:
>
> Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced
> shutdown (destroy) request for domain vm-1.
>
> and even though the vm appears to be destroyed (the kernel messages confirm
> that the vmnet devices were destroyed), the RA seems to ignore it:
>
> Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
> (vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
> Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output:
> (vm-1_virtualdomain:stop:stderr) error: Requested operation is not
> valid: domain is not running
> Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM
> operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445,
> confirmed=true) unknown error

The RA isn't ignoring it, it's reporting that state as an error instead
of OCF_NOT_RUNNING, which would probably be more appropriate.
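
For illustration only (this is not the shipped VirtualDomain code, and
$hypervisor/$domain are stand-ins), the convention for an OCF stop action is
roughly this: if the domain is already gone, report that rather than a
generic error:

# sketch of the convention, not the actual resource agent;
# OCF_SUCCESS/OCF_ERR_GENERIC come from ocf-shellfuncs
stop_domain() {
        if ! virsh --connect "$hypervisor" domstate "$domain" 2>/dev/null | grep -q running; then
                # domain already down: returning success here (or
                # OCF_NOT_RUNNING, as suggested above) keeps the stop
                # from being counted as failed
                return $OCF_SUCCESS
        fi
        virsh --connect "$hypervisor" destroy "$domain" || return $OCF_ERR_GENERIC
        return $OCF_SUCCESS
}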

>
> Meanwhile, on the other node, these errors are detected:
>
> Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
> Migrating vm-1_virtualdomain from node-2 to node-1
> Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload:
> Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (1000000)
> ...
> ...
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
> failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing
> failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
> Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2
> will be fenced to recover from resource failure(s)

Right, so stop failed too... hence the fencing.

>
> a STONITH is invoked...
>
> Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid:
> 2314] requests a STONITH operation RESET on node node-2
>
> ...with success:
>
> Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the
> node node-2: optype=RESET. whodoit: node-1
>
> My conclusions are:
>
> 1 - the fence has nothing to do with drbd (there is no mention of it
> until the reset is done);
>
> 2 - for some reason live migrating the vms SOMETIMES fails, even though once
> the system has recovered I can do a crm resource move vm-1 without ANY problem.
>
> 3 - Even if the vm fails to stop, the cluster does not try to restart it
> but simply fences the node, which is not what the on-fail parameter is
> meant to do.

/stop/ failed; your on-fail setting only applies to the /monitor/ operation.
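
on-fail is set per operation, so if you want different behaviour you have to
specify it on each op explicitly. A hedged sketch of the relevant fragment of
the primitive (values are only placeholders):

        op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
        op stop interval="0" timeout="120s" on-fail="fence"

Note that a failed stop defaults to on-fail="fence" when STONITH is enabled,
which is exactly what you saw, and with shared state like dual-primary DRBD
that default exists for a reason.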

>
> Does anyone have suggestions on how to debug this problem further?
> Please help!
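
A good starting point would be a report covering the time window around the
failure, assuming hb_report (from cluster-glue) is available; adjust the
timestamps and destination to taste:

hb_report -f "2011-09-19 22:55" -t "2011-09-19 23:05" /tmp/vm-1-migration-failure

You can also exercise the agent outside the cluster with ocf-tester (from
resource-agents); parameter values below are taken from the configuration
quoted above. Be aware that ocf-tester really runs start/stop/monitor, so use
a test domain:

ocf-tester -n vm-1_virtualdomain \
        -o config="/etc/libvirt/qemu/vm-1.xml" \
        -o hypervisor="qemu:///system" \
        -o migration_transport="ssh" \
        /usr/lib/ocf/resource.d/heartbeat/VirtualDomain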
>
> Thanks a lot,
>
> --
> RaSca
> Mia Mamma Usa Linux: Nothing is impossible to understand, if you explain it well!
> ra...@miamammausalinux.org
> http://www.miamammausalinux.org
>
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems