Hi all, I'm starting a new thread because I've got more debug details to analyze my situation, and starting from the beginning might be better.
My environment consists of two machines, each connected to the network and to each other with a direct link. The cluster runs a lot of virtual machines, each one based upon a dual-primary DRBD device. The two systems are Debian Squeeze with backports:

  kernel 2.6.39-3
  drbd 8.3.10-1
  corosync 1.3.0-3
  pacemaker 1.0.11-1
  libvirt-bin 0.9.2-7

The (dual-primary) DRBD resources are declared in this way:

  primitive vm-1_r0 ocf:linbit:drbd \
      params drbd_resource="r0" \
      op monitor interval="20s" role="Master" timeout="20s" \
      op monitor interval="30s" role="Slave" timeout="20s" \
      op start interval="0" timeout="240s" \
      op stop interval="0" timeout="100s"
  ms vm-1_ms-r0 vm-1_r0 \
      meta notify="true" master-max="2" clone-max="2" interleave="true"

and the virtual machines are like this:

  primitive vm-1_virtualdomain ocf:heartbeat:VirtualDomain \
      params config="/etc/libvirt/qemu/vm-1.xml" hypervisor="qemu:///system" \
          migration_transport="ssh" force_stop="true" \
      meta allow-migrate="true" \
      op monitor interval="10s" timeout="30s" on-fail="restart" depth="0" \
      op start interval="0" timeout="120s" \
      op stop interval="0" timeout="120s"

There are colocation and order constraints for each vm:

  colocation vm-1_ON_vm-1_ms-r0 inf: vm-1 vm-1_ms-r0:Master
  order vm-1_AFTER_vm-1_ms-r0 inf: vm-1_ms-r0:promote vm-1:start

and there is a location constraint for connectivity:

  location vm-1_ON_CONNECTED_NODE vm-1 \
      rule $id="vm-1_ON_CONNECTED_NODE-rule" -inf: not_defined ping or ping lte 0

The problem: every night I have a scheduled live migration of a vm, and when this fails the node gets fenced, even though the on-fail parameter of the vm's monitor operation is set to "restart".
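One detail I notice while re-reading my own configuration (this is a guess on my part, not something I've verified yet): the VirtualDomain primitive defines explicit timeouts for monitor, start and stop, but none for the migration operations, so migrate_to presumably falls back to the cluster-wide default action timeout of 20 seconds — which would match the 20000ms timeout in the logs below. A sketch of the extra operation lines I'm thinking of adding to the primitive (the 240s value is an arbitrary choice of mine):

```
  op migrate_to interval="0" timeout="240s" \
  op migrate_from interval="0" timeout="240s"
```

If someone can confirm that this is really where the 20000ms comes from, that would already help a lot.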
Everything starts at 23:00:

  Sep 19 23:00:01 node-2 crm_resource: [8947]: info: Invoked: crm_resource -M -r vm-1

Two seconds later, the first problem:

  Sep 19 23:00:02 node-2 lrmd: [2145]: info: cancel_op: operation monitor[171] on ocf::VirtualDomain::vm-1_virtualdomain for client 2148, its parameters: hypervisor=[qemu:///system] CRM_meta_depth=[0] CRM_meta_timeout=[30000] force_stop=[true] config=[/etc/libvirt/qemu/vm-1.lan.mmul.local.xml] depth=[0] crm_feature_set=[3.0.1] CRM_meta_on_fail=[restart] CRM_meta_name=[monitor] migration_transport=[ssh] CRM_meta_interval=[10000] cancelled

Why is this operation marked as cancelled? Anyway, twenty seconds later the migration operation fails with "Timed Out":

  Sep 19 23:00:22 node-2 crmd: [2148]: ERROR: process_lrm_event: LRM operation vm-1_virtualdomain_migrate_to_0 (236) Timed Out (timeout=20000ms)

A forced shutdown is invoked:

  Sep 19 23:00:22 node-2 VirtualDomain[9256]: INFO: Issuing forced shutdown (destroy) request for domain vm-1.

and even though the vm appears to be destroyed (the kernel messages confirm that the vmnet devices were removed), the RA seems to ignore it:

  Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Failed to destroy domain vm-1
  Sep 19 23:00:22 node-2 lrmd: [2145]: info: RA output: (vm-1_virtualdomain:stop:stderr) error: Requested operation is not valid: domain is not running
  Sep 19 23:00:22 node-2 crmd: [2148]: info: process_lrm_event: LRM operation vm-1_virtualdomain_stop_0 (call=237, rc=1, cib-update=445, confirmed=true) unknown error

In the meantime, on the other node, some errors are detected:

  Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Migrating vm-1_virtualdomain from node-2 to node-1
  Sep 19 23:00:01 node-1 pengine: [2313]: info: complex_migrate_reload: Repairing vm-1_ON_vm-1_ms-r5: vm-1_virtualdomain == vm-1_ms-r5 (1000000)
  ...
  ...
  Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_monitor_0 on node-2: unknown exec error (-2)
  Sep 19 23:00:23 node-1 pengine: [2313]: WARN: unpack_rsc_op: Processing failed op vm-1_virtualdomain_stop_0 on node-2: unknown error (1)
  Sep 19 23:00:23 node-1 pengine: [2313]: WARN: pe_fence_node: Node node-2 will be fenced to recover from resource failure(s)

A STONITH is invoked...

  Sep 19 23:00:23 node-1 stonithd: [2309]: info: client tengine [pid: 2314] requests a STONITH operation RESET on node node-2

...with success:

  Sep 19 23:00:24 node-1 stonithd: [2309]: info: Succeeded to STONITH the node node-2: optype=RESET. whodoit: node-1

My conclusions are:

1 - The fencing has nothing to do with DRBD (there is no mention of it until the reset is done).
2 - For some reason, live migrating the vms SOMETIMES fails, even though once the system has recovered I can do a "crm resource move vm-1" without ANY problem.
3 - Even when the vm fails to stop, the cluster does not try to restart it but simply fences the node, and this is not what the on-fail parameter is meant to do.

Does someone have suggestions on how to debug this problem further? Please help!

Thanks a lot,

--
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
ra...@miamammausalinux.org
http://www.miamammausalinux.org
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
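About my point 3: after writing this I went back to the documentation, and if I read it correctly, on-fail="restart" only applies to the operation it is set on (the monitor, in my case), while a failed *stop* operation escalates to fencing by default whenever STONITH is enabled — which would explain exactly what I'm seeing. In theory I could override that behaviour on the stop operation itself, something like this sketch (on-fail="block" should leave the resource unmanaged instead of fencing the node, which is probably not what I really want, but it would at least confirm the mechanism):

```
  op stop interval="0" timeout="120s" on-fail="block"
```

Can anyone confirm that the failed stop, and not the failed migration, is what actually triggers the STONITH here?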