>>> Dejan Muhamedagic <deja...@fastmail.fm> schrieb am 08.10.2015 um 16:13 in
Nachricht <20151008141357.GB15084@tuttle.linbit>:
> Hi,
> 
> On Thu, Oct 08, 2015 at 02:29:08PM +0200, Ulrich Windl wrote:
>> Hi!
>> 
>> I'd like to report an "interesting problem" with SLES11 SP3+HAE (latest 
> updates):
>> 
>> When doing "rcopenais stop" on node "h10" with three Xen-VMs running, the 
> cluster tried to migrate those VMs to other nodes (OK).
>> 
>> However migration failed on the remote nodes, but the cluster thought 
> migration was successful. Later the cluster restarted the VMs (BAD).
>> 
>> Oct  8 13:19:17 h10 Xen(prm_xen_v07)[16537]: INFO: v07: xm migrate to h01 
> succeeded.
>> Oct  8 13:20:38 h01 Xen(prm_xen_v07)[9027]: ERROR: v07: Not active locally, 
> migration failed!
> 
> xm did report success in migrate_to, but the overall migration
> should've been considered failed, because migrate_from failed. Is
> your timeout too low? The failure msg is logged 81 seconds
> later, provided the clocks are in sync.

First, the timeout is on the order of 5 minutes, and the clocks are "very much 
in sync" (TM) ;-)
The reason is that Xen failed to unpause the VM. My guess is that the node 
where the (paravirtualized) VM started has a somewhat newer CPU than the 
target node, and that is what causes the migration to fail.
In an ideal world Xen wouldn't even start a migration if the CPU on the 
target node cannot run the VM. In a less perfect world this error should at 
least be detected after the failure.
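A pre-flight check along those lines could be scripted by hand. The sketch below is purely illustrative (not part of Xen or the cluster stack): it compares CPU feature flags, which on real hosts would be read from /proc/cpuinfo on source and target; the helper name and flag lists are made up for the example.

```shell
#!/bin/sh
# Hypothetical sketch: before migrating a paravirtualized guest, verify
# that the target host supports every CPU feature flag of the source.
# check_flags and the flag lists are illustrative; real flag strings
# would come from `grep -m1 '^flags' /proc/cpuinfo` on each host.
check_flags() {
    src="$1"; tgt="$2"; missing=""
    for f in $src; do
        case " $tgt " in
            *" $f "*) ;;                    # flag present on target
            *) missing="$missing $f" ;;     # flag missing on target
        esac
    done
    echo "$missing"
}

# Example: a target CPU lacking avx2 is reported before migration starts.
check_flags "fpu sse2 avx avx2" "fpu sse2 avx"
```

If the output is non-empty, the migration is refused up front instead of failing with "Not active locally" on the target.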


> 
>> Oct  8 13:44:53 h01 pengine[18985]:  warning: unpack_rsc_op_failure: 
> Processing failed op migrate_from for prm_xen_v07 on h01: unknown error (1)
>> 
>> Things are really bad after h10 was rebooted eventually: The cluster 
> restarted the three VMs again, because it thought those VMs were still 
> running on h10! (VERY BAD)
>> During startup, the cluster did not probe the three VMs.
> 
> If a node restarted, how could anything think that there was
> anything there still running? Strange.

Well, basically, starting the cluster on a node does not mean that the OS on 
the node has just been rebooted, so resources might have been messed with 
outside the cluster (here they were not, but they could have been). Thus 
probing on node startup seems like a good idea.
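As a hedged aside: on a SLES11 HAE setup, assuming crmsh is in use, such a probe can also be forced by hand once the node has rejoined (command shown for illustration; the node name is the one from this thread):

```shell
# Force the cluster to re-probe all resources on the rejoined node,
# so it rediscovers their real state instead of trusting stale records
# (crmsh syntax of that era).
crm resource reprobe h10
```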

> 
> But anyway, the if the migrate_from fails, then the resource
> should still be running at the origin host, right?

No, because it wasn't running where the cluster thought it was running. So 
after a failed migration the VM isn't running at all (whereas it was before 
the migration).

So actually we have two problems:
1: Xen migration failure is not detected in time by the cluster.
2: The cluster mixes up nodes and node configurations (this problem has had an 
SR open at SUSE for at least six months, but nobody (me included) seems to 
know what's wrong). I'd bet that it's a very obscure bug in the cluster 
communication layer...

Regards,
Ulrich



_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
