On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> I had similar issues at least two or three times. The host agent would 
> disconnect from the management server. The agent would not connect back to 
> the management server without manual intervention, however, it would happily 
> continue running the vms. The management server would initiate the HA and 
> fire up vms, which are already running on the disconnected host. I ended up 
> with a handful of vms and virtual routers being run on two hypervisors, thus 
> corrupting the disk and having all sorts of issues ((( . 
> 
> I think there has to be a better way of dealing with this case. At least on 
> an image level. Perhaps a host should keep some sort of lock file or a file 
> for every image where it would record a time stamp. Something like: 
> 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and 
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp 
> 
> Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image 
> and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp. 
> 
> The hypervisor should record the time stamp in this file while the vm is 
> running. Let's say every 5-10 seconds. If the timestamp is old, we can assume 
> that the volume is no longer used by the hypervisor. 
> 
> When a vm is started, the timestamp file should be checked and if the 
> timestamp is recent, the vm should not start, otherwise, the vm should start 
> and the timestamp file should be regularly updated. 
> 
> I am sure there are better ways of doing this, but at least this method 
> should not allow two vms running on different hosts to use the same volume 
> and corrupt the data. 
> 
> In ceph, as far as I remember, a new feature is being developed to provide a 
> locking mechanism of an rbd image. Not sure if this will do the job? 
>

Something like what you propose is still on my wishlist for Ceph/RBD.

For NFS we currently have this in place, but for Ceph/RBD we don't. It's
a matter of code in the Agent and the investigators inside the
Management Server which decide if HA should kick in.
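
The timestamp-file idea quoted above could be sketched roughly as follows.
This is only an illustration, not CloudStack Agent code: the function names,
the "-timestamp" suffix handling and the 30 second staleness threshold are
all assumptions made for the example, and it presumes the image sits on a
shared filesystem with reasonably synchronised host clocks.

```python
import os
import time

# Assumed threshold: a few missed 5-10 s heartbeats before a volume counts as unused.
STALE_AFTER = 30.0


def timestamp_path(image_path):
    """Return the heartbeat file that sits next to the disk image."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record the current time; meant to be called every 5-10 s while the VM runs."""
    now = time.time() if now is None else now
    tmp = timestamp_path(image_path) + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so readers never see a half-written timestamp.
    os.replace(tmp, timestamp_path(image_path))


def read_heartbeat(image_path):
    """Return the last recorded timestamp, or None if missing or unreadable."""
    try:
        with open(timestamp_path(image_path)) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def may_start_vm(image_path, now=None, stale_after=STALE_AFTER):
    """Allow a start only if no other host has written a recent heartbeat."""
    now = time.time() if now is None else now
    last = read_heartbeat(image_path)
    return last is None or (now - last) > stale_after
```

Note that check-then-start like this is not atomic: two hosts could both see
a stale timestamp and start the VM at the same moment, so it narrows the
window for corruption rather than acting as real fencing.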

Wido

> Andrei 
> 
> ----- Original Message -----
> 
>> From: "Wido den Hollander" <w...@widodh.nl>
>> To: dev@cloudstack.apache.org
>> Sent: Monday, 16 February, 2015 11:32:13 AM
>> Subject: Re: Disable HA temporary ?
> 
>> On 16-02-15 11:00, Andrija Panic wrote:
>>> Hi team,
>>>
>>> I just had funny behaviour a few days ago - one of my hosts was under
>>> heavy
>>> load (some disk/network load) and it went disconnected from MGMT
>>> server.
>>>
>>> Then MGMT server started doing the HA thing, but without being able to
>>> make sure
>>> that the VMs on the disconnected hosts are really shutdown (and
>>> they were
>>> NOT).
>>>
>>> So MGMT started again some VMs on other hosts, thus resulting in
>>> having 2
>>> copies of the same VM, using shared storage - so corruption happened
>>> on the
>>> disk.
>>>
>>> Is there a way to temporarily disable the HA feature on a global level, or
>>> anything
>>> similar ?
> 
>> Not that I'm aware of, but this is something I also ran into a
>> couple
>> of times.
> 
>> It would indeed be nice if there could be a way to stop the HA
>> process
>> completely as an Admin.
> 
>> Wido
> 
>>> Thanks
>>>
> 
