On 16-02-15 13:16, Andrei Mikhailovsky wrote:
> I had similar issues at least two or three times. The host agent would
> disconnect from the management server. The agent would not connect back to
> the management server without manual intervention; however, it would happily
> continue running the vms. The management server would initiate the HA and
> fire up vms, which are already running on the disconnected host. I ended up
> with a handful of vms and virtual routers being run on two hypervisors, thus
> corrupting the disk and having all sorts of issues ((( .
>
> I think there has to be a better way of dealing with this case. At least on
> an image level. Perhaps a host should keep some sort of lock file or a file
> for every image where it would record a time stamp. Something like:
>
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7 and
> f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp
>
> Thus, the f5ffa8b0-d852-41c8-a386-6efb8241e2e7 is the name of the disk image
> and f5ffa8b0-d852-41c8-a386-6efb8241e2e7-timestamp is the image's time stamp.
>
> The hypervisor should record the time stamp in this file while the vm is
> running. Let's say every 5-10 seconds. If the timestamp is old, we can assume
> that the volume is no longer used by the hypervisor.
>
> When a vm is started, the timestamp file should be checked and if the
> timestamp is recent, the vm should not start, otherwise, the vm should start
> and the timestamp file should be regularly updated.
>
> I am sure there are better ways of doing this, but at least this method
> should not allow two vms running on different hosts to use the same volume
> and corrupt the data.
>
> In ceph, as far as I remember, a new feature is being developed to provide a
> locking mechanism of an rbd image. Not sure if this will do the job?
>
Something like what you propose is still on my wishlist for Ceph/RBD.

For NFS we currently have this in place, but for Ceph/RBD we don't.

It's a matter of code in the Agent and the investigators inside the
Management Server which decide if HA should kick in.

Wido

> Andrei
>
> ----- Original Message -----
>
>> From: "Wido den Hollander" <w...@widodh.nl>
>> To: dev@cloudstack.apache.org
>> Sent: Monday, 16 February, 2015 11:32:13 AM
>> Subject: Re: Disable HA temporary ?
>
>> On 16-02-15 11:00, Andrija Panic wrote:
>>> Hi team,
>>>
>>> I just had funny behaviour a few days ago - one of my hosts was under
>>> heavy load (some disk/network load) and it went disconnected from the
>>> MGMT server.
>>>
>>> Then the MGMT server started doing the HA thing, but without being able
>>> to make sure that the VMs on the disconnected host were really shut
>>> down (and they were NOT).
>>>
>>> So MGMT started some VMs again on other hosts, thus resulting in having
>>> 2 copies of the same VM, using shared storage - so corruption happened
>>> on the disk.
>>>
>>> Is there a way to temporarily disable the HA feature on a global level,
>>> or anything similar ?
>
>> Not that I'm aware of, but this is something I also ran into a couple
>> of times.
>
>> It would indeed be nice if there could be a way to stop the HA process
>> completely as an Admin.
>
>> Wido
>
>>> Thanks
>>>
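
For illustration only, the per-image timestamp scheme Andrei describes above could be sketched roughly as follows. This is not CloudStack agent code and not the mechanism already in place for NFS; the function names (`write_heartbeat`, `read_heartbeat`, `may_start_vm`), the 30-second staleness threshold, and the assumption that the image sits on a shared filesystem visible to every host are all hypothetical.

```python
import os
import time

# Assumed threshold: a few missed 5-10 s heartbeats. Not a CloudStack setting.
STALE_AFTER = 30.0


def timestamp_path(image_path):
    """Return the companion file name, e.g. <uuid>-timestamp."""
    return image_path + "-timestamp"


def write_heartbeat(image_path, now=None):
    """Record the current time; the host would call this every 5-10 s
    for as long as the VM using this image is running."""
    now = time.time() if now is None else now
    path = timestamp_path(image_path)
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(repr(now))
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old file so a reader never sees a half-written value.
    os.replace(tmp, path)


def read_heartbeat(image_path):
    """Return the last recorded time, or None if missing or unreadable."""
    try:
        with open(timestamp_path(image_path)) as f:
            return float(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def may_start_vm(image_path, now=None, stale_after=STALE_AFTER):
    """Allow a start only if no host has written a recent heartbeat."""
    now = time.time() if now is None else now
    last = read_heartbeat(image_path)
    return last is None or (now - last) > stale_after
```

Two caveats apply to a sketch like this: the check in `may_start_vm` and the first heartbeat are not one atomic step, so two hosts could both see a stale timestamp at the same moment, and comparing timestamps across hosts only works if their clocks are reasonably in sync. It narrows the window for a double start rather than closing it the way a real lock would.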