Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Bryan Whitehead Wed, 24 Jul 2013 22:49:13 -0700

Starting off, there is never going to be a way to "conclusively"
decide if a host is down. This is just the nature of complex systems.
We can only hope our software does "well" - and if "well" is "wrong" -
we have a way to clean up the mess created.


That said, I like the old behavior 3.0.x has. As I mentioned in
CLOUDSTACK-3535, I've had a host lose its network (e1000 oops in kernel)
and HA got triggered. The storage (in this case gluster using a
sharedmountpoint) wouldn't let qemu-kvm start on another host because the
underlying
qcow2 file was locked by an already running qemu-kvm process (on the
machine that lost network). So HA being triggered didn't ruin any VM
disks. Gluster was running on Infiniband so the shared storage with
working locks prevented HA from screwing things up.
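
As a rough illustration of that point (this is not how gluster or qemu-kvm
implement locking), the sketch below takes an advisory flock on a stand-in
image file. While the first holder keeps its exclusive lock, a second attempt
fails instead of going ahead. The helper name try_start_vm and the temporary
file are made up for the example, and it assumes a Linux host.

```python
import fcntl
import os
import tempfile


def try_start_vm(image_path):
    """Hypothetical stand-in for starting a VM: lock its disk image exclusively.

    Returns the open file object (lock held) on success, or None if another
    holder already has the lock.
    """
    handle = open(image_path, "r+b")
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


def demo():
    # Temporary file standing in for a qcow2 image on shared storage.
    fd, image_path = tempfile.mkstemp(suffix=".qcow2")
    os.close(fd)
    try:
        original = try_start_vm(image_path)    # process on the host that lost network
        ha_attempt = try_start_vm(image_path)  # HA trying to start a second copy
        blocked = original is not None and ha_attempt is None
        original.close()                       # the original process finally exits
        retry = try_start_vm(image_path)       # now a restart can take the lock
        restarted = retry is not None
        if retry is not None:
            retry.close()
        return blocked, restarted
    finally:
        os.remove(image_path)
```

Calling demo() should show the second start being refused while the first
lock is held, and a restart succeeding once that lock is released.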

Further, even if gluster lost connectivity, gluster itself would
split-brain and later I could decide which qcow2/disk image should be
"truth". Do I keep the VM that kept on running? Or do I keep the
version HA booted and fscked? That's for me - the user - to decide.

As a cloudstack admin/user I understand the risks of HA and I choose
to live with them - I've even made sure that should such a disaster
happen I can recover (gluster will split brain as well). The #1 reason
for choosing HA is I want the VM to be available as much as possible.

Right now 4.1 DOES NOT have HA... I don't know how "emailing the admin
to figure out what to do" is being entertained as an option. That's
just nonsense and is NOT HIGH AVAILABILITY. IMHO, if one is so
terrified of HA screwing up, they should probably pass on HA and
manually start things up.

When a simple reproducible test like pulling the plug on a host can't
trigger an HA event - then that feature doesn't exist. It is as simple
as that.
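
A rough sketch of the behavior being asked for, as opposed to waiting for an
admin: once a host has been unreachable for longer than a configurable grace
period, its HA-enabled VMs get restarted elsewhere. None of this is CloudStack
code. HostStatus, Action, decide and ha_timeout_seconds are names invented for
the example, and the 300 second default is an arbitrary assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    TRIGGER_HA = "trigger_ha"


@dataclass
class HostStatus:
    # All fields are hypothetical, not CloudStack's data model.
    host_id: int
    last_seen: float       # time of the last successful agent ping, in seconds
    ha_enabled_vms: list   # names of VMs whose offering has HA turned on


def decide(host, now, ha_timeout_seconds=300.0):
    """Return the action to take and the VMs to restart on another host."""
    unreachable_for = now - host.last_seen
    if unreachable_for < ha_timeout_seconds or not host.ha_enabled_vms:
        # Still inside the grace period, or nothing to restart.
        return Action.WAIT, []
    # Grace period exceeded: stop waiting and restart the HA-enabled VMs.
    return Action.TRIGGER_HA, list(host.ha_enabled_vms)
```

The point of the timeout is that "cannot conclusively tell" turns into a
decision after a bounded delay rather than staying undecided indefinitely.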

On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <[email protected]> wrote:
> There is another bug for the same. CLOUDSTACK-3421
> This document nicely explains how HA works in Cloudstack 
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide.
>
> As can be seen from the logs in this case, Cloudstack is not able to 
> conclusively determine if the host is 'down' and so does nothing. Suppose HA 
> was done for the VMs in this case and later on the host came back up. This 
> will corrupt the VM disks which is not desirable.
>
> Possible options:
> - If the host state cannot be determined conclusively for some configurable
> time, the host may be put into a special state, and the admin can then take
> appropriate action by manually triggering HA.
> - If a KVM cluster had the concept of something like a 'master', the state
> of any host in the cluster could be determined from it. Something similar
> exists for XS (XenServer).
>
> Thoughts?
>
>
>> -----Original Message-----
>> From: Bryan Whitehead [mailto:[email protected]]
>> Sent: Thursday, July 25, 2013 7:58 AM
>> To: [email protected]
>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>
>> CLOUDSTACK-3535 bug looks like it is describing the problem perfectly.
>> What else can we add?
>>
>> On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers <[email protected]>
>> wrote:
>> > This sucks.
>> >
>> > Can one of the folks on this thread please open a bug with as much
>> > information as possible?  I'd like to make sure that someone picks up
>> > the issue and gets it resolved for the next release.
>> >
>> >
>> >
>> > On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead <[email protected]> wrote:
>> >
>> >> This same thing happened to me - but it was a power supply that died
>> >> on a box. All my templates have HA turned on.
>> >>
>> >> All the VMs (including 1 system-router-vm) were shown as "Running"
>> >> and the host itself was simply marked "Disconnected". When I tried to
>> >> shut down the VMs to start them again I got errors about not being
>> >> able to communicate with the agent. I tried restarting the management
>> >> server but that didn't change anything.
>> >>
>> >> Getting the router working again was extremely annoying. After
>> >> changing it to Stopped it kept trying to start it again on the dead
>> >> host. I marked it destroyed then restarted the network with the force
>> >> option. That fixed it. After I hacked the DB to change all the VMs that
>> >> were not actually running from state Running to Stopped, I was able to
>> >> start all the VMs that were down on the bad host.
>> >>
>> >> Anyway, the time between host death and me finding out was about 4
>> >> days - as these were on managed servers of a customer and their
>> >> monitoring of each host wasn't working. They were pretty unhappy. :(
>> >>
>> >> Other notes: this is KVM with sharedmountpoint on a gluster mount.
>> >> After the host got back online gluster rsynced about 200GB of data - I
>> >> migrated VMs to the host at the same time as normal. I've had
>> >> similar things happen with a 3.0.2 install of cloudstack and everything
>> >> seamlessly restarted. Disappointing this happened with 4.1.
>> >>
>> >> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <[email protected]> wrote:
>> >> > Dear Chip, Geoff and all,
>> >> >
>> >> > I scrutinized the management server's logs from the time when I shut
>> >> > down the host and the time when I turned the host back on.
>> >> >
>> >> > These are the management server's logs from when the host was being
>> >> > shut down:
>> >> >
>> >> > http://pastebin.com/4wfV830Z
>> >> >
>> >> > During that time, I noted quite a lot of "Sending Disconnect to
>> >> > listener" messages, which implies that the management server tries
>> >> > to notify other listeners that the host is going down. However,
>> >> > I subsequently didn't see any messages in the logs showing that the
>> >> > management server was trying to activate the HA capability to start
>> >> > the affected VMs on another available host.
>> >> >
>> >> > These are the management server's logs from when the host was being
>> >> > turned back on:
>> >> >
>> >> > http://pastebin.com/JrLJxbXH
>> >> >
>> >> > When the agent reconnected, CloudStack changed the affected VMs from
>> >> > Running to Stopped:
>> >> >
>> >> > ===
>> >> > 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
>> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and realState = Stopped
>> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and realState = Stopped
>> >> > 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentConnectTaskPool-7:null) VM does not require investigation so I'm marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
>> >> > 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl] (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 28 new host id: 34 host id before state transition: 34
>> >> > ===
>> >> >
>> >> > Then the HA starts to kick in.
>> >> >
>> >> > ===
>> >> > 2013-07-24 23:04:57,955 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
>> >> > 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl] (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 28 new host id: 34 host id before state transition: 34
>> >> > 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request] (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100111, [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
>> >> > 2013-07-24 23:04:57,968 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
>> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl] (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with event: StartRequestedvm's original host id: 28 new host id: null host id before state transition: null
>> >> > 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Successfully transitioned to start state for VM[User|Ubuntu-12-04-2-64bit] reservation id = b56364ef-90d8-443f-a348-7660fda48d34
>> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
>> >> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
>> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's cluster
>> >> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
>> >> > ===
>> >> >
>> >> > My question is: why does HA only kick in when the host is turned back
>> >> > on? By rights, it should kick in soon after the host is shut down and
>> >> > marked as "Disconnected".
>> >> >
>> >> > Any insights on possible solutions to this problem are highly
>> >> > appreciated.
>> >> >
>> >> > Looking forward to your reply, thank you.
>> >> >
>> >> > Cheers.
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <[email protected]> wrote:
>> >> >
>> >> >> Hi Chip,
>> >> >>
>> >> >> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>> >> >>
>> >> >> Hi Geoff,
>> >> >>
>> >> >> Yes, I am using KVM. Is this a known issue and is there any
>> >> >> solution to this problem?
>> >> >>
>> >> >> Looking forward to your reply, thank you.
>> >> >>
>> >> >> Cheers.
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
>> >> >> [email protected]> wrote:
>> >> >>
>> >> >>> Is it running on KVM? We are seeing some real issues with HA
>> >> >>> simply not working on KVM.
>> >> >>>
>> >> >>> Regards
>> >> >>>
>> >> >>> Geoff Higginbottom
>> >> >>>
>> >> >>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>> >> >>>
>> >> >>> [email protected]
>> >> >>>
>> >> >>> -----Original Message-----
>> >> >>> From: Chip Childers [mailto:[email protected]]
>> >> >>> Sent: 24 July 2013 16:37
>> >> >>> To: <[email protected]>
>> >> >>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor
>> >> >>> hosts
>> >> >>>
>> >> >>> Did you enable HA for your compute offering?
>> >> >>>
>> >> >>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <[email protected]> wrote:
>> >> >>>
>> >> >>> > Dear all,
>> >> >>> >
>> >> >>> > I tried to shut down one of my hypervisor hosts to simulate a
>> >> >>> > server failure, and HA is not working: none of the VMs on the
>> >> >>> > affected host are started on another available host.
>> >> >>> >
>> >> >>> > I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD
>> >> >>> > for primary storage.
>> >> >>> >
>> >> >>> > My issue is similar to what is being described here:
>> >> >>> >
>> >> >>> > https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>> >> >>> >
>> >> >>> > Except that in my case, the host is indeed marked as "Disconnected"
>> >> >>> > but there is no attempt from CloudStack to try starting the VMs
>> >> >>> > on another host. I can't provide logs since there's nothing in
>> >> >>> > the logs which suggests that CloudStack tries to activate HA
>> >> >>> > and start the affected VMs on another host.
>> >> >>> >
>> >> >>> > Has anyone had a similar experience? Does anyone know if the
>> >> >>> > above bug has been resolved?
>> >> >>> >
>> >> >>> > Looking forward to your reply, thank you.
>> >> >>> >
>> >> >>> > Cheers.
>> >> >>>
>> >> >>
>> >> >>
>> >>
>> >>
