Is there a workaround, or a database update, to declare a host dead so that HA operations can be triggered?
2013/7/25 Lennert den Teuling <lenn...@pcextreme.nl>:
> On 25-07-13 07:48, Bryan Whitehead wrote:
>> Starting off, there is never going to be a way to "conclusively"
>> decide if a host is down. This is just the nature of complex systems.
>> We can only hope our software does "well" - and if "well" is "wrong" -
>> we have a way to clean up the mess created.
>>
>> That said, I like the old behavior 3.0.x has. As I mentioned in -3535,
>> I've had a host lose its network (e1000 oops in the kernel) and HA got
>> triggered. The storage (in this case gluster using a SharedMountPoint)
>> wouldn't let qemu-kvm start on another host, because the underlying
>> qcow2 file was locked by an already-running qemu-kvm process (on the
>> machine that lost network). So HA being triggered didn't ruin any VM
>> disks. Gluster was running on InfiniBand, so the shared storage with
>> working locks prevented HA from screwing things up.
>>
>> Further, even if gluster lost connectivity, gluster itself would
>> split-brain and later I could decide which qcow2/disk image should be
>> "truth". Do I keep the VM that kept on running, or do I keep the
>> version HA booted and fsck'ed? That's for me - the user - to decide.
>>
>> As a CloudStack admin/user I understand the risks of HA and I choose
>> to live with them - I've even made sure that should such a disaster
>> happen I can recover (gluster will split-brain as well). The #1 reason
>> for choosing HA is that I want the VM to be available as much as
>> possible.
>>
>> Right now 4.1 DOES NOT have HA. I don't know how "emailing the admin
>> to figure out what to do" is being entertained as an option. That's
>> just nonsense and is NOT HIGH AVAILABILITY. IMHO, if one is so
>> terrified of HA screwing up, they should probably pass on HA and
>> manually start things up.
>>
>> When a simple, reproducible test like pulling the plug on a host can't
>> trigger an HA event, then that feature doesn't exist. It is as simple
>> as that.
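[Editor's note: Bryan's point about the qcow2 lock is worth making concrete. A second qemu-kvm cannot grab an exclusive lock on a disk image that a still-running process holds, so even a mis-fired HA start fails safely. Below is a minimal sketch of that guard using POSIX advisory locks on a local file; shared filesystems such as gluster generally propagate these locks across the mount. This is an illustration of the mechanism only - it is not CloudStack's or qemu's actual locking code.]

```python
import fcntl
import os

def try_exclusive_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on a disk
    image. Returns the open fd on success, or None if another process
    (e.g. a still-running qemu-kvm on the 'dead' host) holds the lock.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # caller keeps this open for the lifetime of the VM
    except BlockingIOError:
        # Lock held elsewhere: refuse to start a second instance.
        os.close(fd)
        return None
```

On Linux, flock() locks taken through separately opened descriptors conflict even within one process, so a second `try_exclusive_lock()` on the same path fails until the first fd is closed - the same property that kept Bryan's HA restart from corrupting the image.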
>
> I would like to add that when testing this on our development cluster,
> something bizarre happened.
>
> First, when I killed the VMs _and_ the agent on the host, the HA worked
> just fine: after 10 minutes everything was restarted on a working host.
>
> The second time, I turned off the host and nothing happened:
>
> 2013-07-25 15:31:41,347 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-3:null) host (192.168.122.32) cannot be pinged, returning
> null ('I don't know')
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-3:null) could not reach agent, could not reach agent's host,
> returning that we don't have enough information
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,349 WARN [agent.manager.AgentManagerImpl]
> (AgentTaskPool-3:null) Agent state cannot be determined, do nothing
>
> So when the host is still pingable it's "OK" to do HA, but when it is
> totally unreachable it's not?
>
> My third try was even worse. I killed the agent, forgot to kill the VMs,
> and the management server restarted the VMs on another host - and it
> seems that all images are corrupted.
>
> 2013-07-25 15:37:31,614 DEBUG [agent.manager.AgentManagerImpl]
> (HA-Worker-2:work-29) Details from executing class
> com.cloud.agent.api.PingTestCommand:
> PING 192.168.122.170 (192.168.122.170): 56 data bytes
> 64 bytes from 192.168.122.161: Destination Host Unreachable
> Vr HL TOS Len ID Flg off TTL Pro cks Src Dst Data
> 4 5 00 5400 0000 0 0040 40 01 0cc4 192.168.122.161 192.168.122.170
> --- 192.
> 168.122.170 ping statistics ---
> 1 packets transmitted, 0 packets received, 100% packet loss
> Unable to ping the vm, exiting
> 2013-07-25 15:37:31,614 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-2:work-29) VM[User|c88924e9-a8c9-4705-acc8-3237ffcf009d]
> could not be pinged, returning that it is unknown
>
> Ping is disabled by default if you use security groups, so a ping test
> is not reliable.
>
> Concluding that a VM is down based on a simple ping test is, when you
> use security groups for example, not the right option. (It's even
> dangerous.)
>
> I will do some more tests, but if it's true that my last HA was based
> on a failed ping, I will need to turn ping on on all my production
> instances asap.
>
> I do agree with Bryan that HA needs to go automatically, without
> intervention of a sysadmin.
>
> I think you could base an HA operation on:
> - an unreachable agent
> - an unpingable host
> - a file with a timestamp on the network storage which is updated every
>   X seconds; when it's not updated, something is wrong
>
> Ideally the management server would turn off the host using IPMI to
> make sure it's dead; then you are sure no corruption will happen.
>
> On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <koushik....@citrix.com> wrote:
>> There is another bug for the same: CLOUDSTACK-3421.
>> This document nicely explains how HA works in CloudStack:
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide
>>
>> As can be seen from the logs in this case, CloudStack is not able to
>> conclusively determine if the host is 'down' and so does nothing.
>> Suppose HA was done for the VMs in this case and later on the host
>> came back up. This would corrupt the VM disks, which is not desirable.
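[Editor's note: Lennert's third criterion - a timestamp file on the network storage, updated every X seconds - is straightforward to sketch. Each host periodically rewrites its own file; the management side only compares the file's mtime against a staleness window. The file names, layout and the 60-second threshold here are illustrative assumptions, not an existing CloudStack feature.]

```python
import os
import time

HEARTBEAT_STALE_AFTER = 60  # seconds; illustrative threshold

def write_heartbeat(path):
    """Run on each host every few seconds: rewrite the file so its
    mtime proves the host is alive and can still reach the shared
    storage."""
    with open(path, "w") as f:
        f.write(str(time.time()))

def heartbeat_is_stale(path, now=None):
    """Management-server side: a missing or old heartbeat file means
    the host is dead or cut off from the storage - a much stronger
    signal than an ICMP ping, which security groups may block."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return (now - mtime) > HEARTBEAT_STALE_AFTER
```

A stale heartbeat would then be combined with Lennert's other two signals and, ideally, an IPMI power-off of the host before any VM is restarted elsewhere.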
>>
>> Possible options:
>> - If the host state cannot be determined conclusively for some
>>   configurable time, the host may be put into some special state and
>>   the admin can then take appropriate action by manually triggering HA.
>> - Check whether a KVM cluster has the concept of something like a
>>   'master', from which the state of any host in the cluster can be
>>   determined. Something similar is there for XS.
>>
>> Thoughts?
>>
>>> -----Original Message-----
>>> From: Bryan Whitehead [mailto:dri...@megahappy.net]
>>> Sent: Thursday, July 25, 2013 7:58 AM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>>
>>> The CLOUDSTACK-3535 bug looks like it is describing the problem
>>> perfectly. What else can we add?
>>>
>>> On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers
>>> <chip.child...@sungard.com> wrote:
>>>> This sucks.
>>>>
>>>> Can one of the folks on this thread please open a bug with as much
>>>> information as possible? I'd like to make sure that someone picks up
>>>> the issue and gets it resolved for the next release.
>>>>
>>>> On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead
>>>> <dri...@megahappy.net> wrote:
>>>>> This same thing happened to me - but it was a power supply that
>>>>> died on a box. All my templates have HA turned on.
>>>>>
>>>>> All the VMs (including 1 system-router VM) were shown as "Running"
>>>>> and the host itself was simply marked "Disconnected". When I tried
>>>>> to shut down the VMs to start them again, I got errors about not
>>>>> being able to communicate with the agent. I tried restarting the
>>>>> management server, but that didn't change anything.
>>>>>
>>>>> Getting the router working again was extremely annoying. After
>>>>> changing it to Stopped, it kept trying to start it again on the
>>>>> dead host. I marked it destroyed, then restarted the network with
>>>>> the force option. That fixed it.
>>>>> After I hacked the DB to move all my VMs that were not actually
>>>>> running, but still had state Running, to Stopped, I was able to
>>>>> start all the VMs that were down on the bad host.
>>>>>
>>>>> Anyway, the time between the host's death and me finding out was
>>>>> about 4 days - as these were on managed servers of a customer, and
>>>>> their monitoring of each host wasn't working. They were pretty
>>>>> unhappy. :(
>>>>>
>>>>> Other notes: this is KVM with SharedMountPoint on a gluster mount.
>>>>> After the host got back online, gluster resynced about 200GB of
>>>>> data - I migrated VMs to the host at the same time as normal. I've
>>>>> had a similar thing happen with a 3.0.2 install of CloudStack and
>>>>> everything seamlessly restarted. Disappointing this happened with
>>>>> 4.1.
>>>>>
>>>>> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>> Dear Chip, Geoff and all,
>>>>>>
>>>>>> I scrutinized the management server's logs during the time when I
>>>>>> shut down the host and the time when I turned the host back on.
>>>>>>
>>>>>> This is the management server's log from when the host was being
>>>>>> shut down:
>>>>>>
>>>>>> http://pastebin.com/4wfV830Z
>>>>>>
>>>>>> During that time, I noted that there are quite a lot of "Sending
>>>>>> Disconnect to listener" messages, which implies that the
>>>>>> management server tries to notify the listeners that the host is
>>>>>> going down. However, I subsequently didn't see any messages in the
>>>>>> logs showing that the management server tries to activate the HA
>>>>>> capability to start the affected VMs on another available host.
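[Editor's note: both Lennert's logs ("cannot be pinged, returning null ('I don't know')") and Indra's - where a Disconnected host never triggers HA - come down to the same thing: the investigators return a three-valued answer, and HA is scheduled only on a conclusive Down. The sketch below is a toy model of that decision as observed in this thread, not CloudStack's actual investigator chain; the function names and inputs are assumptions for illustration.]

```python
from enum import Enum

class HostState(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"   # the "I don't know" / null case in the logs

def investigate(agent_reachable, host_pingable):
    """Toy model of the tri-state verdict seen in the logs."""
    if agent_reachable:
        return HostState.UP
    if host_pingable:
        # Agent gone but the host still answers: peers can probe the
        # VMs, so a definite verdict (and hence HA) is possible.
        return HostState.DOWN
    # Host fully unreachable: no verdict is reached, and therefore no
    # HA - the surprising behaviour reported in this thread.
    return HostState.UNKNOWN

def should_schedule_ha(state):
    """HA is scheduled only on a conclusive DOWN, never on UNKNOWN."""
    return state is HostState.DOWN
```

This is why pulling the plug (agent and host both unreachable) leaves the VMs stranded: the unreachable case maps to UNKNOWN, and UNKNOWN never schedules HA.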
>>>>>>
>>>>>> This is the management server's log from when the host was being
>>>>>> turned back on:
>>>>>>
>>>>>> http://pastebin.com/JrLJxbXH
>>>>>>
>>>>>> When the agent reconnected, CloudStack marked the affected VMs as
>>>>>> stopped (previously running):
>>>>>>
>>>>>> ===
>>>>>> 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>>>>>> realState = Stopped
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>>>>>> realState = Stopped
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM does not require investigation so
>>>>>> I'm marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
>>>>>> 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>>>>>> Stopping with event: StopRequestedvm's original host id: 28 new
>>>>>> host id: 34 host id before state transition: 34
>>>>>> ===
>>>>>>
>>>>>> Then the HA starts to kick in.
>>>>>>
>>>>>> ===
>>>>>> 2013-07-24 23:04:57,955 INFO [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
>>>>>> 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>>>>>> Stopping with event: StopRequestedvm's original host id: 28 new
>>>>>> host id: 34 host id before state transition: 34
>>>>>> 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
>>>>>> (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending { Cmd ,
>>>>>> MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100111,
>>>>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
>>>>>> 2013-07-24 23:04:57,968 INFO [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
>>>>>> 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (HA-Worker-1:work-307) VM state transitted from :Stopped to
>>>>>> Starting with event: StartRequestedvm's original host id: 28 new
>>>>>> host id: null host id before state transition: null
>>>>>> 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Successfully transitioned to start state for
>>>>>> VM[User|Ubuntu-12-04-2-64bit] reservation id =
>>>>>> b56364ef-90d8-443f-a348-7660fda48d34
>>>>>> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and
>>>>>> podId: 6
>>>>>> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null,
>>>>>> hosts: null
>>>>>> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Root volume is ready, need to place VM in
>>>>>> volume's cluster
>>>>>> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing
>>>>>> deployment plan to use this pool's dcId: 6 , podId: 6 , and
>>>>>> clusterId: 6
>>>>>> ===
>>>>>>
>>>>>> My question is: why does HA only kick in when the host is turned
>>>>>> back on? By right, it should kick in soon after the host is shut
>>>>>> down and marked as "Disconnected".
>>>>>>
>>>>>> Any insights on the possible solutions to this problem are highly
>>>>>> appreciated.
>>>>>>
>>>>>> Looking forward to your reply, thank you.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>> On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>>> Hi Chip,
>>>>>>>
>>>>>>> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>>>>>>>
>>>>>>> Hi Geoff,
>>>>>>>
>>>>>>> Yes, I am using KVM. Is this a known issue, and is there any
>>>>>>> solution to this problem?
>>>>>>>
>>>>>>> Looking forward to your reply, thank you.
>>>>>>>
>>>>>>> Cheers.
>>>>>>>
>>>>>>> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom
>>>>>>> <geoff.higginbot...@shapeblue.com> wrote:
>>>>>>>> Is it running on KVM? We are seeing some real issues with HA
>>>>>>>> simply not working on KVM.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Geoff Higginbottom
>>>>>>>>
>>>>>>>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>>>>>>>>
>>>>>>>> geoff.higginbot...@shapeblue.com
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chip Childers [mailto:chip.child...@sungard.com]
>>>>>>>> Sent: 24 July 2013 16:37
>>>>>>>> To: <users@cloudstack.apache.org>
>>>>>>>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>>>>>>>
>>>>>>>> Did you enable HA for your compute offering?
>>>>>>>>
>>>>>>>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> I tried to shut down one of my hypervisor hosts to simulate a
>>>>>>>>> server failure, and HA is not working: none of the VMs on the
>>>>>>>>> affected host are started on another available host.
>>>>>>>>>
>>>>>>>>> I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD
>>>>>>>>> for primary storage.
>>>>>>>>>
>>>>>>>>> My issue is similar to what is being described here:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>>>>>>>>>
>>>>>>>>> Except that in my case, the host is indeed marked as
>>>>>>>>> "Disconnected", but there is no attempt from CloudStack to try
>>>>>>>>> starting the VMs on another host. I can't provide logs, since
>>>>>>>>> there's nothing in the logs which suggests that CloudStack
>>>>>>>>> tries to activate HA and start the affected VMs on another
>>>>>>>>> host.
>>>>>>>>>
>>>>>>>>> Does anyone have a similar experience? Does anyone know if the
>>>>>>>>> above bug has been resolved?
>>>>>>>>>
>>>>>>>>> Looking forward to your reply, thank you.
>>>>>>>>>
>>>>>>>>> Cheers.
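[Editor's note: Koushik's first option upthread - leave a host whose state cannot be determined alone for a configurable window, then move it into a special state that the admin can act on by manually triggering HA - could be modelled as below. The state names and the 600-second default are assumptions for illustration; this is not an implemented CloudStack feature.]

```python
INDETERMINATE_GRACE = 600  # seconds; assumed configurable default

class HostStateTracker:
    """Tracks how long each host's state has been indeterminate and,
    past a grace window, escalates to a special state the admin can
    act on - instead of silently doing nothing forever."""

    def __init__(self, grace=INDETERMINATE_GRACE):
        self.grace = grace
        self._unknown_since = {}

    def report(self, host_id, state, now):
        """Feed each investigation result in; returns the effective
        state ('Unknown' becomes 'PendingAdminAction' after the
        grace window elapses)."""
        if state != "Unknown":
            # Any conclusive report clears a pending escalation.
            self._unknown_since.pop(host_id, None)
            return state
        first_seen = self._unknown_since.setdefault(host_id, now)
        if now - first_seen >= self.grace:
            return "PendingAdminAction"
        return "Unknown"
```

Combined with IPMI fencing, "PendingAdminAction" could even become an automatic power-off-then-HA step, addressing Bryan's objection that waiting for an admin is not high availability.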