Is there a workaround, or a database update, to declare a host dead so that HA operations can be triggered?
2013/7/25 Lennert den Teuling <lenn...@pcextreme.nl>:
> On 25-07-13 07:48, Bryan Whitehead wrote:
>> Starting off, there is never going to be a way to "conclusively"
>> decide if a host is down. This is just the nature of complex systems.
>> We can only hope our software does "well" - and if "well" is "wrong" -
>> we have a way to clean up the mess created.
>>
>> That said, I like the old behavior 3.0.x has. As I mentioned in -3535,
>> I've had a host lose its network (e1000 oops in the kernel) and HA got
>> triggered. The storage (in this case gluster using a SharedMountPoint)
>> wouldn't let qemu-kvm start on another host, because the underlying
>> qcow2 file was locked by an already-running qemu-kvm process (on the
>> machine that lost network). So HA being triggered didn't ruin any VM
>> disks. Gluster was running on InfiniBand, so the shared storage with
>> working locks prevented HA from screwing things up.
>>
>> Further, even if gluster lost connectivity, gluster itself would
>> split-brain and later I could decide which qcow2/disk image should be
>> "truth". Do I keep the VM that kept on running, or do I keep the
>> version HA booted and fsck'ed? That's for me - the user - to decide.
>>
>> As a CloudStack admin/user I understand the risks of HA and I choose
>> to live with them - I've even made sure that should such a disaster
>> happen I can recover (gluster will split-brain as well). The #1 reason
>> for choosing HA is that I want the VM to be available as much as
>> possible.
>>
>> Right now 4.1 DOES NOT have HA. I don't know how "emailing the admin
>> to figure out what to do" is being entertained as an option. That's
>> just nonsense and is NOT HIGH AVAILABILITY. IMHO, if one is so
>> terrified of HA screwing up, they should probably pass on HA and
>> manually start things up.
>>
>> When a simple, reproducible test like pulling the plug on a host can't
>> trigger an HA event, then that feature doesn't exist. It is as simple
>> as that.
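[Editor's note: Bryan's point about the qcow2 lock is worth making concrete. A second qemu-kvm cannot grab an exclusive lock on a disk image that a still-running process holds, so even a mis-fired HA start fails safely. Below is a minimal sketch of that guard using POSIX advisory locks on a local file; shared filesystems such as gluster generally propagate these locks across the mount. This is an illustration of the mechanism only - it is not CloudStack's or qemu's actual locking code.]

```python
import fcntl
import os

def try_exclusive_lock(path):
    """Try to take an exclusive, non-blocking advisory lock on a disk
    image. Returns the open fd on success, or None if another process
    (e.g. a still-running qemu-kvm on the 'dead' host) holds the lock.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # caller keeps this open for the lifetime of the VM
    except BlockingIOError:
        # Lock held elsewhere: refuse to start a second instance.
        os.close(fd)
        return None
```

On Linux, flock() locks taken through separately opened descriptors conflict even within one process, so a second `try_exclusive_lock()` on the same path fails until the first fd is closed - the same property that kept Bryan's HA restart from corrupting the image.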
>
> I would like to add that when testing this on our development cluster,
> something bizarre happened.
>
> First, when I killed the VMs _and_ the agent on the host, the HA worked
> just fine: after 10 minutes everything was restarted on a working host.
>
> The second time, I turned off the host and nothing happened:
>
> 2013-07-25 15:31:41,347 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-3:null) host (192.168.122.32) cannot be pinged, returning
> null ('I don't know')
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-3:null) could not reach agent, could not reach agent's host,
> returning that we don't have enough information
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-3:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-25 15:31:41,349 WARN [agent.manager.AgentManagerImpl]
> (AgentTaskPool-3:null) Agent state cannot be determined, do nothing
>
> So when the host is still pingable it's "OK" to do HA, but when it is
> totally unreachable it's not?
>
> My third try was even worse. I killed the agent, forgot to kill the VMs,
> and the management server restarted the VMs on another host - and it
> seems that all images are corrupted.
>
> 2013-07-25 15:37:31,614 DEBUG [agent.manager.AgentManagerImpl]
> (HA-Worker-2:work-29) Details from executing class
> com.cloud.agent.api.PingTestCommand:
> PING 192.168.122.170 (192.168.122.170): 56 data bytes
> 64 bytes from 192.168.122.161: Destination Host Unreachable
> Vr HL TOS Len ID Flg off TTL Pro cks Src Dst Data
> 4 5 00 5400 0000 0 0040 40 01 0cc4 192.168.122.161 192.168.122.170
> --- 192.
> 168.122.170 ping statistics ---
> 1 packets transmitted, 0 packets received, 100% packet loss
> Unable to ping the vm, exiting
> 2013-07-25 15:37:31,614 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (HA-Worker-2:work-29) VM[User|c88924e9-a8c9-4705-acc8-3237ffcf009d]
> could not be pinged, returning that it is unknown
>
> Ping is disabled by default if you use security groups, so a ping test
> is not reliable.
>
> Concluding that a VM is down based on a simple ping test is, when you
> use security groups for example, not the right option. (It's even
> dangerous.)
>
> I will do some more tests, but if it's true that my last HA was based
> on a failed ping, I will need to turn ping on on all my production
> instances asap.
>
> I do agree with Bryan that HA needs to go automatically, without
> intervention of a sysadmin.
>
> I think you could base an HA operation on:
> - an unreachable agent
> - an unpingable host
> - a file with a timestamp on the network storage which is updated every
>   X seconds; when it's not updated, something is wrong
>
> Ideally the management server would turn off the host using IPMI to
> make sure it's dead; then you are sure no corruption will happen.
>
> On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <koushik....@citrix.com> wrote:
>> There is another bug for the same: CLOUDSTACK-3421.
>> This document nicely explains how HA works in CloudStack:
>> https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide
>>
>> As can be seen from the logs in this case, CloudStack is not able to
>> conclusively determine if the host is 'down' and so does nothing.
>> Suppose HA was done for the VMs in this case and later on the host
>> came back up. This would corrupt the VM disks, which is not desirable.
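[Editor's note: Lennert's third criterion - a timestamp file on the network storage, updated every X seconds - is straightforward to sketch. Each host periodically rewrites its own file; the management side only compares the file's mtime against a staleness window. The file names, layout and the 60-second threshold here are illustrative assumptions, not an existing CloudStack feature.]

```python
import os
import time

HEARTBEAT_STALE_AFTER = 60  # seconds; illustrative threshold

def write_heartbeat(path):
    """Run on each host every few seconds: rewrite the file so its
    mtime proves the host is alive and can still reach the shared
    storage."""
    with open(path, "w") as f:
        f.write(str(time.time()))

def heartbeat_is_stale(path, now=None):
    """Management-server side: a missing or old heartbeat file means
    the host is dead or cut off from the storage - a much stronger
    signal than an ICMP ping, which security groups may block."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return True
    return (now - mtime) > HEARTBEAT_STALE_AFTER
```

A stale heartbeat would then be combined with Lennert's other two signals and, ideally, an IPMI power-off of the host before any VM is restarted elsewhere.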
>>
>> Possible options:
>> - If the host state cannot be determined conclusively for some
>>   configurable time, the host may be put into some special state and
>>   the admin can then take appropriate action by manually triggering HA.
>> - Check whether a KVM cluster has the concept of something like a
>>   'master', from which the state of any host in the cluster can be
>>   determined. Something similar is there for XS.
>>
>> Thoughts?
>>
>>> -----Original Message-----
>>> From: Bryan Whitehead [mailto:dri...@megahappy.net]
>>> Sent: Thursday, July 25, 2013 7:58 AM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>>
>>> The CLOUDSTACK-3535 bug looks like it is describing the problem
>>> perfectly. What else can we add?
>>>
>>> On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers
>>> <chip.child...@sungard.com> wrote:
>>>> This sucks.
>>>>
>>>> Can one of the folks on this thread please open a bug with as much
>>>> information as possible? I'd like to make sure that someone picks up
>>>> the issue and gets it resolved for the next release.
>>>>
>>>> On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead
>>>> <dri...@megahappy.net> wrote:
>>>>> This same thing happened to me - but it was a power supply that
>>>>> died on a box. All my templates have HA turned on.
>>>>>
>>>>> All the VMs (including 1 system-router VM) were shown as "Running"
>>>>> and the host itself was simply marked "Disconnected". When I tried
>>>>> to shut down the VMs to start them again, I got errors about not
>>>>> being able to communicate with the agent. I tried restarting the
>>>>> management server, but that didn't change anything.
>>>>>
>>>>> Getting the router working again was extremely annoying. After
>>>>> changing it to Stopped, it kept trying to start it again on the
>>>>> dead host. I marked it destroyed, then restarted the network with
>>>>> the force option. That fixed it.
>>>>> After I hacked the DB to move all my VMs that were not actually
>>>>> running, but still had state Running, to Stopped, I was able to
>>>>> start all the VMs that were down on the bad host.
>>>>>
>>>>> Anyway, the time between the host's death and me finding out was
>>>>> about 4 days - as these were on managed servers of a customer, and
>>>>> their monitoring of each host wasn't working. They were pretty
>>>>> unhappy. :(
>>>>>
>>>>> Other notes: this is KVM with SharedMountPoint on a gluster mount.
>>>>> After the host got back online, gluster resynced about 200GB of
>>>>> data - I migrated VMs to the host at the same time as normal. I've
>>>>> had a similar thing happen with a 3.0.2 install of CloudStack and
>>>>> everything seamlessly restarted. Disappointing this happened with
>>>>> 4.1.
>>>>>
>>>>> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>> Dear Chip, Geoff and all,
>>>>>>
>>>>>> I scrutinized the management server's logs during the time when I
>>>>>> shut down the host and the time when I turned the host back on.
>>>>>>
>>>>>> This is the management server's log from when the host was being
>>>>>> shut down:
>>>>>>
>>>>>> http://pastebin.com/4wfV830Z
>>>>>>
>>>>>> During that time, I noted that there are quite a lot of "Sending
>>>>>> Disconnect to listener" messages, which implies that the
>>>>>> management server tries to notify the listeners that the host is
>>>>>> going down. However, I subsequently didn't see any messages in the
>>>>>> logs showing that the management server tries to activate the HA
>>>>>> capability to start the affected VMs on another available host.
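[Editor's note: both Lennert's logs ("cannot be pinged, returning null ('I don't know')") and Indra's - where a Disconnected host never triggers HA - come down to the same thing: the investigators return a three-valued answer, and HA is scheduled only on a conclusive Down. The sketch below is a toy model of that decision as observed in this thread, not CloudStack's actual investigator chain; the function names and inputs are assumptions for illustration.]

```python
from enum import Enum

class HostState(Enum):
    UP = "Up"
    DOWN = "Down"
    UNKNOWN = "Unknown"   # the "I don't know" / null case in the logs

def investigate(agent_reachable, host_pingable):
    """Toy model of the tri-state verdict seen in the logs."""
    if agent_reachable:
        return HostState.UP
    if host_pingable:
        # Agent gone but the host still answers: peers can probe the
        # VMs, so a definite verdict (and hence HA) is possible.
        return HostState.DOWN
    # Host fully unreachable: no verdict is reached, and therefore no
    # HA - the surprising behaviour reported in this thread.
    return HostState.UNKNOWN

def should_schedule_ha(state):
    """HA is scheduled only on a conclusive DOWN, never on UNKNOWN."""
    return state is HostState.DOWN
```

This is why pulling the plug (agent and host both unreachable) leaves the VMs stranded: the unreachable case maps to UNKNOWN, and UNKNOWN never schedules HA.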
>>>>>>
>>>>>> This is the management server's log from when the host was being
>>>>>> turned back on:
>>>>>>
>>>>>> http://pastebin.com/JrLJxbXH
>>>>>>
>>>>>> When the agent reconnected, CloudStack marked the affected VMs as
>>>>>> stopped (previously running):
>>>>>>
>>>>>> ===
>>>>>> 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>>>>>> realState = Stopped
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>>>>>> realState = Stopped
>>>>>> 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM does not require investigation so
>>>>>> I'm marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
>>>>>> 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>>>>>> Stopping with event: StopRequestedvm's original host id: 28 new
>>>>>> host id: 34 host id before state transition: 34
>>>>>> ===
>>>>>>
>>>>>> Then the HA starts to kick in.
>>>>>>
>>>>>> ===
>>>>>> 2013-07-24 23:04:57,955 INFO [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
>>>>>> 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>>>>>> Stopping with event: StopRequestedvm's original host id: 28 new
>>>>>> host id: 34 host id before state transition: 34
>>>>>> 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
>>>>>> (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending { Cmd ,
>>>>>> MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100111,
>>>>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
>>>>>> 2013-07-24 23:04:57,968 INFO [cloud.ha.HighAvailabilityManagerImpl]
>>>>>> (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
>>>>>> 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
>>>>>> (HA-Worker-1:work-307) VM state transitted from :Stopped to
>>>>>> Starting with event: StartRequestedvm's original host id: 28 new
>>>>>> host id: null host id before state transition: null
>>>>>> 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Successfully transitioned to start state for
>>>>>> VM[User|Ubuntu-12-04-2-64bit] reservation id =
>>>>>> b56364ef-90d8-443f-a348-7660fda48d34
>>>>>> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and
>>>>>> podId: 6
>>>>>> 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null,
>>>>>> hosts: null
>>>>>> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Root volume is ready, need to place VM in
>>>>>> volume's cluster
>>>>>> 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>>>> (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing
>>>>>> deployment plan to use this pool's dcId: 6 , podId: 6 , and
>>>>>> clusterId: 6
>>>>>> ===
>>>>>>
>>>>>> My question is: why does HA only kick in when the host is turned
>>>>>> back on? By right, it should kick in soon after the host is shut
>>>>>> down and marked as "Disconnected".
>>>>>>
>>>>>> Any insights on the possible solutions to this problem are highly
>>>>>> appreciated.
>>>>>>
>>>>>> Looking forward to your reply, thank you.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>> On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>>> Hi Chip,
>>>>>>>
>>>>>>> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>>>>>>>
>>>>>>> Hi Geoff,
>>>>>>>
>>>>>>> Yes, I am using KVM. Is this a known issue, and is there any
>>>>>>> solution to this problem?
>>>>>>>
>>>>>>> Looking forward to your reply, thank you.
>>>>>>>
>>>>>>> Cheers.
>>>>>>>
>>>>>>> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom
>>>>>>> <geoff.higginbot...@shapeblue.com> wrote:
>>>>>>>> Is it running on KVM? We are seeing some real issues with HA
>>>>>>>> simply not working on KVM.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Geoff Higginbottom
>>>>>>>>
>>>>>>>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>>>>>>>>
>>>>>>>> geoff.higginbot...@shapeblue.com
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chip Childers [mailto:chip.child...@sungard.com]
>>>>>>>> Sent: 24 July 2013 16:37
>>>>>>>> To: <users@cloudstack.apache.org>
>>>>>>>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>>>>>>>>
>>>>>>>> Did you enable HA for your compute offering?
>>>>>>>>
>>>>>>>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> I tried to shut down one of my hypervisor hosts to simulate a
>>>>>>>>> server failure, and HA is not working: none of the VMs on the
>>>>>>>>> affected host are started on another available host.
>>>>>>>>>
>>>>>>>>> I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD
>>>>>>>>> for primary storage.
>>>>>>>>>
>>>>>>>>> My issue is similar to what is being described here:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>>>>>>>>>
>>>>>>>>> Except that in my case, the host is indeed marked as
>>>>>>>>> "Disconnected", but there is no attempt from CloudStack to try
>>>>>>>>> starting the VMs on another host. I can't provide logs, since
>>>>>>>>> there's nothing in the logs which suggests that CloudStack
>>>>>>>>> tries to activate HA and start the affected VMs on another
>>>>>>>>> host.
>>>>>>>>>
>>>>>>>>> Does anyone have a similar experience? Does anyone know if the
>>>>>>>>> above bug has been resolved?
>>>>>>>>>
>>>>>>>>> Looking forward to your reply, thank you.
>>>>>>>>>
>>>>>>>>> Cheers.
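[Editor's note: Koushik's first option upthread - leave a host whose state cannot be determined alone for a configurable window, then move it into a special state that the admin can act on by manually triggering HA - could be modelled as below. The state names and the 600-second default are assumptions for illustration; this is not an implemented CloudStack feature.]

```python
INDETERMINATE_GRACE = 600  # seconds; assumed configurable default

class HostStateTracker:
    """Tracks how long each host's state has been indeterminate and,
    past a grace window, escalates to a special state the admin can
    act on - instead of silently doing nothing forever."""

    def __init__(self, grace=INDETERMINATE_GRACE):
        self.grace = grace
        self._unknown_since = {}

    def report(self, host_id, state, now):
        """Feed each investigation result in; returns the effective
        state ('Unknown' becomes 'PendingAdminAction' after the
        grace window elapses)."""
        if state != "Unknown":
            # Any conclusive report clears a pending escalation.
            self._unknown_since.pop(host_id, None)
            return state
        first_seen = self._unknown_since.setdefault(host_id, now)
        if now - first_seen >= self.grace:
            return "PendingAdminAction"
        return "Unknown"
```

Combined with IPMI fencing, "PendingAdminAction" could even become an automatic power-off-then-HA step, addressing Bryan's objection that waiting for an admin is not high availability.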