On 25-07-13 07:48, Bryan Whitehead wrote:
Starting off, there is never going to be a way to "conclusively"
decide if a host is down. This is just the nature of complex systems.
We can only hope our software does "well" - and that, if "well" turns
out to be "wrong", we have a way to clean up the mess it created.

That said, I like the old behavior 3.0.x has. As I mentioned in
CLOUDSTACK-3535, I've had a host lose its network (e1000 oops in
kernel) and HA got triggered. The storage (in this case gluster using
a sharedmountpoint) wouldn't let qemu-kvm start on another host
because the underlying qcow2 file was locked by an already running
qemu-kvm process (on the machine that lost network). So HA being
triggered didn't ruin any VM disks. Gluster was running on Infiniband,
so the shared storage with working locks prevented HA from screwing
things up.

Further, even if gluster lost connectivity, gluster itself would
split-brain and later I could decide which qcow2/disk image should be
"truth". Do I keep the VM that kept on running? Or do I keep the
version HA booted and fscked? That's for me - the user - to decide.

As a cloudstack admin/user I understand the risks of HA and I choose
to live with them - I've even made sure that should such a disaster
happen I can recover (gluster will split brain as well). The #1 reason
for choosing HA is I want the VM to be available as much as possible.

Right now 4.1 DOES NOT have HA... I don't know how "emailing the admin
to figure out what to do" is being entertained as an option. That's
just nonsense and is NOT HIGH AVAILABILITY. IMHO, if one is so
terrified of HA screwing up, they should probably pass on HA and
manually start things up.

When a simple, reproducible test like pulling the plug on a host can't
trigger an HA event, then that feature doesn't exist. It is as simple
as that.

I would like to add that when testing this on our development cluster, something bizarre happened:

First, when I killed the VMs _and_ the agent on the host, HA worked just fine: after 10 minutes everything was restarted on a working host.

The second time, I turned off the host and nothing happened:

2013-07-25 15:31:41,347 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-3:null) host (192.168.122.32) cannot be pinged, returning null ('I don't know')
2013-07-25 15:31:41,348 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-3:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-3:null) null unable to determine the state of the host. Moving on.
2013-07-25 15:31:41,348 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-3:null) null unable to determine the state of the host. Moving on.
2013-07-25 15:31:41,349 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-3:null) Agent state cannot be determined, do nothing

So when the host is still pingable it's "OK" to do HA, but when it is totally unreachable it's not?

My third try was even worse. I killed the agent but forgot to kill the VMs; the management server restarted the VMs on another host, and it seems that all the images are now corrupted.

2013-07-25 15:37:31,614 DEBUG [agent.manager.AgentManagerImpl] (HA-Worker-2:work-29) Details from executing class com.cloud.agent.api.PingTestCommand:
PING 192.168.122.170 (192.168.122.170): 56 data bytes
64 bytes from 192.168.122.161: Destination Host Unreachable
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst Data
 4 5 00 5400 0000 0 0040 40 01 0cc4 192.168.122.161 192.168.122.170
--- 192.168.122.170 ping statistics ---
1 packets transmitted, 0 packets received, 100% packet loss
Unable to ping the vm, exiting
2013-07-25 15:37:31,614 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:work-29) VM[User|c88924e9-a8c9-4705-acc8-3237ffcf009d] could not be pinged, returning that it is unknown

Ping is disabled by default if you use security groups, so a ping test is not reliable.

Concluding that a VM is down based on a simple ping test is not the right approach when you use security groups, for example. (It's even dangerous.)

I will do some more tests, but if it's true that my last HA event was based on a failed ping, I will need to allow ping on all my production instances ASAP.
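For what it's worth, here is a hedged sketch of allowing ICMP into the
default security group with cloudmonkey (the parameters follow the
authorizeSecurityGroupIngress API call; the CIDR is an assumption based
on the management network in my logs, so adjust it to wherever the
pings actually come from):

  cloudmonkey authorize securitygroupingress \
      securitygroupname=default \
      protocol=ICMP icmptype=-1 icmpcode=-1 \
      cidrlist=192.168.122.0/24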

I do agree with Bryan that HA needs to happen automatically, without sysadmin intervention.

I think you could base an HA operation on:
- An unreachable agent
- An unpingable host
- A file with a timestamp on the network storage that is updated every X seconds; when it stops updating, something is wrong (see the sketch below)
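To illustrate the third idea, a minimal sketch in shell (the
mountpoint, interval and threshold are all assumptions, not anything
CloudStack ships with):

  # writer - runs on every hypervisor host
  HB_DIR=/mnt/primary/heartbeats   # assumed shared-storage mountpoint
  mkdir -p "$HB_DIR"
  while true; do
      date +%s > "$HB_DIR/$(hostname)"
      sleep 10
  done

  # checker - runs periodically wherever the HA decision is made; a stale
  # heartbeat only makes the host *suspect*, it should be fenced before HA
  now=$(date +%s)
  for f in "$HB_DIR"/*; do
      age=$(( now - $(cat "$f") ))
      [ "$age" -gt 30 ] && echo "suspect: $(basename "$f") last seen ${age}s ago"
  done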

Ideally the management server would turn off the host using IPMI to make sure it's dead; then you can be sure no corruption will happen.
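Something along these lines with ipmitool, for example (a sketch; the
BMC address and credentials are made up, and the host's out-of-band
interface must be reachable from the management server):

  # fence the suspect host before scheduling any HA work
  ipmitool -I lanplus -H 192.168.123.32 -U admin -P secret chassis power status
  ipmitool -I lanplus -H 192.168.123.32 -U admin -P secret chassis power off
  # only restart the VMs elsewhere once "Chassis Power is off" is reported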

On Wed, Jul 24, 2013 at 9:31 PM, Koushik Das <koushik....@citrix.com> wrote:
There is another bug for the same. CLOUDSTACK-3421
This document nicely explains how HA works in CloudStack:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/High+Availability+Developer's+Guide

As can be seen from the logs, in this case CloudStack is not able to 
conclusively determine whether the host is 'down', and so it does nothing. 
Suppose HA were done for the VMs in this case and the host later came back 
up: the same VMs would then be running twice against the same disks, which 
would corrupt them. That is not desirable.

Possible options:
- If the host state cannot be conclusively determined for some configurable 
time, put the host into a special state, so that the admin can take 
appropriate action by manually triggering HA
- Check whether a KVM cluster has the concept of something like a 'master' 
from which the state of any host in the cluster can be determined. Something 
similar exists for XS (XenServer).

Thoughts?


-----Original Message-----
From: Bryan Whitehead [mailto:dri...@megahappy.net]
Sent: Thursday, July 25, 2013 7:58 AM
To: users@cloudstack.apache.org
Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

CLOUDSTACK-3535 bug looks like it is describing the problem perfectly.
What else can we add?

On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers <chip.child...@sungard.com>
wrote:
This sucks.

Can one of the folks on this thread please open a bug with as much
information as possible?  I'd like to make sure that someone picks up
the issue and gets it resolved for the next release.



On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead
<dri...@megahappy.net>wrote:

This same thing happened to me - but it was a power supply that died
on a box. All my templates have HA turned on.

All the VMs (including 1 system-router-VM) were shown as "Running"
and the host itself was simply marked "Disconnected". When I tried to
shut down the VMs to start them again, I got errors about not being
able to communicate with the agent. I tried restarting the management
server but that didn't change anything.

Getting the router working again was extremely annoying. After
changing it to Stopped, CloudStack kept trying to start it again on
the dead host. I marked it destroyed, then restarted the network with
the force option. That fixed it. After I hacked the DB to flip all the
VMs that weren't actually running from state Running to Stopped, I was
able to start all the VMs that were down on the bad host.
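For reference, the DB hack was essentially this kind of query against
the 'cloud' database (a hedged sketch, not the exact statement I ran:
stop the management server and take a dump first, look up the real
host id, and note that column names can vary between versions):

  # mysql -u cloud -p cloud
  -- 42 is a hypothetical id; find the dead host's real id first
  SELECT id, name FROM host WHERE removed IS NULL;
  UPDATE vm_instance SET state = 'Stopped'
      WHERE host_id = 42 AND state = 'Running';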

Anyway, the time between host death and me finding out was about 4
days - these were on managed servers of a customer, and their
monitoring of each host wasn't working. They were pretty unhappy. :(

Other notes: this is KVM with sharedmountpoint on a gluster mount.
After the host got back online, gluster resynced about 200GB of data -
I migrated VMs to the host at the same time as normal. I've had
similar things happen with a 3.0.2 install of CloudStack and everything
seamlessly restarted. Disappointing this happened with 4.1.

On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <in...@sg.or.id> wrote:
Dear Chip, Geoff and all,

I scrutinized the management server's logs from the time when I shut
down the host until the time when I turned the host back on.

These are the management server's logs from when the host was being shut down:

http://pastebin.com/4wfV830Z

During that time, I noticed quite a lot of "Sending Disconnect to
listener" messages, which implies that the management server tries to
notify other listeners that the host is going down. However, I
subsequently didn't see any messages in the logs showing that the
management server tried to activate the HA capability and start the
affected VMs on another available host.

These are the management server's logs from when the host was being turned back on:

http://pastebin.com/JrLJxbXH

When the agent reconnected, CloudStack marked the affected VMs as
Stopped (previously Running):

===
2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and realState = Stopped
2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and realState = Stopped
2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentConnectTaskPool-7:null) VM does not require investigation so I'm marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl] (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 28 new host id: 34 host id before state transition: 34
===

Then the HA starts to kick in.

===
2013-07-24 23:04:57,955 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl] (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping with event: StopRequestedvm's original host id: 28 new host id: 34 host id before state transition: 34
2013-07-24 23:04:57,960 DEBUG [agent.transport.Request] (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100111, [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
2013-07-24 23:04:57,968 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl] (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with event: StartRequestedvm's original host id: 28 new host id: null host id before state transition: null
2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Successfully transitioned to start state for VM[User|Ubuntu-12-04-2-64bit] reservation id = b56364ef-90d8-443f-a348-7660fda48d34
2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's cluster
2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl] (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
===

My question is: why does HA only kick in when the host is turned back
on? By rights it should kick in soon after the host is shut down and
marked as "Disconnected".

Any insights on possible solutions to this problem are highly
appreciated.

Looking forward to your reply, thank you.

Cheers.



On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <in...@sg.or.id>
wrote:

Hi Chip,

Yes, "Offer HA" is set to "Yes" on all my compute offerings.

Hi Geoff,

Yes, I am using KVM. Is this a known issue and is there any
solution to this problem?

Looking forward to your reply, thank you.

Cheers.



On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
geoff.higginbot...@shapeblue.com> wrote:

Is it running on KVM? We are seeing some real issues with HA
simply not working on KVM.

Regards

Geoff Higginbottom

D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581

geoff.higginbot...@shapeblue.com

-----Original Message-----
From: Chip Childers [mailto:chip.child...@sungard.com]
Sent: 24 July 2013 16:37
To: <users@cloudstack.apache.org>
Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor
hosts

Did you enable HA for your compute offering?

On Jul 24, 2013, at 11:25 AM, Indra Pramana <in...@sg.or.id> wrote:

Dear all,

I tried to shut down one of my hypervisor hosts to simulate a
server failure, and HA is not working: none of the VMs on the
affected host are started on another available host.

I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD
for primary storage.

My issue is similar to what is being described here:

https://issues.apache.org/jira/browse/CLOUDSTACK-3535

Except that in my case, the host is indeed marked as
"Disconnected", but there is no attempt from CloudStack to start
the VMs on another host. I can't provide logs, since there's
nothing in the logs suggesting that CloudStack tried to activate
HA and start the affected VMs on another host.

Does anyone have a similar experience? Does anyone know if the
above bug has been resolved?

Looking forward to your reply, thank you.

Cheers.





