[ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128206#comment-14128206 ]

Brenn Oosterbaan edited comment on CLOUDSTACK-7184 at 9/10/14 8:09 AM:
-----------------------------------------------------------------------

"I've seen similar with KVM - I'm not sure this is necessarily tied to Xen? I'd 
suggest that possibly CS be a little more thorough before deciding a VM is 
down...maybe via channels other than the agent/VR?"

John is right on the money here. Although the patch committed by Daan does add 
the ability to specify a check interval for the Xen storage heartbeat script 
(instead of using the default of 5 seconds), it does not address the root cause 
of this issue.

There are two mechanisms at work here: the Xen heartbeat script, which checks 
whether the storage is reachable from a specific hypervisor, and CloudStack 
itself, which determines whether a hypervisor is up or not.
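To make the first mechanism concrete, here is a minimal sketch of the idea 
behind the storage heartbeat. The real check is a shell script running on the 
XenServer host; the file path, names and structure below are assumptions for 
illustration only, not the actual script:

// Illustrative sketch only -- the real check is a shell script on the
// XenServer host; paths and names here are assumptions, not the actual code.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

public class StorageHeartbeatSketch {
    public static void main(String[] args) throws InterruptedException {
        Path hbFile = Path.of("/var/run/sr-mount/SR-UUID/hb-HOST-UUID"); // hypothetical marker file
        long toleratedSec = 180; // the window we configured via the heartbeat interval
        Instant lastGoodWrite = Instant.now();

        while (true) {
            try {
                // Storage reachable: refresh the heartbeat marker on the shared SR.
                Files.writeString(hbFile, Instant.now().toString());
                lastGoodWrite = Instant.now();
            } catch (IOException storageUnreachable) {
                long downFor = Instant.now().getEpochSecond() - lastGoodWrite.getEpochSecond();
                if (downFor > toleratedSec) {
                    // Storage gone longer than tolerated: fence (reboot) this host
                    // so its VMs can safely be started elsewhere.
                    System.out.println("fencing host after " + downFor + "s without storage");
                    break; // the real script would trigger a hard reboot here
                }
            }
            Thread.sleep(5_000); // re-check every few seconds
        }
    }
}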

When we set the Xen heartbeat interval to 180 seconds we basically said: it is 
OK for VMs living on a hypervisor to 'hang' for 180 seconds during storage 
fail-overs or other issues.
CloudStack has its own checks to determine whether a hypervisor is down, and 
those checks are not in line with the Xen heartbeat interval. This means that 
even though we decided 180 seconds of unavailability is fine, CloudStack tries 
to connect to the hypervisor 3 times (in ~30 seconds), then decides it is down 
and starts the VMs on another hypervisor.
That is the issue/bug Remi meant to identify when filing this ticket.
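A minimal sketch of that timing mismatch follows. The retry count and spacing 
are assumptions taken from the behaviour described above; the actual logic 
lives in the management server's agent handling code and does not look like 
this:

// Illustrative sketch of the timing mismatch, not actual CloudStack code:
// the management server gives up on a host after a few short retries,
// long before the 180-second storage heartbeat window has passed.
public class HostDownDecisionSketch {

    // Hypothetical stand-in for the management server's host check.
    static boolean pingHost(String host) {
        return false; // pretend the host stays unreachable during a storage fail-over
    }

    public static void main(String[] args) throws InterruptedException {
        int maxRetries = 3;          // roughly what CloudStack does today
        long retryIntervalSec = 10;  // ~30 seconds in total (assumed spacing)
        long xenHeartbeatWindowSec = 180;

        long waitedSec = 0;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (pingHost("mccpvmXX")) {
                return; // host answered, nothing to do
            }
            Thread.sleep(retryIntervalSec * 1000);
            waitedSec += retryIntervalSec;
        }

        // Here CloudStack declares the host down and starts HA, even though the
        // host is only guaranteed to have fenced itself after the full window.
        System.out.printf("host marked down after ~%ds, fencing only certain after %ds%n",
                waitedSec, xenHeartbeatWindowSec);
    }
}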

I personally feel there should be two additional options: 
hypervisor.heartbeat.interval and hypervisor.heartbeat.max_retry.
This would allow us to, for instance, set the interval to 15 seconds and 
max_retry to 12, which also adds up to 180 seconds. 
Since the default heartbeat timeout is 60 seconds, I would set the defaults for 
these two to a combination that adds up to 60 seconds as well. Otherwise you 
can never be sure the hypervisor itself has actually rebooted, and VM 
corruption could still take place.
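The arithmetic behind the proposal, written out with the two settings as 
hypothetical values (neither global exists today):

// The proposed globals are hypothetical -- hypervisor.heartbeat.interval and
// hypervisor.heartbeat.max_retry do not exist yet; this only shows the arithmetic.
public class ProposedHeartbeatDefaults {
    public static void main(String[] args) {
        int heartbeatIntervalSec = 15;   // proposed hypervisor.heartbeat.interval
        int heartbeatMaxRetry    = 12;   // proposed hypervisor.heartbeat.max_retry
        int xenFencingWindowSec  = 180;  // what the hypervisor tolerates before fencing itself

        int totalWaitSec = heartbeatIntervalSec * heartbeatMaxRetry; // 15 * 12 = 180

        // HA is only safe if CloudStack waits at least as long as the fencing window:
        // by then the host has either fenced (rebooted) itself or come back.
        if (totalWaitSec >= xenFencingWindowSec) {
            System.out.println("safe: HA may start after " + totalWaitSec + "s");
        } else {
            System.out.println("unsafe: HA could start before the host has fenced itself");
        }
    }
}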

regards,

Brenn



> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did 
> discover this and marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the vm's. 
> One side of the story is why XenServer allowed this to happen (will not 
> bother you with this one). The CloudStack side of the story: HA should only 
> start after at least xen.heartbeat.interval seconds. If the host is down long 
> enough, the Xen heartbeat script will fence the hypervisor and prevent 
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25  05:03:31,920
> cs marks host up:     2014-07-25  05:03:49,655



