[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128266#comment-14128266
 ] 

Remi Bergsma edited comment on CLOUDSTACK-7184 at 9/10/14 9:34 AM:
-------------------------------------------------------------------

@[~dahn] Now that you're working on this script, please also look at 
CLOUDSTACK-7527 (https://issues.apache.org/jira/browse/CLOUDSTACK-7527). Thx!



was (Author: remibergsma):
[~dahn] Now that you're working on this script, please also look at 
CLOUDSTACK-7527 (https://issues.apache.org/jira/browse/CLOUDSTACK-7527). Thx!


> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA 
> on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> A hypervisor was isolated for 30 seconds due to a network issue. CloudStack 
> detected this, marked the host as down, and immediately started HA. Just 
> 18 seconds later the hypervisor returned, and we ended up with 5 VMs that 
> were running on two hypervisors at the same time. 
> This, of course, resulted in file system corruption and the loss of the VMs. 
> One side of the story is why XenServer allowed this to happen (I will not 
> bother you with that one). The CloudStack side of the story: HA should only 
> start after the host has been down for at least xen.heartbeat.interval 
> seconds. If the host is down long enough, the Xen heartbeat script will 
> fence the hypervisor and prevent corruption. If it is not down long enough, 
> nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN  [c.c.a.m.DirectAgentAttache] 
> (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] 
> (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX.  Starting HA on 
> the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager 
> Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = 
> AgentDisconnected, Host id = 505, name = mccpvmXX]
> CS marks host down: 2014-07-25 05:03:31,920
> CS marks host up:   2014-07-25 05:03:49,655
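
To make the requested behaviour concrete, here is a minimal sketch of the check described above, using the two timestamps from the log summary. It is not CloudStack code: the class and method names are hypothetical, and the 60-second value used for xen.heartbeat.interval is an assumption for illustration only (the report only implies that the interval is longer than the outage).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of CloudStack.
public class HaGracePeriod {
    // Timestamp layout used in the log excerpt, e.g. "2014-07-25 05:03:31,920".
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Time between two log timestamps.
    static Duration elapsed(String from, String to) {
        return Duration.between(LocalDateTime.parse(from, LOG_TS),
                                LocalDateTime.parse(to, LOG_TS));
    }

    // HA may start only once the host has been down for at least the
    // heartbeat interval, so that the heartbeat script has had time to fence it.
    static boolean mayStartHa(Duration downFor, Duration heartbeatInterval) {
        return downFor.compareTo(heartbeatInterval) >= 0;
    }

    public static void main(String[] args) {
        // Assumed value for xen.heartbeat.interval; check your own global settings.
        Duration heartbeatInterval = Duration.ofSeconds(60);

        Duration downFor = elapsed("2014-07-25 05:03:31,920",
                                   "2014-07-25 05:03:49,655");

        System.out.println("host was marked down for " + downFor.toMillis() + " ms");
        System.out.println("HA allowed: " + mayStartHa(downFor, heartbeatInterval));
    }
}
```

With these inputs the host was marked down for 17735 ms, which is shorter than the assumed interval, so the check would hold HA back. That matches the "nothing should happen" case in the description.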



