[ https://issues.apache.org/jira/browse/CLOUDSTACK-7184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14128253#comment-14128253 ]
Remi Bergsma commented on CLOUDSTACK-7184:
------------------------------------------

+1!

> HA should wait for at least 'xen.heartbeat.interval' sec before starting HA on vm's when host is marked down
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7184
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7184
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: Hypervisor Controller, Management Server, XenServer
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0
>         Environment: CloudStack 4.3 with XenServer 6.2 hypervisors
>            Reporter: Remi Bergsma
>            Assignee: Daan Hoogland
>            Priority: Blocker
>
> Hypervisor got isolated for 30 seconds due to a network issue. CloudStack did
> discover this and marked the host as down, and immediately started HA. Just
> 18 seconds later the hypervisor returned and we ended up with 5 vm's that
> were running on two hypervisors at the same time.
> This, of course, resulted in file system corruption and the loss of the vm's.
> One side of the story is why XenServer allowed this to happen (will not
> bother you with this one). The CloudStack side of the story: HA should only
> start after at least xen.heartbeat.interval seconds. If the host is down long
> enough, the Xen heartbeat script will fence the hypervisor and prevent
> corruption. If it is not down long enough, nothing should happen.
> Logs (short):
> 2014-07-25 05:03:28,596 WARN [c.c.a.m.DirectAgentAttache] (DirectAgent-122:ctx-690badc5) Unable to get current status on 505(mccpvmXX)
> .....
> 2014-07-25 05:03:31,920 ERROR [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-11b9af3e) Host is down: 505-mccpvmXX. Starting HA on the VMs
> .....
> 2014-07-25 05:03:49,655 DEBUG [c.c.h.Status] (ClusteredAgentManager Timer:ctx-0e00979c) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 505, name = mccpvmXX]
> cs marks host down: 2014-07-25 05:03:31,920
> cs marks host up: 2014-07-25 05:03:49,655
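
To make the timing in the quoted logs concrete, here is a minimal sketch of the grace-period check the issue asks for. It is not CloudStack code: `should_start_ha` and `parse_log_time` are hypothetical helpers, and the 60-second value only stands in for whatever `xen.heartbeat.interval` is set to, since the issue does not state the configured value. The two timestamps are the "marks host down" and "marks host up" times from the logs.

```python
from datetime import datetime, timedelta

# Assumption: 60 s stands in for the configured 'xen.heartbeat.interval';
# the issue does not say which value the affected cluster used.
HEARTBEAT_INTERVAL = timedelta(seconds=60)

# Management-server log timestamps look like '2014-07-25 05:03:31,920'.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse a log timestamp; the three digits after the comma are milliseconds."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def should_start_ha(marked_down_at: datetime,
                    now: datetime,
                    host_is_back: bool,
                    grace: timedelta = HEARTBEAT_INTERVAL) -> bool:
    """Hypothetical check: start HA only if the host is still down
    once the full grace period has elapsed since it was marked down."""
    if host_is_back:
        return False
    return now - marked_down_at >= grace


down = parse_log_time("2014-07-25 05:03:31,920")
up = parse_log_time("2014-07-25 05:03:49,655")
outage = up - down

print(f"host was marked down for {outage.total_seconds():.3f} s")
print("HA at the moment the host is marked down:",
      should_start_ha(down, down, host_is_back=False))
print("HA once the grace period has passed and the host is still down:",
      should_start_ha(down, down + HEARTBEAT_INTERVAL, host_is_back=False))
```

With a check like this, the outage in the logs (under 18 seconds) would never have reached the point where HA starts, because the host was back well before the assumed interval expired.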