[ https://issues.apache.org/jira/browse/CLOUDSTACK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029099#comment-14029099 ]
Koushik Das commented on CLOUDSTACK-6857:
-----------------------------------------

Can you share the full logs? Based on the log snippet, none of the available investigators were able to determine whether the VM is alive. In such a case, components called 'fencers' try to fence off the VM. If the fencers fail, nothing is done to the VM. Full logs will help in understanding everything that happened.

> Losing the connection from CloudStack Manager to the agent will force a shutdown when connection is re-established
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6857
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6857
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: Management Server
>    Affects Versions: 4.3.0
>        Environment: Ubuntu 12.04
>           Reporter: c-hemp
>           Priority: Critical
>
> If a physical host is not pingable, that host goes into alert mode. If the physical host is unreachable, the virtual router is either unreachable or unable to ping a virtual instance on that host, and the management server is also unable to ping the virtual instance, the manager assumes the instance is down and puts it into a stopped state.
> When the connection is re-established, the manager reads the state from the database, sees that the instance is now in a stopped state, and then shuts the instance down.
> This behavior can cause major outages if there is any type of network loss once connectivity comes back. This is especially critical when using CloudStack across multiple colos.
> The logs when it happens:
> 2014-06-06 02:01:22,259 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) PingInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Not a System Vm, unable to determine state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Testing if VM[User|cephvmstage013] is alive
> 2014-06-06 02:01:22,260 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] (HA-Worker-1:ctx-be848615 work-1953) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,260 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) ManagementIPSysVMInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) KVMInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,263 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) HypervInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) KVMInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,419 INFO [c.c.h.HighAvailabilityManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) HypervInvestigator found VM[User|cephvmstage013] to be alive? null
> 2014-06-06 02:01:22,584 WARN [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,585 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) VM[User|cephvmstage013] is stopped on the host. Proceeding to release resource held.
> 2014-06-06 02:01:22,648 WARN [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Unable to actually stop VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,650 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) VM[User|cephvmstage013] is stopped on the host. Proceeding to release resource held.
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released network resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released storage resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released network resources for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] (HA-Worker-1:ctx-be848615 work-1953) Successfully released storage resources for the vm VM[User|cephvmstage013]
> The behavior should change: put the instance into an alert state instead, and then, once connectivity is re-established, if the instance is up, update the manager with the running status.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
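The investigator/fencer fallback described in the comment above can be sketched roughly as follows. This is a simplified illustration, not the actual CloudStack implementation: the `Investigator`, `Fencer`, and `HaWorker` types here are hypothetical stand-ins. Each investigator reports alive (true), dead (false), or unknown (null, as seen throughout the log snippet); only when every investigator returns null do the fencers run, and if they all fail, nothing is done to the VM.

```java
import java.util.List;

// Hypothetical stand-in: reports true (alive), false (dead), or null (unknown).
interface Investigator {
    Boolean isVmAlive(String vm);
}

// Hypothetical stand-in: returns true if the VM was successfully fenced off.
interface Fencer {
    boolean fenceOff(String vm);
}

class HaWorker {
    static String handle(String vm, List<Investigator> investigators,
                         List<Fencer> fencers) {
        // Ask each investigator in turn; the first definite answer wins.
        for (Investigator inv : investigators) {
            Boolean alive = inv.isVmAlive(vm);
            if (alive != null) {
                return alive ? "alive" : "dead";
            }
        }
        // Every investigator returned null (as in the logs above): try fencing.
        for (Fencer f : fencers) {
            if (f.fenceOff(vm)) {
                return "fenced";
            }
        }
        // Fencers failed too: nothing is done to the VM.
        return "unknown";
    }
}
```

In the logs above, PingInvestigator, ManagementIPSysVMInvestigator, KVMInvestigator, and HypervInvestigator all answer "alive? null", which corresponds to the fallthrough past the first loop here.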