[ https://issues.apache.org/jira/browse/CLOUDSTACK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029099#comment-14029099 ]

Koushik Das commented on CLOUDSTACK-6857:
-----------------------------------------

Can you share the full logs? Based on the log snippet, none of the available 
investigators were able to determine whether the VM is alive. In such a case, 
components called 'fencers' try to fence off the VM. If the fencers also fail, 
nothing is done to the VM. The full logs will help in understanding everything 
that happened.
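
To make that flow concrete, here is a minimal sketch of the tri-state check 
in plain Java. The Investigator/Fencer types and the handle() method are 
illustrative stand-ins, not the actual CloudStack interfaces:

    import java.util.List;
    import java.util.function.Function;

    /**
     * Illustrative sketch of the tri-state liveness check described above.
     * All types here are simplified stand-ins, not actual CloudStack APIs.
     */
    public class HaCheckSketch {

        /** An investigator reports TRUE/FALSE if it can tell, null if it cannot. */
        record Investigator(String name, Function<String, Boolean> isVmAlive) {}

        /** A fencer tries to cut the VM off; returns true only on confirmed success. */
        record Fencer(String name, Function<String, Boolean> fence) {}

        static void handle(String vm, List<Investigator> investigators, List<Fencer> fencers) {
            Boolean alive = null;
            for (Investigator i : investigators) {
                alive = i.isVmAlive().apply(vm);
                System.out.printf("%s found %s to be alive? %s%n", i.name(), vm, alive);
                if (alive != null) break;            // a definite answer ends the chain
            }
            if (Boolean.TRUE.equals(alive)) return;  // VM is alive: nothing to do

            if (alive == null) {
                // No investigator could decide: try to fence the VM off so it
                // cannot keep writing to shared storage before HA acts on it.
                for (Fencer f : fencers) {
                    if (f.fence().apply(vm)) {
                        System.out.println(f.name() + " fenced off " + vm);
                        return;
                    }
                }
                // All fencers failed: nothing is done to the VM, as noted above.
                System.out.println("unable to fence " + vm + "; taking no action");
                return;
            }
            System.out.println(vm + " confirmed down; handing off to HA");
        }

        public static void main(String[] args) {
            // Reproduces the logged scenario: every investigator returns null
            // and the fencer fails, so the VM is left untouched.
            handle("cephvmstage013",
                   List.of(new Investigator("PingInvestigator", v -> null),
                           new Investigator("KVMInvestigator", v -> null)),
                   List.of(new Fencer("KVMFencer", v -> false)));
        }
    }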

> Losing the connection from CloudStack Manager to the agent will force a 
> shutdown when connection is re-established
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-6857
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-6857
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: Management Server
>    Affects Versions: 4.3.0
>         Environment: Ubuntu 12.04
>            Reporter: c-hemp
>            Priority: Critical
>
> If a physical host is not pingable, that host goes into alert mode. If the 
> physical host is unreachable, the virtual router is either unreachable or 
> unable to ping a virtual machine on that host, and the manager is unable to 
> ping the virtual instance, the manager assumes the instance is down and 
> puts it into a stopped state.
> When the connection is re-established, the manager reads the state from the 
> database, sees that the instance is now in a stopped state, and then shuts 
> the instance down.
> This behavior can cause major outages after any kind of network loss, 
> because the shutdown happens once connectivity comes back. This is 
> especially critical when running CloudStack across multiple colos.
> The logs when it happens:
> 2014-06-06 02:01:22,259 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) PingInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] 
> (HA-Worker-1:ctx-be848615 work-1953) Not a System Vm, unable to determine 
> state of VM[User|cephvmstage013] returning null
> 2014-06-06 02:01:22,259 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] 
> (HA-Worker-1:ctx-be848615 work-1953) Testing if VM[User|cephvmstage013] is 
> alive
> 2014-06-06 02:01:22,260 DEBUG [c.c.h.ManagementIPSystemVMInvestigator] 
> (HA-Worker-1:ctx-be848615 work-1953) Unable to find a management nic, cannot 
> ping this system VM, unable to determine state of VM[User|cephvmstage013] 
> returning null
> 2014-06-06 02:01:22,260 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) ManagementIPSysVMInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) KVMInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,263 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) HypervInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) KVMInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,419 INFO  [c.c.h.HighAvailabilityManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) HypervInvestigator found 
> VM[User|cephvmstage013]to be alive? null
> 2014-06-06 02:01:22,584 WARN  [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) Unable to actually stop 
> VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,585 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) VM[User|cephvmstage013] is stopped on 
> the host.  Proceeding to release resource held.
> 2014-06-06 02:01:22,648 WARN  [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) Unable to actually stop 
> VM[User|cephvmstage013] but continue with release because it's a force stop
> 2014-06-06 02:01:22,650 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) VM[User|cephvmstage013] is stopped on 
> the host.  Proceeding to release resource held.
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released network resources 
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,704 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-4:ctx-e8eea7fb work-1950) Successfully released storage resources 
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) Successfully released network resources 
> for the vm VM[User|cephvmstage013]
> 2014-06-06 02:01:22,774 DEBUG [c.c.v.VirtualMachineManagerImpl] 
> (HA-Worker-1:ctx-be848615 work-1953) Successfully released storage resources 
> for the vm VM[User|cephvmstage013]
> The behavior should change so that the instance is instead set into an 
> alert state; then, once connectivity is re-established, if the instance is 
> up, the manager should be updated with the running status (a sketch of 
> this approach follows below).
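
A minimal sketch of that proposed reconciliation, assuming a simple persisted 
state map; VmState, onHostUnreachable() and onReconnect() are hypothetical 
names for illustration, not CloudStack code:

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Sketch of the proposed behavior: mark VMs on an unreachable host as
     * ALERT rather than STOPPED, and reconcile against the hypervisor's
     * actual state when the agent reconnects. All names are hypothetical.
     */
    public class ReconnectReconcileSketch {

        enum VmState { RUNNING, STOPPED, ALERT }

        /** Persisted management-server view of each VM (stands in for the DB). */
        private final Map<String, VmState> dbState;

        ReconnectReconcileSketch(Map<String, VmState> dbState) {
            this.dbState = dbState;
        }

        /** Called while the host is unreachable and no investigator can decide. */
        void onHostUnreachable(String vm) {
            dbState.put(vm, VmState.ALERT);   // do NOT assume the VM is down
        }

        /** Called when the agent reconnects and reports what is actually running. */
        void onReconnect(String vm, boolean runningOnHost) {
            if (dbState.get(vm) == VmState.ALERT) {
                // Trust the hypervisor, not the stale DB row: a VM found
                // running is marked RUNNING instead of being shut down.
                dbState.put(vm, runningOnHost ? VmState.RUNNING : VmState.STOPPED);
            }
        }

        public static void main(String[] args) {
            ReconnectReconcileSketch mgr = new ReconnectReconcileSketch(new HashMap<>());
            mgr.onHostUnreachable("cephvmstage013");  // connectivity lost
            mgr.onReconnect("cephvmstage013", true);  // agent reports VM running
            // The VM ends up RUNNING again instead of being force-stopped.
        }
    }

The key change is that an undecidable liveness check parks the record in 
ALERT, so the reconnecting agent's report, rather than the stale database 
row, decides whether the instance keeps running.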



--
This message was sent by Atlassian JIRA
(v6.2#6252)
