[ https://issues.apache.org/jira/browse/CLOUDSTACK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gerard Lynch updated CLOUDSTACK-3421:
-------------------------------------

    Attachment: catalina_management-server.zip

Attached our management server catalina.out and management-server.log files.

Let me know if you require anything further.
                
> When hypervisor is down, no HA occurs with log output "Agent state cannot be 
> determined, do nothing"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3421
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM, Management Server
>    Affects Versions: 4.1.0
>         Environment: CentOS 6.4 minimal install
> Libvirt, KVM/Qemu
> CloudStack 4.1
> GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount 
> Point
> 3 physical servers
> * 1 management server, running NFS secondary storage
> ** 1 nic, management+storage
> * 2 hypervisor nodes, running glusterfs-server 
> ** 4x nic, management+storage, public, guest, gluster peering
> * Advanced zone
> * KVM
> * 4 networks: 
>  eth0: cloudbr0: management+secondary storage, 
>  eth2: cloudbr1: public
>  eth3: cloudbr2: guest
>  eth1: gluster peering
> * Shared Mount Point
> * custom network offering with redundant routers enabled
> * global settings tweaked to increase speed of identifying down state
> ** ping.interval: 10sec
>            Reporter: Gerard Lynch
>            Priority: Critical
>             Fix For: 4.1.1, 4.2.0, Future
>
>         Attachments: catalina_management-server.zip
>
>
> We wanted to test CloudStack's HA capabilities by simulating outages to find 
> out how long it would take to recover.  One of the tests was simulating loss 
> of a hypervisor node by shutting it down.   When we tested this, we found 
> that CloudStack failed to bring up any of the VMs (System or Instance), which 
> were on the down node, until the node was powered back up and reconnected.
> In the logs, we see repeating occurrences of:
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not 
> find exception: com.cloud.exception.OperationTimedoutException in error code 
> list for exceptions
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not 
> find exception: com.cloud.exception.OperationTimedoutException in error code 
> list for exceptions
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: 
> Timed out on Seq 14-660013135:  { Cmd , MgmtId: 93515041483, via: 14, Ver: 
> v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: 
> Timed out on Seq 15-1097531400:  { Cmd , MgmtId: 93515041483, via: 15, Ver: 
> v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed 
> out: Commands 660013135 to Host 14 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed 
> out: Commands 1097531400 to Host 15 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot 
> be determined, do nothing
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot 
> be determined, do nothing
> To reproduce: 
> 1. Build the environment as detailed above
> 2. Register an ISO
> 3. Create a new guest network using the custom network offering (that offers 
> redundant routers)
> 4. Provision an instance
> 5. Ensure the system VMs and instance are on the first hypervisor node
> 6. Shut down the first hypervisor node (or pull the plug)
> Expected result:
>   All system VMs and instance(s) should be brought up on the 2nd hypervisor 
> node.
> Actual result:
>   We see the first hypervisor node marked "disconnected."
>   All System VMs and the Instance are still marked "Running", but pings to 
> all of them fail. 
>   Ping to the redundant router on the 2nd hypervisor node is still working.
>   We see in the logs 
>   "INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not 
> find exception: com.cloud.exception.OperationTimedoutException in error code 
> list for exceptions"
>   Followed by
>   "Agent state cannot be determined, do nothing"
> Searching for "Cloudstack Agent state cannot be determined, do nothing" led 
> to: CLOUDSTACK-803 - https://reviews.apache.org/r/8853/
> This caused me some concern, because if I read the logic in that ticket 
> correctly, the management server will not perform any HA actions if it is 
> unable to determine the state of a hypervisor node.  In the scenario above, 
> it's not a loss of connectivity but an actual outage on the hypervisor, so 
> I would expect HA to occur.  Split brain is a concern, but I think that 
> something along the lines of "if the hypervisor can't see the management 
> server or its gateway, stop its instances" is more appropriate than "do 
> nothing".
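The self-fencing rule suggested above could be sketched as follows. This is a hypothetical illustration only, not CloudStack's actual agent code; the class and method names (`SelfFenceSketch`, `decide`) and the two reachability flags are invented for the example.

```java
// Hypothetical sketch of the proposed self-fencing rule: if the agent on a
// hypervisor can reach neither the management server nor its gateway, it is
// likely isolated or failing, so it should stop its instances rather than
// leave them in an undetermined state that blocks HA.
// All names here are illustrative, not part of CloudStack's real API.
public class SelfFenceSketch {
    enum Action { DO_NOTHING, STOP_INSTANCES }

    // Decide what the agent should do given its connectivity checks.
    static Action decide(boolean canReachMgmtServer, boolean canReachGateway) {
        if (!canReachMgmtServer && !canReachGateway) {
            // Truly isolated: fence ourselves so the management server
            // can safely restart our VMs on another host (HA).
            return Action.STOP_INSTANCES;
        }
        // Possibly just a management-plane blip: leave VMs running.
        return Action.DO_NOTHING;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false)); // STOP_INSTANCES
        System.out.println(decide(false, true));  // DO_NOTHING
    }
}
```

The point of the sketch is that self-fencing on the hypervisor side makes it safe for the management server to act on a "disconnected" host, instead of the current "Agent state cannot be determined, do nothing" behaviour.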
> I'm hoping this is something obvious and simple to resolve, because 
> otherwise it is a pretty serious issue: currently any accidental shutdown 
> or hardware fault will cause a continuous outage requiring manual 
> intervention to resolve.
> Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
