[ https://issues.apache.org/jira/browse/CLOUDSTACK-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714422#comment-13714422 ]
Gerard Lynch commented on CLOUDSTACK-3421:
------------------------------------------

There seems to be a requirement for such things as quorum and fencing, then. A host going down with no HA taking place is a rather big negative. Are these things on the roadmap?

Thanks

> When hypervisor is down, no HA occurs with log output "Agent state cannot be determined, do nothing"
> ----------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-3421
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3421
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>          Components: KVM, Management Server
>    Affects Versions: 4.1.0
>         Environment: CentOS 6.4 minimal install
> Libvirt, KVM/Qemu
> CloudStack 4.1
> GlusterFS 3.2, replicated+distributed as primary storage via Shared Mount Point
> 3 physical servers:
> * 1 management server, running NFS secondary storage
> ** 1 nic: management+storage
> * 2 hypervisor nodes, running glusterfs-server
> ** 4x nic: management+storage, public, guest, gluster peering
> * Advanced zone
> * KVM
> * 4 networks:
>   eth0: cloudbr0: management + secondary storage
>   eth2: cloudbr1: public
>   eth3: cloudbr2: guest
>   eth1: gluster peering
> * Shared Mount Point
> * custom network offering with redundant routers enabled
> * global settings tweaked to increase speed of identifying down state
> ** ping.interval: 10 sec
>            Reporter: Gerard Lynch
>            Priority: Critical
>             Fix For: 4.1.1, 4.2.0, Future
>
>         Attachments: catalina_management-server.zip
>
>
> We wanted to test CloudStack's HA capabilities by simulating outages to find out how long it would take to recover. One of the tests simulated the loss of a hypervisor node by shutting it down. When we tested this, we found that CloudStack failed to bring up any of the VMs (System or Instance) which were on the down node until the node was powered back up and reconnected.
> In the logs, we see repeating occurrences of:
>
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
> INFO  [utils.exception.CSExceptionErrorCode] (AgentTaskPool-10:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-11:) Seq 14-660013135: Timed out on Seq 14-660013135: { Cmd , MgmtId: 93515041483, via: 14, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentAttache] (AgentTaskPool-10:) Seq 15-1097531400: Timed out on Seq 15-1097531400: { Cmd , MgmtId: 93515041483, via: 15, Ver: v1, Flags: 100011, [{"CheckHealthCommand":{"wait":50}}] }
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Operation timed out: Commands 660013135 to Host 14 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Operation timed out: Commands 1097531400 to Host 15 timed out after 100
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-11:) Agent state cannot be determined, do nothing
> WARN  [agent.manager.AgentManagerImpl] (AgentTaskPool-10:) Agent state cannot be determined, do nothing
>
> To reproduce:
> 1. Build the environment as detailed above
> 2. Register an ISO
> 3. Create a new guest network using the custom network offering (that offers redundant routers)
> 4. Provision an instance
> 5. Ensure the system VMs and instance are on the first hypervisor node
> 6. Shut down the first hypervisor node (or pull the plug)
>
> Expected result:
> All system VMs and instance(s) should be brought up on the 2nd hypervisor node.
>
> Actual result:
> The first hypervisor node is marked "Disconnected."
> All System VMs and the Instance are still marked "Running", however ping to any of them fails.
> Ping to the redundant router on the 2nd hypervisor node is still working.
> We see in the logs:
>
> "INFO [utils.exception.CSExceptionErrorCode] (AgentTaskPool-11:) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions"
>
> followed by:
>
> "Agent state cannot be determined, do nothing"
>
> Searching for "CloudStack Agent state cannot be determined, do nothing" led to CLOUDSTACK-803 - https://reviews.apache.org/r/8853/
>
> This caused me some concern, because if I read the logic in that ticket correctly, the management server will not perform any HA actions if it is unable to determine the state of a hypervisor node. In the scenario above it's not a loss of connectivity but an actual outage on the hypervisor, so I'd rather like HA to occur. Split brain is a concern, but I think that something along the lines of "if the hypervisor can't see the management server or gateway, stop instances" is more appropriate than "do nothing".
>
> I'm hoping this is something really obvious and simple to resolve, because otherwise this is a pretty serious issue: currently any accidental shutdown or hardware fault will cause a continuous outage requiring manual action to resolve.
>
> Thanks

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
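The mitigation the reporter proposes ("if the hypervisor can't see management or gateway, stop instances") could in principle be sketched as a host-side self-fencing check like the one below. This is purely an illustrative sketch, not CloudStack code: the `reachable` and `fencing_decision` helpers, the ping parameters, and the decision rule are all assumptions introduced here to show the idea, under the assumption that a host which can reach neither the management server nor its default gateway is the isolated side of a partition.

```python
#!/usr/bin/env python3
"""Illustrative self-fencing sketch (NOT CloudStack code).

Idea from the report: a hypervisor that can reach neither the
management server nor its default gateway should assume it is
isolated and stop its own guests, so the surviving side can safely
restart them without risking split brain.
"""
import subprocess


def reachable(host: str, count: int = 3, timeout_s: int = 2) -> bool:
    """Return True if `host` answers at least one ICMP echo request."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def fencing_decision(mgmt_reachable: bool, gateway_reachable: bool) -> str:
    """Decide what an apparently isolated host should do.

    - Either endpoint reachable: probably not partitioned; leave the
      guests alone and let the management server make the call.
    - Neither reachable: assume isolation and self-fence by stopping
      local guests, so HA can restart them on a healthy host.
    """
    if mgmt_reachable or gateway_reachable:
        return "keep-running"
    return "stop-local-vms"


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    decision = fencing_decision(
        reachable("192.168.0.10"),  # management server
        reachable("192.168.0.1"),   # default gateway
    )
    print(decision)
```

A host-side agent might run such a check every ping.interval and, on a "stop-local-vms" decision, shut down its local guests (e.g. via libvirt), which would let the management server's HA worker restart them on the surviving node instead of concluding "Agent state cannot be determined, do nothing".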