Github user marcaurele commented on the issue: https://github.com/apache/cloudstack/pull/1640

Here is the long explanation of the bug we hit in production. Twice, the VPN between two of our zones went down, which cut the communication between the management server and all the agents in the other zone. After a few minutes the network came back, and as the agents were reconnecting we started to see VMs being shut down, even though we don't have HA enabled. We run two management servers behind haproxy, which balances the agents and requests. When the network came back, the agent reconnected to management server 01, but server 02 had already started to issue commands to shut down the VMs. The important lines are below; I kept the lines showing one VM, `111750`, being shut down:
```
srv02 2016-08-08 11:56:03,854 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) PingInvestigator was able to determine host 44692 is in Disconnected
srv02 2016-08-08 11:56:03,855 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Agent is disconnected but the host is still up: 44692-virt-hv041
srv02 2016-08-08 11:56:03,857 WARN [apache.cloudstack.alerts] (AgentTaskPool-16:ctx-8b5b6956) alertType:: 7 // dataCenterId:: 2 // podId:: 2 // clusterId:: null // message:: Host disconnected, name: virt-hv041 (id:44692), availability zone: ch-dk-2, pod: DK2-AZ1-POD01
srv02 2016-08-08 11:56:03,884 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Host 44692 is disconnecting with event AgentDisconnected
srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current status is Up
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Remove Agent : 44692
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900831481: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900834802: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900886805: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
...
srv02 2016-08-08 11:56:03,979 DEBUG [cloud.network.NetworkUsageManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Disconnected called on 44692 with status Alert
srv02 2016-08-08 11:56:03,987 DEBUG [cloud.host.Status] (AgentTaskPool-16:ctx-8b5b6956) Transition:[Resource state = Enabled, Agent event = AgentDisconnected, Host id = 44692, name = virt-hv041]
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Forwarding [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}] to 345049010805
srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Dispatch ->44692, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Intercepting command for agent change: agent 44692 event: AgentDisconnected
srv01 2016-08-08 11:56:04,002 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Received agent disconnect event for host 44692
srv02 2016-08-08 11:56:04,002 DEBUG [cloud.cluster.ClusterManagerImpl] (Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805 completed. time: 3ms. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: [{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
srv01 2016-08-08 11:56:04,004 DEBUG [agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Not processing AgentDisconnected event for the host id=44692 as the host is directly connected to the current management server 345049010805
srv02 2016-08-08 11:56:04,004 WARN [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Scheduling restart for VMs on host 44692-virt-hv041
...
srv02 2016-08-08 11:56:04,286 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Notifying HA Mgr of to restart vm 111750-i-8114-111750-VM
srv02 2016-08-08 11:56:04,296 INFO [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-16:ctx-8b5b6956) Schedule vm for HA: VM[User|i-8114-111750-VM]
...
srv01 2016-08-08 11:56:17,394 DEBUG [agent.manager.AgentManagerImpl] (StatsCollector-4:ctx-dcd63254) Can not send command com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
srv02 2016-08-08 11:56:17,395 DEBUG [agent.manager.AgentManagerImpl] (StatsCollector-3:ctx-8cb3d7fb) Can not send command com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
srv01 2016-08-08 11:56:37,825 DEBUG [agent.manager.AgentManagerImpl] (AgentManager-Handler-13:null) Ping from 44692
srv01 2016-08-08 11:56:37,826 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) Process host VM state report from ping process. host: 44692
srv01 2016-08-08 11:56:37,886 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) Process VM state report. host: 44692, number of records in report: 26
srv01 2016-08-08 11:56:37,891 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) VM state report is updated. host: 44692, vm id: 81688, power state: PowerOn
...
srv01 2016-08-08 11:57:38,581 DEBUG [cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-1:null) VM state report is updated. host: 44692, vm id: 111750, power state: PowerOn
...
srv01 2016-08-08 12:00:50,544 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
srv01 2016-08-08 12:00:50,555 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) HA on VM[User|i-8114-111750-VM]
srv01 2016-08-08 12:00:50,565 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Sending { Cmd , MgmtId: 345049010805, via: 44692(virt-hv041), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-8114-111750-VM","wait":20}}] }
...
srv01 2016-08-08 12:00:50,658 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Received: { Ans: , MgmtId: 345049010805, via: 44692, Ver: v1, Flags: 10, { CheckVirtualMachineAnswer } }
srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-2:ctx-f9c5c881 work-858) Agent responded with state Running
srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling because the host is not up but the vm is alive
srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) VM i-8114-111750-VM is found to be alive by SimpleInvestigator
srv01 2016-08-08 12:00:50,658 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:01:51 CEST 2016
srv01 2016-08-08 12:00:50,658 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-f9c5c881 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? true
...
srv01 2016-08-08 12:02:35,374 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
srv01 2016-08-08 12:02:35,381 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) HA on VM[User|i-8114-111750-VM]
...
srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-4:ctx-ee61de22 work-858) Agent responded with state Running
srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Rescheduling because the host is not up but the vm is alive
srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) VM i-8114-111750-VM is found to be alive by SimpleInvestigator
srv01 2016-08-08 12:02:35,478 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) Rescheduling HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:03:36 CEST 2016
srv01 2016-08-08 12:02:35,478 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-4:ctx-ee61de22 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? true
...
srv02 2016-08-08 12:03:46,230 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Processing HAWork[858-HA-111750-Running-Investigating]
srv02 2016-08-08 12:03:46,237 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) HA on VM[User|i-8114-111750-VM]
srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.CheckOnAgentInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Unable to reach the agent for VM[User|i-8114-111750-VM]: Resource [Host:44692] is unreachable: Host 44692: Host with specified id is not in the right state: Alert
srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) testing if VM[User|i-8114-111750-VM] is alive
srv02 2016-08-08 12:03:46,246 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) SimpleInvestigator found VM[User|i-8114-111750-VM]to be alive? null
srv02 2016-08-08 12:03:46,246 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) XenServerInvestigator found VM[User|i-8114-111750-VM]to be alive? null
srv02 2016-08-08 12:03:46,259 DEBUG [agent.manager.AgentManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Host with id null doesn't exist
srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Not a System Vm, unable to determine state of VM[User|i-8114-111750-VM] returning null
srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Testing if VM[User|i-8114-111750-VM] is alive
srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Returning null since we're unable to determine state of VM[User|i-8114-111750-VM]
srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) VM[User|i-8114-111750-VM] could not be pinged, returning that it is unknown
srv02 2016-08-08 12:03:46,259 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) PingInvestigator found VM[User|i-8114-111750-VM]to be alive? null
srv01 2016-08-08 12:03:46,261 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-3:ctx-e1a90562 work-859) HA on VM[User|i-10355-116755-VM]
srv02 2016-08-08 12:03:46,264 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) Unable to find a management nic, cannot ping this system VM, unable to determine state of VM[User|i-8114-111750-VM] returning null
srv02 2016-08-08 12:03:46,265 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) ManagementIPSysVMInvestigator found VM[User|i-8114-111750-VM]to be alive? null
srv01 2016-08-08 12:03:46,266 DEBUG [agent.transport.Request] (HA-Worker-3:ctx-e1a90562 work-859) Seq 44680-8591460714139353168: Sending { Cmd , MgmtId: 345049010805, via: 44680(virt-hv029), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-10355-116755-VM","wait":20}}] }
srv02 2016-08-08 12:03:46,272 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Sending { Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","privateNetwork":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false}},"wait":20}}] }
srv02 2016-08-08 12:03:46,327 DEBUG [agent.transport.Request] (AgentManager-Handler-3:null) Seq 44672-2138365398070947852: Processing: { Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
srv02 2016-08-08 12:03:46,328 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Received: { Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, { Answer } }
srv02 2016-08-08 12:03:46,328 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) Fencing off VM that we don't know the state of
srv02 2016-08-08 12:03:46,328 INFO [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-2:ctx-11e5dfc9 work-858) KVMInvestigator found VM[User|i-8114-111750-VM]to be alive? null
srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:ctx-bc9b35f1) No need to calibrate cpu capacity, host:44687 usedCpu: 0 reservedCpu: 83524
srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] (CapacityChecker:ctx-bc9b35f1) No need to calibrate memory capacity, host:44687 usedMem: 0 reservedMem: 44023414784
srv02 2016-08-08 12:03:46,333 DEBUG [agent.transport.Request] (HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947853: Sending { Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.FenceCommand":{"vmName":"i-8114-111750-VM","hostGuid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","hostIp":"....","inSeq":false,"wait":0}}] }
```
The way I reproduced it in our lab was by dropping the traffic between one hypervisor and the management servers:
```
sudo iptables -I OUTPUT -d 10.16.13.83 -j DROP
sudo iptables -I INPUT -s 10.16.13.83 -j DROP
```
When you see the host change its state to Alert, restart the agent so that it quickly sends a `StartupCommand`, and re-enable the traffic if you want to see the VMs being shut down:
```
sudo service cloudstack-agent restart
sudo iptables -D OUTPUT -d 10.16.13.83 -j DROP
sudo iptables -D INPUT -s 10.16.13.83 -j DROP
```
During that time window (right after the host/agent turned to Alert) you'll start to see AsyncJobs appearing in the database to stop VMs, which is the behavior I'm requesting a change for. If the agent stays disconnected long enough, those commands vanish since the host cannot be reached. But if the timing is just *bad*, the agent comes back online while the jobs are being created, and it processes some of those commands. Playing with the `ping.timeout` value would only move the bug's window in another direction. It's not easy to make it happen, but it is feasible.
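To illustrate why tuning `ping.timeout` only shifts the window rather than closing it: the host only turns Alert after roughly `ping.interval × ping.timeout` seconds of silence. A trivial sketch (the multiplier semantics and the default values of 60 s and 2.5 are assumptions from memory, not taken from this PR):

```java
// Back-of-the-envelope sketch: how long until a silent host turns Alert.
// Assumes ping.timeout acts as a multiplier on ping.interval, with
// defaults 60 s and 2.5 (assumed values, not authoritative).
final class PingWindowSketch {
    static double alertAfterSeconds(double pingIntervalSec, double pingTimeoutMultiplier) {
        return pingIntervalSec * pingTimeoutMultiplier;
    }
}
```

Whatever that product comes out to, the race remains: an agent that reconnects just after the host turned Alert still meets the freshly created stop/fence jobs.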
The problem could be handled differently, but I found this approach a good solution that avoids adding too much complexity to the HA worker tasks, such as re-checking the host state between investigations. IMO, if a VM is not HA-enabled, nothing should be done: **the fencing should not be started**; instead, the power-state report should update the VM's state once the agent on the host is reachable again. Currently the fencing is done (and the restart is skipped, since the HA check happens beforehand). We are running a custom version based on 4.4.2 with many merges, but since the bug is still present on master as far as I can see, I pushed this PR. I hope this helps you better understand the problem.
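To make the proposed behaviour concrete, here is a minimal, hypothetical sketch of the guard this PR argues for (invented class and method names, not the actual `HighAvailabilityManagerImpl` code): when a VM is not HA-enabled and the investigators cannot determine its state, the worker should bail out and leave reconciliation to the power-state report instead of falling through to fencing.

```java
// Hypothetical sketch of the proposed guard -- NOT actual CloudStack code.
// VmRecord and HaDecision are stand-ins for illustration only.
enum HaDecision { FENCE, RESTART, LEAVE_TO_POWER_REPORT }

final class VmRecord {
    final String name;
    final boolean haEnabled;
    VmRecord(String name, boolean haEnabled) {
        this.name = name;
        this.haEnabled = haEnabled;
    }
}

final class HaWorkerSketch {
    /**
     * alive == null models the case in the logs above, where every
     * investigator returned "unknown" for the VM's state.
     */
    static HaDecision decide(VmRecord vm, Boolean alive) {
        if (!vm.haEnabled) {
            // Proposed change: never fence a non-HA VM; the next agent
            // ping's power-state report will reassert its real state.
            return HaDecision.LEAVE_TO_POWER_REPORT;
        }
        if (alive == null || !alive) {
            // State unknown or dead: fence before restarting elsewhere.
            return HaDecision.FENCE;
        }
        // Alive: keep the existing reschedule/restart handling.
        return HaDecision.RESTART;
    }
}
```

Under this logic, the non-HA VM `i-8114-111750-VM` from the logs would never reach the `FenceCommand` path; the `PowerOn` report seen at 11:57:38 would simply confirm it is still running.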