Github user marcaurele commented on the issue:

    https://github.com/apache/cloudstack/pull/1640
  
    I'll give the long explanation of the bug we hit in production.
    
    Twice we had the VPN go down between two of our zones, which cut the communication between the management servers and all the agents in the other zone. After a few minutes the network came back and, as the agents were reconnecting, we started to see VMs being shut down, even though we don't have HA enabled. We run two management servers behind haproxy to balance agents & requests. When the network came back, the agent reconnected to the other management server, 01, while server 02 had already started to issue commands to shut down the VMs.
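
    For context, here is a minimal sketch of that kind of balancing setup (purely illustrative: the IPs, names, and `balance` strategy are assumptions, not our actual config; 8250 is the agent port):
    ```
    # hypothetical haproxy.cfg excerpt; agents reach the management servers on TCP 8250
    listen cloudstack-agents
        bind *:8250
        mode tcp
        balance source
        server mgmt01 10.16.13.81:8250 check
        server mgmt02 10.16.13.82:8250 check
    ```
    With a setup like this, an agent can come back attached to a different management server after a network blip, which is exactly what happened here.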
    
    The important lines are below; I kept the ones showing one VM, `111750`, being shut down:
    
    ```
    srv02 2016-08-08 11:56:03,854 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) PingInvestigator was able to determine host 
44692 is in Disconnected
    srv02 2016-08-08 11:56:03,855 WARN  [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Agent is disconnected but the host is still up: 
44692-virt-hv041
    srv02 2016-08-08 11:56:03,857 WARN  [apache.cloudstack.alerts] 
(AgentTaskPool-16:ctx-8b5b6956)  alertType:: 7 // dataCenterId:: 2 // podId:: 2 
// clusterId:: null // message:: Host disconnected, name: virt-hv041 
(id:44692), availability zone: ch-dk-2, pod: DK2-AZ1-POD01
    srv02 2016-08-08 11:56:03,884 INFO  [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Host 44692 is disconnecting with event 
AgentDisconnected
    srv02 2016-08-08 11:56:03,895 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) The next status of agent 44692is Alert, current 
status is Up
    srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Deregistering link for 44692 with state Alert
    srv02 2016-08-08 11:56:03,896 DEBUG [agent.manager.AgentManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Remove Agent : 44692
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900831481: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900834802: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
    srv02 2016-08-08 11:56:03,897 DEBUG [agent.manager.AgentAttache] 
(AgentTaskPool-16:ctx-8b5b6956) Seq 44692-4932286016900886805: Sending 
disconnect to class com.cloud.network.security.SecurityGroupListener
    ...
    srv02 2016-08-08 11:56:03,979 DEBUG [cloud.network.NetworkUsageManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Disconnected called on 44692 with status Alert
    srv02 2016-08-08 11:56:03,987 DEBUG [cloud.host.Status] 
(AgentTaskPool-16:ctx-8b5b6956) Transition:[Resource state = Enabled, Agent 
event = AgentDisconnected, Host id = 44692, name = virt-hv041]
    srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Forwarding 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
 to 345049010805
    srv02 2016-08-08 11:56:03,998 DEBUG [cloud.cluster.ClusterManagerImpl] 
(Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805. 
agent: 44692, pdu seq: 15, pdu ack seq: 0, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Dispatch ->44692, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Intercepting command for agent change: agent 44692 event: AgentDisconnected
    srv01 2016-08-08 11:56:04,002 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) 
Received agent disconnect event for host 44692
    srv02 2016-08-08 11:56:04,002 DEBUG [cloud.cluster.ClusterManagerImpl] 
(Cluster-Worker-2:ctx-e087e71f) Cluster PDU 90520739220960 -> 345049010805 
completed. time: 3ms. agent: 44692, pdu seq: 15, pdu ack seq: 0, json: 
[{"com.cloud.agent.api.ChangeAgentCommand":{"agentId":44692,"event":"AgentDisconnected","contextMap":{},"wait":0}}]
    srv01 2016-08-08 11:56:04,004 DEBUG 
[agent.manager.ClusteredAgentManagerImpl] (Cluster-Worker-9:ctx-9101a6d4) Not 
processing AgentDisconnected event for the host id=44692 as the host is 
directly connected to the current management server 345049010805
    srv02 2016-08-08 11:56:04,004 WARN  [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Scheduling restart for VMs on host 
44692-virt-hv041
    ...
    srv02 2016-08-08 11:56:04,286 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Notifying HA Mgr of to restart vm 
111750-i-8114-111750-VM
    srv02 2016-08-08 11:56:04,296 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(AgentTaskPool-16:ctx-8b5b6956) Schedule vm for HA:  VM[User|i-8114-111750-VM]
    ...
    srv01 2016-08-08 11:56:17,394 DEBUG [agent.manager.AgentManagerImpl] 
(StatsCollector-4:ctx-dcd63254) Can not send command 
com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
    srv02 2016-08-08 11:56:17,395 DEBUG [agent.manager.AgentManagerImpl] 
(StatsCollector-3:ctx-8cb3d7fb) Can not send command 
com.cloud.agent.api.GetVmStatsCommand due to Host 44692 is not up
    srv01 2016-08-08 11:56:37,825 DEBUG [agent.manager.AgentManagerImpl] 
(AgentManager-Handler-13:null) Ping from 44692
    srv01 2016-08-08 11:56:37,826 DEBUG 
[cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) 
Process host VM state report from ping process. host: 44692
    srv01 2016-08-08 11:56:37,886 DEBUG 
[cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) 
Process VM state report. host: 44692, number of records in report: 26
    srv01 2016-08-08 11:56:37,891 DEBUG 
[cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-13:null) VM 
state report is updated. host: 44692, vm id: 81688, power state: PowerOn
    ...
    srv01 2016-08-08 11:57:38,581 DEBUG 
[cloud.vm.VirtualMachinePowerStateSyncImpl] (AgentManager-Handler-1:null) VM 
state report is updated. host: 44692, vm id: 111750, power state: PowerOn
    ...
    srv01 2016-08-08 12:00:50,544 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) Processing 
HAWork[858-HA-111750-Running-Investigating]
    srv01 2016-08-08 12:00:50,555 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) HA on VM[User|i-8114-111750-VM]
    srv01 2016-08-08 12:00:50,565 DEBUG [agent.transport.Request] 
(HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Sending  { 
Cmd , MgmtId: 345049010805, via: 44692(virt-hv041), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-8114-111750-VM","wait":20}}]
 }
    ...
    srv01 2016-08-08 12:00:50,658 DEBUG [agent.transport.Request] 
(HA-Worker-2:ctx-f9c5c881 work-858) Seq 44692-6380193297100439580: Received:  { 
Ans: , MgmtId: 345049010805, via: 44692, Ver: v1, Flags: 10, { 
CheckVirtualMachineAnswer } }
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.CheckOnAgentInvestigator] 
(HA-Worker-2:ctx-f9c5c881 work-858) Agent responded with state Running
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling because the host is not up but 
the vm is alive
    srv01 2016-08-08 12:00:50,658 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) VM i-8114-111750-VM is found to be alive by 
SimpleInvestigator
    srv01 2016-08-08 12:00:50,658 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) Rescheduling 
HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:01:51 
CEST 2016
    srv01 2016-08-08 12:00:50,658 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-f9c5c881 work-858) SimpleInvestigator found 
VM[User|i-8114-111750-VM]to be alive? true
    ...
    srv01 2016-08-08 12:02:35,374 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) Processing 
HAWork[858-HA-111750-Running-Investigating]
    srv01 2016-08-08 12:02:35,381 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) HA on VM[User|i-8114-111750-VM]
    ...
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.CheckOnAgentInvestigator] 
(HA-Worker-4:ctx-ee61de22 work-858) Agent responded with state Running
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) Rescheduling because the host is not up but 
the vm is alive
    srv01 2016-08-08 12:02:35,478 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) VM i-8114-111750-VM is found to be alive by 
SimpleInvestigator
    srv01 2016-08-08 12:02:35,478 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) Rescheduling 
HAWork[858-HA-111750-Running-Investigating] to try again at Mon Aug 08 12:03:36 
CEST 2016
    srv01 2016-08-08 12:02:35,478 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-4:ctx-ee61de22 work-858) SimpleInvestigator found 
VM[User|i-8114-111750-VM]to be alive? true
    ...
    srv02 2016-08-08 12:03:46,230 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Processing 
HAWork[858-HA-111750-Running-Investigating]
    srv02 2016-08-08 12:03:46,237 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) HA on VM[User|i-8114-111750-VM]
    srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.CheckOnAgentInvestigator] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Unable to reach the agent for 
VM[User|i-8114-111750-VM]: Resource [Host:44692] is unreachable: Host 44692: 
Host with specified id is not in the right state: Alert
    srv02 2016-08-08 12:03:46,246 DEBUG [cloud.ha.UserVmDomRInvestigator] 
(HA-Worker-2:ctx-11e5dfc9 work-858) testing if VM[User|i-8114-111750-VM] is 
alive
    srv02 2016-08-08 12:03:46,246 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) SimpleInvestigator found 
VM[User|i-8114-111750-VM]to be alive? null
    srv02 2016-08-08 12:03:46,246 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) XenServerInvestigator found 
VM[User|i-8114-111750-VM]to be alive? null
    srv02 2016-08-08 12:03:46,259 DEBUG [agent.manager.AgentManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Host with id null doesn't exist
    srv02 2016-08-08 12:03:46,259 DEBUG 
[cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) 
Not a System Vm, unable to determine state of VM[User|i-8114-111750-VM] 
returning null
    srv02 2016-08-08 12:03:46,259 DEBUG 
[cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) 
Testing if VM[User|i-8114-111750-VM] is alive
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Returning null since we're unable to 
determine state of VM[User|i-8114-111750-VM]
    srv02 2016-08-08 12:03:46,259 DEBUG [cloud.ha.UserVmDomRInvestigator] 
(HA-Worker-2:ctx-11e5dfc9 work-858) VM[User|i-8114-111750-VM] could not be 
pinged, returning that it is unknown
    srv02 2016-08-08 12:03:46,259 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) PingInvestigator found 
VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,261 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-3:ctx-e1a90562 work-859) HA on VM[User|i-10355-116755-VM]
    srv02 2016-08-08 12:03:46,264 DEBUG 
[cloud.ha.ManagementIPSystemVMInvestigator] (HA-Worker-2:ctx-11e5dfc9 work-858) 
Unable to find a management nic, cannot ping this system VM, unable to 
determine state of VM[User|i-8114-111750-VM] returning null
    srv02 2016-08-08 12:03:46,265 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) ManagementIPSysVMInvestigator found 
VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,266 DEBUG [agent.transport.Request] 
(HA-Worker-3:ctx-e1a90562 work-859) Seq 44680-8591460714139353168: Sending  { 
Cmd , MgmtId: 345049010805, via: 44680(virt-hv029), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckVirtualMachineCommand":{"vmName":"i-10355-116755-VM","wait":20}}]
 }
    srv02 2016-08-08 12:03:46,272 DEBUG [agent.transport.Request] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Sending  { 
Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.CheckOnHostCommand":{"host":{"guid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","privateNetwork":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false},"storageNetwork1":{"ip":"....","netmask":"....","mac":"....","isSecurityGroupEnabled":false}},"wait":20}}]
 }
    srv02 2016-08-08 12:03:46,327 DEBUG [agent.transport.Request] 
(AgentManager-Handler-3:null) Seq 44672-2138365398070947852: Processing:  { 
Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, 
[{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
    srv02 2016-08-08 12:03:46,328 DEBUG [agent.transport.Request] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947852: Received:  { 
Ans: , MgmtId: 90520739220960, via: 44672, Ver: v1, Flags: 10, { Answer } }
    srv02 2016-08-08 12:03:46,328 DEBUG [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Fencing off VM that we don't know the state 
of
    srv02 2016-08-08 12:03:46,328 INFO  [cloud.ha.HighAvailabilityManagerImpl] 
(HA-Worker-2:ctx-11e5dfc9 work-858) KVMInvestigator found 
VM[User|i-8114-111750-VM]to be alive? null
    srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] 
(CapacityChecker:ctx-bc9b35f1) No need to calibrate cpu capacity, host:44687 
usedCpu: 0 reservedCpu: 83524
    srv01 2016-08-08 12:03:46,333 DEBUG [cloud.capacity.CapacityManagerImpl] 
(CapacityChecker:ctx-bc9b35f1) No need to calibrate memory capacity, host:44687 
usedMem: 0 reservedMem: 44023414784
    srv02 2016-08-08 12:03:46,333 DEBUG [agent.transport.Request] 
(HA-Worker-2:ctx-11e5dfc9 work-858) Seq 44672-2138365398070947853: Sending  { 
Cmd , MgmtId: 90520739220960, via: 44672(virt-hv019), Ver: v1, Flags: 100011, 
[{"com.cloud.agent.api.FenceCommand":{"vmName":"i-8114-111750-VM","hostGuid":"eea983e8-6156-4ddb-b5c5-e4c4efa9ff4b-LibvirtComputingResource","hostIp":"....","inSeq":false,"wait":0}}]
 }
    
    ```
    
    The way I reproduced it in our lab is by dropping the traffic between one hypervisor and the management servers:
    ```
    sudo iptables -I OUTPUT -d 10.16.13.83 -j DROP
    sudo iptables -I INPUT -s 10.16.13.83 -j DROP
    ```
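
    To spot the moment the host flips to Alert, one way is to follow the management server log (the path below is the standard packaged location; adjust it to your install):
    ```
    tail -f /var/log/cloudstack/management/management-server.log | grep --line-buffered Alert
    ```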
    
    When you see the host change its state to Alert, restart the agent so that it quickly sends a `StartupCommand`, then re-enable the traffic if you want to see the VMs being shut down:
    ```
    sudo service cloudstack-agent restart
    sudo iptables -D OUTPUT -d 10.16.13.83 -j DROP
    sudo iptables -D INPUT -s 10.16.13.83 -j DROP
    ```
    
    During that time window (right after the host/agent turned to Alert) you'll start to see AsyncJobs appearing in the database to stop VMs, which is the behavior I'm requesting a change for. If the agent stays disconnected long enough, the commands simply vanish since the host cannot be reached. But if the timing is just *bad*, the agent comes back online while the jobs are being created, and it processes some of those commands. Playing with the `ping.timeout` value would only shift the bug's window, not close it.
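
    One simple way to watch those jobs land, assuming the standard `cloud` database and its `async_job` table (column names as in the 4.x schema; credentials are placeholders):
    ```
    watch -n1 "mysql -u cloud -p'<password>' cloud -e \
      'SELECT id, job_cmd, job_status, created FROM async_job ORDER BY id DESC LIMIT 10'"
    ```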
    
    It's not easy to make it happen, but it is feasible.
    
    The problem could be handled differently, but I found this a good solution that avoids adding too much complexity to the HA worker tasks, such as re-checking the host state between investigations. IMO, if a VM is not HA-enabled, then nothing should be done: **the fencing should not be started**; instead, let the power state report fix the VM's state once the management server can reach the agent on the host again. Currently the fencing is performed, and only the restart is skipped, since the check for HA is done just before the restart.
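
    To make the proposal concrete, here is a minimal, self-contained sketch of the decision I'm arguing for (names and structure are illustrative only, not the actual PR diff; in CloudStack the real logic lives in `HighAvailabilityManagerImpl`):
    ```
    // Illustrative sketch of the proposed guard, not CloudStack's actual code.
    public class HaFencingGuardSketch {
        enum HostState { Up, Alert, Disconnected }

        // Fence only when the VM is HA-enabled and no investigator could determine its state.
        static boolean shouldFence(boolean vmHaEnabled, HostState hostState, Boolean vmAlive) {
            if (!vmHaEnabled) {
                // Non-HA VM: do nothing; the power state report will resync it
                // once the agent reconnects.
                return false;
            }
            return hostState != HostState.Up && vmAlive == null;
        }

        public static void main(String[] args) {
            // The scenario from the logs: host in Alert, every investigator returned null.
            System.out.println(shouldFence(false, HostState.Alert, null)); // false -> VM left running
            System.out.println(shouldFence(true,  HostState.Alert, null)); // true  -> fencing proceeds
        }
    }
    ```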
    
    We are running a custom version based on 4.4.2 with many merges, but since I could see that the bug is still present on master, I pushed this PR.
    
    Hope this helps you better understand the problem.

