Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD storage. All our agents got disconnected from the management server and are unable to connect again, despite rebooting the management server and stopping and restarting the cloudstack-agent many times.
We even tried physically rebooting a hypervisor host (sacrificing all the VMs running on it) to see if it could reconnect after boot-up, but it still cannot reconnect (it stays in the "Connecting" state).

Here are excerpts from the logs:

====
2016-03-31 10:07:49,346 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-11: { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
2016-03-31 10:07:49,395 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) Received response: Seq 0-11: { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
2016-03-31 10:08:49,271 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
2016-03-31 10:08:49,350 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
2016-03-31 10:08:49,353 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-12: { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
2016-03-31 10:08:49,406 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 0-12: { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
2016-03-31 10:09:49,272 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
2016-03-31 10:09:49,345 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-5:null) Execution is successful.
2016-03-31 10:09:49,347 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 0-13: { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"newStates":{},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":0,"wait":0}}] }
2016-03-31 10:09:49,398 DEBUG [cloud.agent.Agent] (Agent-Handler-4:null) Received response: Seq 0-13: { Ans: , MgmtId: 161342671900, via: 75, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":0,"wait":0},"result":true,"wait":0}}] }
====

On the existing hypervisor hosts, the agent normally gets stuck at the following stage, and in the CloudStack GUI we don't see the agent in the "Connecting" state; it is in either the "Disconnected" or the "Alert" state.

====
2016-03-31 07:37:09,819 DEBUG [utils.script.Script] (main:null) Executing: /bin/bash -c uname -r
2016-03-31 07:37:09,829 DEBUG [utils.script.Script] (main:null) Execution is successful.
2016-03-31 07:37:09,832 DEBUG [cloud.agent.Agent] (main:null) Adding shutdown hook
2016-03-31 07:37:09,833 INFO [cloud.agent.Agent] (main:null) Agent [id = 73 : type = LibvirtComputingResource : zone = 6 : pod = 6 : workers = 5 : host = 10.x.x.x : port = 8250
2016-03-31 07:37:09,856 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 10.x.x.x:8250
2016-03-31 07:37:10,178 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
2016-03-31 07:37:10,179 INFO [utils.nio.NioClient] (Agent-Selector:null) Connected to 10.x.x.x:8250
====

We found no other significant or useful entries in either the agent logs or the management server logs.

Can anyone give a clue as to what the problem could be? We have been trying to reconnect for the past couple of hours without any success.

Any help is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.

-ip-