Looks like the KVM investigator is not able to determine the state of the agent. Can you share the full log?
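
For context, here is a rough sketch of the decision flow that the quoted log lines suggest. It is a simplified model written for illustration, not the actual CloudStack code: the names `HostStatus`, `investigate`, and `next_status` are hypothetical, and the only behaviour taken from the logs is that an investigator may answer "I don't know" (null), that the HA manager then moves on to the next one, and that nothing is done when no investigator reaches a verdict.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class HostStatus(Enum):
    """Verdicts an investigator can reach (hypothetical, simplified)."""
    UP = "Up"
    DOWN = "Down"


# An investigator returns a HostStatus when it is sure,
# or None when it cannot tell ("I don't know" in the logs).
Investigator = Callable[[str], Optional[HostStatus]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[HostStatus]:
    """Ask each investigator in turn; the first definite answer wins."""
    for probe in investigators:
        verdict = probe(host)
        if verdict is not None:
            return verdict
        # Corresponds to "unable to determine the state of the host. Moving on."
    return None


def next_status(current: str, verdict: Optional[HostStatus]) -> str:
    """Keep the current status unless some investigator reached a verdict."""
    if verdict is None:
        # Corresponds to "Agent state cannot be determined, do nothing"
        return current
    return verdict.value


def cannot_tell(host: str) -> Optional[HostStatus]:
    """Stand-in for an investigator that cannot reach the host or its agent."""
    return None


if __name__ == "__main__":
    verdict = investigate("172.16.20.241", [cannot_tell, cannot_tell])
    print(next_status("Up", verdict))
```

Under these assumptions, a host that no investigator can reach keeps its old status, which would match the node staying "Up" in the management interface.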
> -----Original Message-----
> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
> Sent: Thursday, July 11, 2013 7:47 PM
> To: users
> Subject: cs 4.1 host disconnected status
>
> Hi all.
>
> I use the following environment: CS 4.1, KVM, Centos 6.4
> (management+node1+node2), OpenIndiana NFS server as primary and
> secondary storage.
> and I have the following problem:
> If I switch one hypervisor node off via ipmi (simulate server crash), it never
> goes to Disconnected status in management. Accordingly, ha-enabled VMs
> are not restarted on another hypervisor node, because it believes that
> disconnected node is still online.
>
> I get following in management server logs:
>
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details": "Unable to ping computing host, exiting","wait":0}}] }
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>
> If I power on dead node, it goes to state "Connecting" and then "Up" in
> management interface.
>
> 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
> 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
> 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
> 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>
> If I restart cloud-management service, dead node goes to state
> "Disconnected" in management interface.
> (there is nothing special in logs in this case)
>
> If I do nothing, dead node could stay in "Up" state forever (I waited for
> 12 hours) in management interface, throwing into logs "Agent state cannot
> be determined, do nothing"
>
> Would appreciate if someone could help/suggest how to deal with this
> problem.
>
> --
> Regards,
> Valery
>
> http://protocol.by/slayer