Hi all.

I use the following environment: CloudStack (CS) 4.1, KVM, CentOS 6.4 (management + node1 + node2), with an OpenIndiana NFS server as primary and secondary storage.

I have the following problem: if I switch one hypervisor node off via IPMI (to simulate a server crash), it never goes to Disconnected status in management. As a result, HA-enabled VMs are not restarted on another hypervisor node, because management believes the powered-off node is still online.

I get the following in the management server logs:

2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details": "Unable to ping computing host, exiting","wait":0}}] }
2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing

If I power the dead node back on, it goes to state "Connecting" and then "Up" in the management interface:

2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]

If I restart the cloud-management service, the dead node goes to state "Disconnected" in the management interface (there is nothing special in the logs in this case).

If I do nothing, the dead node can stay in the "Up" state indefinitely in the management interface (I waited for 12 hours), while the logs keep showing "Agent state cannot be determined, do nothing".

I would appreciate it if someone could help or suggest how to deal with this problem.

--
Regards,
Valery
http://protocol.by/slayer