Koushik,

OK, imagine the server is offline (burned-out CPU, failed power supply, etc.) and there is no way to get the host back online within 1-2 hours. However, CS management still considers the host to be online. What is the proper way to deal with this issue?

On Fri, Jul 12, 2013 at 2:20 PM, Koushik Das <koushik....@citrix.com> wrote:

> I looked at the logs and none of the existing investigators are able to
> determine that the host is down. I am not sure if there is a clean way to
> identify if a host is down in case of KVM. Consider the following cases:
>
> 1. Host is actually shutdown
> 2. Management nic of the host is plugged out of the network but host is up
> and running
>
> There is no clean way to distinguish these cases. Cloudstack should only
> mark the host as down in the first case. But not sure how one would achieve
> this.
>
> -Koushik
>
> > -----Original Message-----
> > From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
> > Sent: Friday, July 12, 2013 2:39 PM
> > To: users@cloudstack.apache.org
> > Subject: Re: cs 4.1 host disconnected status
> >
> > I've simulated crash again and here is the log:
> > http://thesuki.org/temp/cs.log.txt
> > I stripped out of there GET requests with api keys.
> > Server was switched off at 8:36
> >
> > On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das <koushik....@citrix.com>
> > wrote:
> >
> > > Looks like the KVM investigator is not able to determine the state of
> > > the agent. Can you share the full log?
> > >
> > > > -----Original Message-----
> > > > From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
> > > > Sent: Thursday, July 11, 2013 7:47 PM
> > > > To: users
> > > > Subject: cs 4.1 host disconnected status
> > > >
> > > > Hi all.
> > > >
> > > > I use the following environment: CS 4.1, KVM, Centos 6.4
> > > > (management+node1+node2), OpenIndiana NFS server as primary and
> > > > secondary storage.
> > > > and I have the following problem:
> > > > If I switch one hypervisor node off via ipmi (simulate server
> > > > crash), it never
> > > > goes to Disconnected status in management. Accordingly, ha-enabled
> > > > VMs are not restarted on another hypervisor node, because it
> > > > believes that disconnected node is still online.
> > > >
> > > >
> > > > I get following in management server logs:
> > > >
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> > > > (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:
> > > > { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
> > > > [{"Answer":{"result":false,"details": "Unable to ping computing host,
> > > > exiting","wait":0}}] }
> > > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> > > > (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId:
> > > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > > > (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged,
> > > > returning null ('I don't know')
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > > > (AgentTaskPool-1:null) could not reach agent, could not reach agent's
> > > > host, returning that we don't have enough information
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > > (AgentTaskPool-1:null) null unable to determine the state of the host.
> > > > Moving on.
> > > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > > > (AgentTaskPool-1:null) null unable to determine the state of the host.
> > > > Moving on.
> > > > 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl]
> > > > (AgentTaskPool-1:null) Agent state cannot be determined, do
> > > > nothing
> > > >
> > > >
> > > > If I power on dead node, it goes to state "Connecting" and then "Up"
> > > > in management interface.
> > > >
> > > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
> > > > Ping timeout for host 12, do invstigation
> > > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled,
> > > > Agent event = AgentConnected, Host id = 12, name =
> > > > ad112.colobridge.net]
> > > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name =
> > > > ad112.colobridge.net; old status = Up; event = AgentConnected; new
> > > > status = Connecting; old update count = 1285; new update count = 1286]
> > > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled,
> > > > Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
> > > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
> > > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name =
> > > > ad112.colobridge.net; old status = Connecting; event = Ready; new
> > > > status = Up; old update count = 1286; new update count = 1287]
> > > >
> > > >
> > > > If I restart cloud-management service, dead node goes to state
> > > > "Disconnected" in management interface.
> > > > (there is nothing special in logs in this case)
> > > >
> > > > If I do nothing, dead node could stay in "Up" state forever (I
> > > > waited for
> > > > 12 hours) in management interface, throwing into logs "Agent state
> > > > cannot be determined, do nothing"
> > > >
> > > > Would appreciate if someone could help/suggest how to deal with this
> > > > problem.
> > > >
> > > > --
> > > > Regards,
> > > > Valery
> > > >
> > > > http://protocol.by/slayer
> > > >
> > >
> >
> > --
> > Regards,
> > Valery
> >
> > http://protocol.by/slayer
>

--
Regards,
Valery

http://protocol.by/slayer
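
The quoted log shows each investigator answering "I don't know" (null) when the host cannot be pinged, after which the agent manager does nothing and the host stays "Up". A minimal Python sketch of that decision flow follows. It is only an illustration of the behaviour visible in the log, not CloudStack's actual code; every name in it (`ping_investigator`, `determine_host_state`) is hypothetical.

```python
from typing import Callable, Iterable, Optional, Set

# Hypothetical model of the behaviour seen in the log above; not CloudStack code.
# An investigator returns True (host is up), False (host is down),
# or None ("I don't know").
Investigator = Callable[[str], Optional[bool]]


def ping_investigator(reachable: Set[str]) -> Investigator:
    """Build an investigator that can only confirm hosts it can reach."""
    def investigate(host: str) -> Optional[bool]:
        # A failed ping proves nothing: the host may be powered off, or only
        # its management NIC may be unplugged. So the answer is None, not False.
        return True if host in reachable else None
    return investigate


def determine_host_state(host: str,
                         investigators: Iterable[Investigator]) -> Optional[str]:
    """Ask each investigator in turn; the first definite answer wins."""
    for investigate in investigators:
        verdict = investigate(host)
        if verdict is not None:
            return "Up" if verdict else "Down"
        # Otherwise: unable to determine the state of the host, moving on.
    return None  # nobody knows


if __name__ == "__main__":
    # No host is reachable, as after the simulated crash.
    state = determine_host_state("172.16.20.241", [ping_investigator(set())])
    if state is None:
        print("Agent state cannot be determined, do nothing")
```

In this model the host is never marked "Down" unless some investigator can positively say so, which matches the two cases Koushik lists: a ping failure alone cannot tell a dead host from an unplugged management NIC.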