Check out https://issues.apache.org/jira/browse/CLOUDSTACK-3535.
-Koushik

On 26-Aug-2013, at 7:16 PM, Valery Ciareszka <valery.teres...@gmail.com> wrote:

> Koushik,
>
> Ok, imagine the server is offline (burned cpu/ power supply etc), and there
> is no way to get the host back online within 1-2 hours.
> However, CS management considers host as online.
> What is the proper way to deal with this issue ?
>
> On Fri, Jul 12, 2013 at 2:20 PM, Koushik Das <koushik....@citrix.com> wrote:
>
>> I looked at the logs and none of the existing investigators are able to
>> determine that the host is down. I am not sure if there is a clean way to
>> identify if a host is down in case of KVM. Consider the following cases:
>>
>> 1. Host is actually shutdown
>> 2. Management nic of the host is plugged out of the network but host is up
>>    and running
>>
>> There is no clean way to distinguish these cases. Cloudstack should only
>> mark the host as down in the first case. But not sure how one would achieve
>> this.
>>
>> -Koushik
>>
>>> -----Original Message-----
>>> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>> Sent: Friday, July 12, 2013 2:39 PM
>>> To: users@cloudstack.apache.org
>>> Subject: Re: cs 4.1 host disconnected status
>>>
>>> I've simulated crash again and here is the log:
>>> http://thesuki.org/temp/cs.log.txt
>>> I stripped out of there GET requests with api keys.
>>> Server was switched off at 8:36
>>>
>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das <koushik....@citrix.com> wrote:
>>>
>>>> Looks like the KVM investigator is not able to determine the state of
>>>> the agent. Can you share the full log?
>>>>
>>>>> -----Original Message-----
>>>>> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>>>> Sent: Thursday, July 11, 2013 7:47 PM
>>>>> To: users
>>>>> Subject: cs 4.1 host disconnected status
>>>>>
>>>>> Hi all.
>>>>>
>>>>> I use the following environment: CS 4.1, KVM, Centos 6.4
>>>>> (management+node1+node2), OpenIndiana NFS server as primary and
>>>>> secondary storage.
>>>>> and I have the following problem:
>>>>> If I switch one hypervisor node off via ipmi (simulate server
>>>>> crash), it never goes to Disconnected status in management.
>>>>> Accordingly, ha-enabled VMs are not restarted on another hypervisor
>>>>> node, because it believes that disconnected node is still online.
>>>>>
>>>>> I get following in management server logs:
>>>>>
>>>>> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details": "Unable to ping computing host, exiting","wait":0}}] }
>>>>> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>>>> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
>>>>> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
>>>>> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
>>>>> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
>>>>> 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>>>>
>>>>> If I power on dead node, it goes to state "Connecting" and then "Up"
>>>>> in management interface.
>>>>>
>>>>> 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>>> 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>>> 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
>>>>> 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
>>>>> 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
>>>>> 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
>>>>> 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>>>>>
>>>>> If I restart cloud-management service, dead node goes to state
>>>>> "Disconnected" in management interface.
>>>>> (there is nothing special in logs in this case)
>>>>>
>>>>> If I do nothing, dead node could stay in "Up" state forever (I
>>>>> waited for 12 hours) in management interface, throwing into logs
>>>>> "Agent state cannot be determined, do nothing"
>>>>>
>>>>> Would appreciate if someone could help/suggest how to deal with this
>>>>> problem.
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Valery
>>>>>
>>>>> http://protocol.by/slayer
>>>>
>>>
>>> --
>>> Regards,
>>> Valery
>>>
>>> http://protocol.by/slayer
>>
>
> --
> Regards,
> Valery
>
> http://protocol.by/slayer
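
For readers following the quoted log, the behaviour it shows (each investigator answering "I don't know", the HA manager moving on, and the agent manager finally doing nothing) can be modelled roughly as below. This is an illustrative Python sketch only, not CloudStack code: the names HostStatus, ping_investigator, determine_status and handle_ping_timeout are hypothetical and were chosen just to mirror the log messages.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, Set


class HostStatus(Enum):
    UP = "Up"
    DOWN = "Down"


# Hypothetical investigator type: returns a status when it is certain,
# or None when it cannot tell (the log's "returning null ('I don't know')").
Investigator = Callable[[str], Optional[HostStatus]]


def ping_investigator(reachable: Set[str]) -> Investigator:
    """Build a toy investigator backed by a set of hosts that answer pings."""
    def investigate(host_ip: str) -> Optional[HostStatus]:
        # A reply proves the host is up; silence proves nothing, because a
        # powered-off host and an unplugged management NIC look identical.
        return HostStatus.UP if host_ip in reachable else None
    return investigate


def determine_status(host_ip: str,
                     investigators: Iterable[Investigator]) -> Optional[HostStatus]:
    """Ask each investigator in turn; the first definite answer wins."""
    for investigate in investigators:
        status = investigate(host_ip)
        if status is not None:
            return status
        # Corresponds to "unable to determine the state of the host. Moving on."
    return None


def handle_ping_timeout(host_ip: str,
                        investigators: Iterable[Investigator]) -> str:
    """Decide what to do after a ping timeout, as the log suggests."""
    status = determine_status(host_ip, investigators)
    if status is None:
        # Corresponds to "Agent state cannot be determined, do nothing".
        return "do nothing"
    return f"mark host {status.value}"


if __name__ == "__main__":
    nobody_answers = [ping_investigator(set()), ping_investigator(set())]
    print(handle_ping_timeout("172.16.20.241", nobody_answers))
```

In this model no investigator can ever return Down, so a host that stops answering stays in its last known state, which matches the "Up forever" symptom described in the original report.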