[ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201921#comment-14201921 ]
Koushik Das commented on CLOUDSTACK-7853:
-----------------------------------------

In case of PingTimeout, the available investigators are run to determine the state of the host. The logs below appear when that happens. Once the host is available again, one of these investigators should be able to detect that. Can you check the MS logs and see what is going on?

2014-11-05 16:32:07,211 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) SimpleInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,211 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) XenServerInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,273 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) PingInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,273 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) ManagementIPSysVMInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,273 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) KVMInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,273 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) HypervInvestigator unable to determine the state of the host. Moving on.
2014-11-05 16:32:07,323 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-4:ctx-d025d8e6) Simulator Investigator was able to determine host 1 is in Down

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>    Affects Versions: Future, 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>            Priority: Critical
>
> If for some reason (I've been unable to determine why, but my suspicion is that the management server is busy processing other agent requests and/or xapi is temporarily unavailable) a host that is Disconnected gets behind on ping (PingTimeout), it is transitioned to a permanent state of Alert.
> INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
>
> I think the bug occurs because there is no valid state transition from Alert via PingTimeout to something recoverable.
>
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
>
> As a workaround to get out of this situation, we put the cluster in Unmanage, wait 10 minutes, and put the cluster back in manage.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
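
To make the reported failure easier to reason about, here is a minimal, self-contained sketch of a table-driven state machine. It is not CloudStack's actual Status or StateMachine code: the class HostStatusFsmSketch and its methods are hypothetical names, and the exception type is a stand-in. The table holds only the six Alert transitions quoted from Status.java plus the Disconnected/PingTimeout -> Alert step seen in the log. A second PingTimeout in Alert then finds no table entry and fails, which mirrors the "Unable to transition to a new state from Alert via PingTimeout" error.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch, not the real com.cloud.host.Status implementation.
public class HostStatusFsmSketch {
    enum Status { Connecting, Up, Disconnected, Alert, Removed }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    // Transition table: current status -> (event -> next status).
    private final Map<Status, Map<Event, Status>> table = new EnumMap<>(Status.class);

    void addTransition(Status from, Event event, Status to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, Status>(Event.class))
             .put(event, to);
    }

    // Looks up the next status; a missing entry is an error, as in the report.
    Status getNextStatus(Status from, Event event) {
        Map<Event, Status> row = table.get(from);
        Status to = (row == null) ? null : row.get(event);
        if (to == null) {
            throw new IllegalStateException(
                "Unable to transition to a new state from " + from + " via " + event);
        }
        return to;
    }

    static HostStatusFsmSketch build() {
        HostStatusFsmSketch fsm = new HostStatusFsmSketch();
        // Observed in the log: old status = Disconnected, event = PingTimeout, new status = Alert.
        fsm.addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // The six Alert transitions quoted from Status.java in the issue.
        fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        return fsm;
    }

    public static void main(String[] args) {
        HostStatusFsmSketch fsm = build();
        // First monitor cycle: the Disconnected host falls behind on ping.
        Status s = fsm.getNextStatus(Status.Disconnected, Event.PingTimeout);
        System.out.println("Disconnected --PingTimeout--> " + s);
        // Next cycle: the same event arrives while the host is already in Alert.
        try {
            fsm.getNextStatus(s, Event.PingTimeout);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
        // A Ping event is one of the listed ways out of Alert.
        System.out.println("Alert --Ping--> " + fsm.getNextStatus(Status.Alert, Event.Ping));
    }
}
```

Whether the right fix is an additional Alert/PingTimeout entry or skipping the transition for hosts already in Alert is a design question for the actual code base; the sketch only illustrates why the second monitor cycle throws.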