[ https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275620#comment-16275620 ]
ASF subversion and git services commented on CLOUDSTACK-7853:
-------------------------------------------------------------

Commit 9d6972cb244cc3f659624bbbc35f99fff1c2a44b in cloudstack's branch refs/heads/debian9-systemvmtemplate from [~rohit.ya...@shapeblue.com]
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=9d6972c ]

CLOUDSTACK-7853: Fix ping timeout edge case and refactor code

Refresh InaccurateClock every 10 seconds, refactor code to get ping timeout
and ping interval.

Signed-off-by: Rohit Yadav <rohit.ya...@shapeblue.com>

> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public (Anyone can view this level - this is the default.)
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>
> If for some reason (I've been unable to determine why, but my suspicion is
> that the management server is busy processing other agent requests and/or
> xapi is temporarily unavailable) a host that is Disconnected gets behind on
> ping (PingTimeout), it is transitioned to a permanent state of Alert.
>
> INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
>
> I think the bug occurs because there is no valid state transition from Alert
> via PingTimeout to something recoverable.
>
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
>
> As a workaround to get out of this situation, we put the cluster in Unmanage,
> wait 10 minutes, and put the cluster back in manage.
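
For illustration only, the failure described in the report can be reproduced with a small stand-alone transition table. This is a sketch, not CloudStack code: the class HostStatusFsmSketch and its methods addTransition, next and reportedTable are hypothetical names, and the enums list only the states and events quoted in the report. The Alert transitions are copied from the Status.java excerpt, and the Disconnected/PingTimeout entry is taken from the "Agent status update" log line. The final Alert/PingTimeout entry is an assumption added to show how a lookup stops failing once some transition exists; it is not claimed to be what commit 9d6972c changed.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a host status state machine; not the CloudStack class.
class HostStatusFsmSketch {
    enum Status { Connecting, Up, Disconnected, Alert, Removed }
    enum Event { AgentConnected, Ping, Remove, ManagementServerDown,
                 AgentDisconnected, ShutdownRequested, PingTimeout }

    private final Map<Status, Map<Event, Status>> transitions = new EnumMap<>(Status.class);

    void addTransition(Status from, Event event, Status to) {
        transitions.computeIfAbsent(from, s -> new EnumMap<Event, Status>(Event.class))
                   .put(event, to);
    }

    // Returns the next status, or empty when no transition is registered.
    Optional<Status> next(Status from, Event event) {
        Map<Event, Status> byEvent = transitions.get(from);
        return byEvent == null ? Optional.empty() : Optional.ofNullable(byEvent.get(event));
    }

    // Builds only the transitions that appear in the report.
    static HostStatusFsmSketch reportedTable() {
        HostStatusFsmSketch fsm = new HostStatusFsmSketch();
        // Seen in the log: old status = Disconnected, event = PingTimeout, new status = Alert.
        fsm.addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Copied from the Status.java excerpt in the report.
        fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        return fsm;
    }

    public static void main(String[] args) {
        HostStatusFsmSketch fsm = reportedTable();
        // First monitor cycle: the Disconnected host moves to Alert.
        System.out.println(fsm.next(Status.Disconnected, Event.PingTimeout)); // Optional[Alert]
        // Next cycle: no entry for Alert + PingTimeout, which is where the exception arises.
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout));        // Optional.empty
        // Assumption for illustration: register a self-transition so the lookup succeeds.
        fsm.addTransition(Status.Alert, Event.PingTimeout, Status.Alert);
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout));        // Optional[Alert]
    }
}
```

In this sketch a missing entry is reported as an empty Optional rather than an exception, which makes the gap in the table easy to see; the log above shows the real code throwing CloudRuntimeException at the same point.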