[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-7853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16275620#comment-16275620
 ] 

ASF subversion and git services commented on CLOUDSTACK-7853:
-------------------------------------------------------------

Commit 9d6972cb244cc3f659624bbbc35f99fff1c2a44b in cloudstack's branch 
refs/heads/debian9-systemvmtemplate from [~rohit.ya...@shapeblue.com]
[ https://gitbox.apache.org/repos/asf?p=cloudstack.git;h=9d6972c ]

CLOUDSTACK-7853: Fix ping timeout edge case and refactor code

Refresh InaccurateClock every 10 seconds, refactor code to get ping timeout
and ping interval.

Signed-off-by: Rohit Yadav <rohit.ya...@shapeblue.com>


> Hosts that are temporary Disconnected and get behind on ping (PingTimeout) 
> turn up in permanent state Alert
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-7853
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7853
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>    Affects Versions: 4.3.0, 4.4.0, 4.5.0, 4.3.1, 4.4.1, 4.6.0
>            Reporter: Joris van Lieshout
>
> If for some reason (I've been unable to determine why, but my suspicion is
> that the management server is busy processing other agent requests and/or
> xapi is temporarily unavailable) a host that is Disconnected gets behind on
> ping (PingTimeout), it is transitioned to a permanent state of Alert.
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-9551e174) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-9551e174) Agent status update: [id = 421; name = xxxxxx1; old status = Disconnected; event = PingTimeout; new status = Alert; old update count = 111; new update count = 112]
> ----/ next cycle / -----
> INFO  [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Found the following agents behind on ping: [421, 427, 425]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Ping timeout for host 421, do invstigation
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Transition:[Resource state = Enabled, Agent event = PingTimeout, Host id = 421, name = xxxxxx1]
> DEBUG [c.c.h.Status] (AgentMonitor-1:ctx-2a81b9f7) Cannot transit agent status with event PingTimeout for host 421, name=xxxxxx1, mangement server id is 345052370017
> ERROR [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-2a81b9f7) Caught the following exception:
> com.cloud.utils.exception.CloudRuntimeException: Cannot transit agent status with event PingTimeout for host 421, mangement server id is 345052370017,Unable to transition to a new state from Alert via PingTimeout
>         at com.cloud.agent.manager.AgentManagerImpl.agentStatusTransitTo(AgentManagerImpl.java:1334)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectAgent(AgentManagerImpl.java:1349)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectInternal(AgentManagerImpl.java:1378)
>         at com.cloud.agent.manager.AgentManagerImpl.disconnectWithInvestigation(AgentManagerImpl.java:1384)
>         at com.cloud.agent.manager.AgentManagerImpl$MonitorTask.runInContext(AgentManagerImpl.java:1466)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>         at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>         at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:701)
> I think the bug occurs because there is no valid state transition from Alert
> via PingTimeout to something recoverable.
> Status.java
>         s_fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
>         s_fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
>         s_fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
>         s_fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
>         s_fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
> As a workaround to get out of this situation, we unmanage the cluster,
> wait 10 minutes, and then manage the cluster again.
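
To illustrate the failure described in the report, here is a minimal, self-contained sketch of a transition table. It is not CloudStack's actual state machine: the class HostStatusFsmSketch and its methods are hypothetical names, the Alert transitions are copied from the Status.java excerpt above, and the Disconnected-to-Alert transition is inferred from the "Agent status update" log line. The Alert-to-Alert self-transition added at the end is only one possible way to make the event tolerable, and is an assumption, not necessarily what the commit does.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the host status FSM; not CloudStack code.
public class HostStatusFsmSketch {
    enum Status { Up, Connecting, Disconnected, Alert, Removed }

    enum Event {
        AgentConnected, Ping, Remove, ManagementServerDown,
        AgentDisconnected, ShutdownRequested, PingTimeout
    }

    private final Map<Status, Map<Event, Status>> transitions = new EnumMap<>(Status.class);

    void addTransition(Status from, Event event, Status to) {
        transitions.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(event, to);
    }

    // Returns the next status, or empty when no transition is registered.
    Optional<Status> next(Status from, Event event) {
        Map<Event, Status> byEvent = transitions.get(from);
        return byEvent == null ? Optional.empty() : Optional.ofNullable(byEvent.get(event));
    }

    // Builds the subset of the table that the report shows.
    static HostStatusFsmSketch reportedTable() {
        HostStatusFsmSketch fsm = new HostStatusFsmSketch();
        // Inferred from the log: old status = Disconnected, event = PingTimeout, new status = Alert.
        fsm.addTransition(Status.Disconnected, Event.PingTimeout, Status.Alert);
        // Copied from the Status.java excerpt in the report.
        fsm.addTransition(Status.Alert, Event.AgentConnected, Status.Connecting);
        fsm.addTransition(Status.Alert, Event.Ping, Status.Up);
        fsm.addTransition(Status.Alert, Event.Remove, Status.Removed);
        fsm.addTransition(Status.Alert, Event.ManagementServerDown, Status.Alert);
        fsm.addTransition(Status.Alert, Event.AgentDisconnected, Status.Alert);
        fsm.addTransition(Status.Alert, Event.ShutdownRequested, Status.Disconnected);
        return fsm;
    }

    public static void main(String[] args) {
        HostStatusFsmSketch fsm = reportedTable();

        // First monitor cycle: the host moves from Disconnected to Alert.
        System.out.println(fsm.next(Status.Disconnected, Event.PingTimeout)); // Optional[Alert]

        // Next cycle: no transition exists, which corresponds to the exception in the report.
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout)); // Optional.empty

        // Assumption: a self-transition would let a repeated PingTimeout pass without an exception.
        fsm.addTransition(Status.Alert, Event.PingTimeout, Status.Alert);
        System.out.println(fsm.next(Status.Alert, Event.PingTimeout)); // Optional[Alert]
    }
}
```

In this sketch a missing transition is reported as an empty Optional rather than an exception, so the gap in the table is easy to see. A host in Alert can still leave that state through Ping or AgentConnected, as in the excerpt.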



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
