The ping between mgt server and ssvm fails because mgt sends disconnect message to all nodes. If you look at the logs I pasted in first email, the mgt server thinks ssvm is lagging behind on ping and sends a disconnect message without investigation for all nodes. Also it happens at the beginning of every hour.
So I'm sure network is not the issue here. Sent from my iPhone > On 25-Jul-2019, at 4:46 PM, Andrija Panic <andrija.pa...@gmail.com> wrote: > > since basic network connectivity (ping failures) was down between mgmts and > nodes (and SSVM on it) - I would point my finger to your networking > equipment - i.e. I expect zero problems with ACS (since pings fail). > > Let us know how it goes. > > Andrija > >> On Thu, 25 Jul 2019 at 16:04, Rakesh v <www.rakeshv....@gmail.com> wrote: >> >> Yes I was monitoring it continuously. Below are the steps which I was >> doing when issue happened >> >> >> 1. Ping from MGT server to ssvm >> 2. Ping from ssvm to secondary storage ip >> 3. Ping from ssvm to public IP like 8.8.8.8 >> 4. Ping from MGT server to node in which ssvm was running >> >> >> Out of all these, the ping drops were observed from MGT server to ssvm and >> mgt server to nodes. Basically all nodes lost connection. Then it recovered >> itself after 1 minute. >> >> >> Sent from my iPhone >> >>> On 25-Jul-2019, at 3:48 PM, Andrija Panic <andrija.pa...@gmail.com> >> wrote: >>> >>> Can you observe the status of SSVM (is it >> UP/Connecting/Disconnected/Down) >>> while you have issues? >>> >>> I would advise checking your Secondary Storage itself - and also running >>> the SSVM diagnose script /usr/local/cloud/systemvm/ssvm-check.sh - >> observe >>> if any errors with NFS or others. >>> >>> Lastly - and don't laugh - check that you don't have issues with >> networking >>> equipment (some of us had VEEEERY strange issues in connectivity some >> years >>> ago with crappy QCT/Quanta Switches in MLAG setup) >>> >>> Andrija >>> >>>> On Thu, 25 Jul 2019 at 15:42, Rakesh v <www.rakeshv....@gmail.com> >> wrote: >>>> >>>> Yes I have set the ip's of the three MGT servers in the "host" field >>>> >>>> Sent from my iPhone >>>> >>>>> On 25-Jul-2019, at 2:14 PM, Pierre-Luc Dion <pdion...@apache.org> >> wrote: >>>>> >>>>> Do you have a load balancer in front of cloudstack? Did you set the >>>> global >>>>> settings "host" to the ip of the mgmt server? >>>>> >>>>> >>>>> Le jeu. 25 juill. 2019 03 h 24, Rakesh Venkatesh < >>>> www.rakeshv....@gmail.com> >>>>> a écrit : >>>>> >>>>>> Hello People >>>>>> >>>>>> >>>>>> I have a strange issue where mgt server times out to send a command to >>>>>> secondary storage VM every hour and because of this UI won't be >>>> accessible >>>>>> for a short duration of time. Sometimes I have to restart mgt server >> to >>>> get >>>>>> it back to working state and sometimes I don't need to restart it. I >>>> also >>>>>> see some exceptions while fetching the storage stats. >>>>>> >>>>>> >>>>>> The log says secondary storage VM is lagging behind mgt server in ping >>>> and >>>>>> it sends a disconnect message to other components. Can you let me know >>>> how >>>>>> to troubleshoot this issue? I destroyed the secondary storage VM but >> the >>>>>> issue still persists. I checked the date/time on the mgt server and >> SSVM >>>>>> and they are same. This is happening for quite a few days now. Below >> are >>>>>> the logs >>>>>> >>>>>> >>>>>> >>>>>> 2019-07-25 04:01:22,769 INFO [c.c.a.m.AgentManagerImpl] >>>>>> (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Found the following >>>> agents >>>>>> behind on ping: [183] >>>>>> 2019-07-25 04:01:22,775 WARN [c.c.a.m.AgentManagerImpl] >>>>>> (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Disconnect agent for >>>>>> CPVM/SSVM due to physical connection close. host: 183 >>>>>> 2019-07-25 04:01:22,778 INFO [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Host 183 is >>>> disconnecting >>>>>> with event ShutdownRequested >>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) The next status of >> agent >>>>>> 183is Disconnected, current status is Up >>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Deregistering link for >>>> 183 >>>>>> with state Disconnected >>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Remove Agent : 183 >>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.ConnectedAgentAttache] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Processing Disconnect. >>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Seq >>>>>> 183-7541559051008607242: Sending disconnect to class >>>>>> com.cloud.agent.manager.SynchronousListener >>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> com.cloud.hypervisor.xenserver.discoverer.XcpServerDiscoverer >>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.u.n.NioConnection] >>>>>> (pool-2-thread-1:null) (logid:) Closing socket Socket[addr=/ >>>> 172.30.32.16 >>>>>> ,port=38250,localport=8250] >>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq >>>>>> 183-7541559051008607242: Waiting some more time because this is the >>>> current >>>>>> command >>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> com.cloud.hypervisor.hyperv.discoverer.HypervServerDiscoverer >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentAttache] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq >>>>>> 183-7541559051008607242: Waiting some more time because this is the >>>> current >>>>>> command >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.deploy.DeploymentPlanningManagerImpl >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.network.security.SecurityGroupListener >>>>>> 2019-07-25 04:01:22,783 INFO [c.c.u.e.CSExceptionErrorCode] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Could not find >>>> exception: >>>>>> com.cloud.exception.OperationTimedoutException in error code list for >>>>>> exceptions >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> org.apache.cloudstack.engine.orchestration.NetworkOrchestrator >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.vm.ClusteredVirtualMachineManagerImpl >>>>>> 2019-07-25 04:01:22,783 WARN [c.c.a.m.AgentAttache] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq >>>>>> 183-7541559051008607242: Timed out on null >>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.storage.listener.StoragePoolMonitor >>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentAttache] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq >>>>>> 183-7541559051008607242: Cancelling. >>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.storage.secondary.SecondaryStorageListener >>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.network.SshKeysDistriMonitor >>>>>> 2019-07-25 04:01:22,785 DEBUG [o.a.c.s.RemoteHostEndPoint] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Failed to send >> command, >>>>>> due to Agent:183, com.cloud.exception.OperationTimedoutException: >>>> Commands >>>>>> 7541559051008607242 to Host 183 timed out after 3600 >>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> com.cloud.network.router.VpcVirtualNetworkApplianceManagerImpl >>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.storage.download.DownloadListener >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> 2019-07-25 04:01:22,785 ERROR [c.c.s.StatsCollector] >>>>>> (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Error trying to >>>> retrieve >>>>>> storage stats >>>>>> com.cloud.utils.exception.CloudRuntimeException: Failed to send >> command, >>>>>> due to Agent:183, com.cloud.exception.OperationTimedoutException: >>>> Commands >>>>>> 7541559051008607242 to Host 183 timed out after 3600 >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:133) >>>>>> at >>>>>> >>>>>> >>>> >> com.cloud.server.StatsCollector$StorageCollector.runInContext(StatsCollector.java:1139) >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49) >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56) >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103) >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53) >>>>>> at >>>>>> >>>>>> >>>> >> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46) >>>>>> at >>>>>> >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >>>>>> at >>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) >>>>>> at >>>>>> >>>>>> >>>> >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) >>>>>> at >>>>>> >>>>>> >>>> >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) >>>>>> at >>>>>> >>>>>> >>>> >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) >>>>>> at >>>>>> >>>>>> >>>> >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >>>>>> at java.lang.Thread.run(Thread.java:748) >>>>>> 2019-07-25 04:01:22,786 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.consoleproxy.ConsoleProxyListener >>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.storage.LocalStoragePoolListener >>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.storage.upload.UploadListener >>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.capacity.StorageCapacityListener >>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.capacity.ComputeCapacityListener >>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.network.SshKeysDistriMonitor >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: com.cloud.network.router.VirtualNetworkApplianceManagerImpl >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >>>>>> com.cloud.network.NetworkUsageManagerImpl$DirectNetworkStatsListener >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.n.NetworkUsageManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Disconnected called on >>>> 183 >>>>>> with status Disconnected >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> com.cloud.agent.manager.AgentManagerImpl$BehindOnPingListener >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] >>>>>> (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to >>>>>> listener: >> com.cloud.agent.manager.AgentManagerImpl$SetHostParamsListener >>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.h.Status] >>>> (AgentTaskPool-1:ctx-66de2057) >>>>>> (logid:841d2a63) Transition:[Resource state = Enabled, Agent event = >>>>>> ShutdownRequested, Host id = 183, name = s-2775-VM] >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Thanks and regards >>>>>> Rakesh venkatesh >>>>>> >>>> >>> >>> >>> -- >>> >>> Andrija Panić >> > > > -- > > Andrija Panić