Hello Community,

We recently upgraded our Apache CloudStack (ACS) environment from version
*4.19.1* to *4.19.3*. Since the upgrade, the *cloudstack-management* service
on our management nodes fails intermittently and does not recover until we
restart it manually.

Reviewing the logs on the affected management servers, we found that
*java.lang.OutOfMemoryError: Java heap space* exceptions are being logged
repeatedly.

Below are the logs.

> java[1039537]: INFO  [c.c.k.c.KubernetesClusterManagerImpl]
> (Kubernetes-Cluster-State-Scanner-1:ctx-56efc1be) (logid:e0d2bb71) Running
> Kubernetes cluster state scanner on Kubernetes cluster : Test-Cluster for
> state: Alert

> java[1039537]: java.lang.OutOfMemoryError: Java heap space
>
> java[1039537]: Dumping heap to
>> /var/log/cloudstack/management/java_pid1039537.hprof ...
>
> java[1039537]: Heap dump file created [1850233163 bytes in 21.923 secs]
>
> java[1039537]: WARN  [c.c.u.d.Merovingian2]
>> (Network-Scavenger-1:ctx-8673d47e) (logid:7e2df3dc) Timed out on acquiring
>> lock networks4235 .  Waited for 600seconds
>
> java[1039537]: com.cloud.utils.exception.CloudRuntimeException: Timed out
>> on acquiring lock networks4235 .  Waited for 600seconds
>
> java[1039537]:         at
>> com.cloud.utils.db.Merovingian2.acquire(Merovingian2.java:151)
>
> java[1039537]:         at
>> com.cloud.utils.db.TransactionLegacy.lock(TransactionLegacy.java:386)
>
> java[1039537]:         at
>> com.cloud.utils.db.GenericDaoBase.acquireInLockTable(GenericDaoBase.java:1074)
>
> java[1039537]:         at
>> jdk.internal.reflect.GeneratedMethodAccessor474.invoke(Unknown Source)
>
> java[1039537]:         at
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> java[1039537]:         at
>> java.base/java.lang.reflect.Method.invoke(Method.java:566)
>
> java[1039537]:         at
>> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
>
> java[1039537]:         at
>> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
>
> java[1039537]:         at
>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
>
> java[1039537]:         at
>> com.cloud.utils.db.TransactionContextInterceptor.invoke(TransactionContextInterceptor.java:34)
>
> java[1039537]:         at
>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
>
> java[1039537]:         at
>> org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
>
> java[1039537]:         at
>> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
>
> java[1039537]:         at
>> org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
>
> java[1039537]:         at
>> com.sun.proxy.$Proxy52.acquireInLockTable(Unknown Source)
>
> java[1039537]:         at
>> org.apache.cloudstack.engine.orchestration.NetworkOrchestrator.shutdownNetwork(NetworkOrchestrator.java:3080)
>
> java[1039537]:         at
>> org.apache.cloudstack.engine.orchestration.NetworkOrchestrator$NetworkGarbageCollector.reallyRun(NetworkOrchestrator.java:3530)
>
> java[1039537]:         at
>> org.apache.cloudstack.engine.orchestration.NetworkOrchestrator$NetworkGarbageCollector.runInContext(NetworkOrchestrator.java:3466)
>
> java[1039537]:         at
>> org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
>
> java[1039537]:         at
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
>
> java[1039537]:         at
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
>
> java[1039537]:         at
>> org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
>
> java[1039537]:         at
>> org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
>
> java[1039537]:         at
>> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>
> java[1039537]:         at
>> java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
>
> java[1039537]:         at
>> java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>
> java[1039537]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>
> java[1039537]:         at
>> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>
> java[1039537]:         at java.base/java.lang.Thread.run(Thread.java:829)
>
> java[1039537]: WARN  [o.a.c.e.o.NetworkOrchestrator]
>> (Network-Scavenger-1:ctx-8673d47e) (logid:7e2df3dc) Network with id: 4235
>> doesn't exist, or unable to acquire lock for it as a part of network
>> shutdown
>
> java[1039537]: ERROR [c.c.c.ClusterManagerImpl]
>> (Cluster-Heartbeat-1:ctx-cdbfb71b) (logid:cd4a1b55) We have detected that
>> at least one management server peer reports that this management server is
>> down, perform active fencing
>
> Jul 02 05:43:10 sc-mgmt-03.speedcloud.co.in java[1039537]: INFO
>> [c.c.a.m.AgentManagerImpl]
>> (AgentMonitor-1:ctx-0fb8adb7) (logid:581cfc3f) Found the following agents
>> behind on ping: [58, 67, 116, 129, 147]
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
> java[1039537]: INFO  [o.a.c.v.s.VMSchedulerImpl]
>> (VMSchedulerPollTask:ctx-8b5f50ea) (logid:69b8fd3c) Cleaned up 0 VM
>> scheduled job entries
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.37
>
> java[1039537]: INFO  [c.c.k.c.KubernetesClusterManagerImpl]
>> (Kubernetes-Cluster-State-Scanner-1:ctx-56efc1be) (logid:e0d2bb71) Running
>> Kubernetes cluster state scanner on Kubernetes cluster : Stack-k8s-Test for
>> state: Alert
>
> java[1039537]: WARN  [c.c.n.r.VirtualNetworkApplianceManagerImpl]
>> (RouterStatusMonitor-1:ctx-cbfa09ce) (logid:56ce8a0c) Unable to fetch basic
>> router test results data from router r-15355-VM
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.37
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.37
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
> java[1039537]: WARN  [c.c.a.d.ParamGenericValidationWorker]
>> (qtp940584193-1652:ctx-d5a00ae4 ctx-0e610389 ctx-a37dcc30) (logid:a55ae204)
>> Received unknown parameters for command listHosts. Unknown parameters :
>> filter
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.37
>
> java[1039537]: INFO  [c.c.s.d.VolumeStatsDaoImpl]
>> (StatsCollector-2:ctx-b81c41e3) (logid:a37e0ad7) Removed a total of [0]
>> volume_stats rows older than [Tue Jul 01 17:43:10 UTC 2025].
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
> java[1039537]: ERROR [c.c.s.StatsCollector]
>> (StatsCollector-2:ctx-49034cc7) (logid:9cc0b9c6) db statistics collection
>> failed due to / by zero
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.37
>
> java[1039537]: ERROR [c.c.s.StatsCollector]
>> (StatsCollector-2:ctx-d3b2a30a) (logid:657e156f) db statistics collection
>> failed due to / by zero
>
> java[1039537]: ERROR [c.c.c.ClusterManagerImpl]
>> (Cluster-Heartbeat-1:ctx-c8128059) (logid:4b587df9) We have detected that
>> at least one management server peer reports that this management server is
>> down, perform active fencing
>
> Jul 02 05:43:11 sc-mgmt-03.speedcloud.co.in java[1039537]: INFO
>> [c.c.a.m.AgentManagerImpl]
>> (AgentTaskPool-1:ctx-04653cf4) (logid:514252d2) Investigating why host 58
>> has disconnected with event PingTimeout
>
> java[1039537]: ERROR [c.c.c.ClusterFenceManagerImpl]
>> (Cluster-Notification-1:ctx-086ed801) (logid:8cd30369) Received node
>> isolation notification, will perform self-fencing and shut myself down
>
> java[1039537]: WARN  [c.c.c.ClusterServiceServletContainer]
>> (Thread-26:null) (logid:) Failure during validating cluster request from
>> 10.232.8.36
>
>

Additionally, we checked the */etc/default/cloudstack-management*
configuration file and found the following Java options configured under
*JAVA_OPTS*:

JAVA_OPTS="-Djava.security.properties=/etc/cloudstack/management/java.security.ciphers
-Djava.awt.headless=true -Dcom.sun.management.jmxremote=false -Xmx2G
-XX:+UseParallelGC -XX:MaxGCPauseMillis=500 -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/cloudstack/management/
-XX:ErrorFile=/var/log/cloudstack/management/cloudstack-management.err
-Djava.util.Arrays.useLegacyMergeSort=true "


In addition, after removing the *-Djava.util.Arrays.useLegacyMergeSort=true*
option from *JAVA_OPTS*, we are unable to fetch the Public IPs from either the
dashboard or CMK. The request fails with the error *Request failed (431):
comparison method violates its general contract*.
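
For background on that message: since Java 7, the JDK's object sort (TimSort) can throw an IllegalArgumentException with this text when it notices that a comparator gives inconsistent answers, and -Djava.util.Arrays.useLegacyMergeSort=true switches back to the older merge sort, which does not perform that check. We do not know which comparator is involved in the Public IP listing. The sketch below is only a generic illustration of the kind of inconsistency that triggers it; the class name, the sample values and the subtraction-based comparator are all hypothetical and are not taken from CloudStack.

```java
import java.util.Comparator;
import java.util.List;

public class ComparatorContractCheck {

    // Returns true if, for some pair (a, b) in samples, the comparator breaks
    // antisymmetry: sgn(compare(a, b)) must equal -sgn(compare(b, a)).
    static <T> boolean violatesAntisymmetry(Comparator<T> cmp, List<T> samples) {
        for (T a : samples) {
            for (T b : samples) {
                int ab = Integer.signum(cmp.compare(a, b));
                int ba = Integer.signum(cmp.compare(b, a));
                if (ab != -ba) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical sample values chosen to include the int extremes.
        List<Integer> samples = List.of(Integer.MIN_VALUE, -1, 0, 1, Integer.MAX_VALUE);

        // Subtraction can overflow, so this comparator is not consistent.
        Comparator<Integer> broken = (a, b) -> a - b;
        // Integer.compare does not overflow and honours the contract.
        Comparator<Integer> sound = Integer::compare;

        System.out.println("broken violates contract: " + violatesAntisymmetry(broken, samples));
        System.out.println("sound violates contract:  " + violatesAntisymmetry(sound, samples));
    }
}
```

A check like this only covers antisymmetry on the given samples; a comparator can also break the contract through non-transitive results or by comparing on values that change while the sort is running.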
