I would check a memory dump, or a kill -3 / jstack with the list of classes and instances, before anything else?
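If it helps, both can be captured with jcmd (e.g. "jcmd <pid> GC.heap_dump /tmp/heap.hprof" and "jcmd <pid> Thread.print") or from inside the embedded broker's JVM. The helper below is only a minimal sketch of the in-process variant; the class name and output path are illustrative placeholders, not anything from this thread:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    import com.sun.management.HotSpotDiagnosticMXBean;

    /** Sketch: capture a heap dump and a thread dump from inside the running JVM. */
    public class DumpHelper {

        // Writes an .hprof file containing live objects only (a GC runs first).
        // Fails if the target file already exists.
        static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            diag.dumpHeap(path, true);
        }

        // Prints per-thread stack information, roughly what kill -3 / jstack shows.
        static void dumpThreads() {
            for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }

        public static void main(String[] args) throws Exception {
            dumpHeap("/tmp/artemis-heap.hprof"); // path is an illustrative assumption
            dumpThreads();
        }
    }

Comparing two dumps taken some hours apart should show whether the session factory / consumer objects keep accumulating.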
On Tue, Jan 24, 2023 at 4:20 AM Michal Balicki <michal.bali...@emlpayments.com> wrote:

> Hi,
>
> Just a follow-up on this topic – has anybody faced a similar issue, with
> the federation mechanism ending in OOM on intermittent TCP connection
> problems?
>
> Is there anything we can do to maintain the stability of our installation?
>
> Thanks
> Michal
>
> From: Michal Balicki
> Sent: Tuesday, 3 January 2023 14:43
> To: users@activemq.apache.org
> Subject: RE: Federation queues - issue with cleaning up resources?
>
> It looks like the images were not correctly added to the original message –
> attaching them now.
>
> Thanks
> Michal Balicki
>
> From: Michal Balicki
> Sent: Tuesday, 3 January 2023 10:22
> To: users@activemq.apache.org
> Subject: Federation queues - issue with cleaning up resources?
>
> Hi,
>
> In our installation we use Artemis 2.27.1 embedded in Spring Boot 2.7.6.
> Recently two separate Artemis clusters in different DCs were joined using
> federated queues (both downstream and upstream mode). Since then we have
> observed heap memory spikes from time to time. It looks like this could be
> related to improper handling of dead TCP connections caused by network
> issues.
>
> The following is a snippet of the federation queue configuration on the
> nodes where federation is set up:
>
> var federationUpstreamConfiguration = new FederationUpstreamConfiguration();
>
> federationUpstreamConfiguration.setName(
>         String.format("federation-upstream-config-for-%s", federationName));
>
> federationUpstreamConfiguration.getConnectionConfiguration()
>         .setShareConnection(true)
>         .setStaticConnectors(Collections.singletonList(connectorName))
>         .setCircuitBreakerTimeout(5000)
>         .setHA(false)
>         .setClientFailureCheckPeriod(ActiveMQDefaultConfiguration.getDefaultFederationFailureCheckPeriod())
>         .setConnectionTTL(ActiveMQDefaultConfiguration.getDefaultFederationConnectionTtl())
>         .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
>         .setRetryIntervalMultiplier(ActiveMQDefaultConfiguration.getDefaultFederationRetryIntervalMultiplier())
>         .setMaxRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationMaxRetryInterval())
>         .setInitialConnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationInitialConnectAttempts())
>         .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts())
>         .setCallTimeout(ActiveMQClient.DEFAULT_CALL_TIMEOUT)
>         .setCallFailoverTimeout(ActiveMQClient.DEFAULT_CALL_FAILOVER_TIMEOUT);
>
> federationUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);
>
> This is what we observe in the logs:
>
> 2023-01-02T16:46:17.212+01:00 2023-01-02 15:46:17.212 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38732 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38732 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:47:19.215+01:00 2023-01-02 15:47:19.215 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:57708 has been detected: AMQ229014: Did not receive data from /10.112.62.33:57708 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:48:21.217+01:00 2023-01-02 15:48:21.217 [104] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56402 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56402 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:49:23.220+01:00 2023-01-02 15:49:23.220 [2466] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56760 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56760 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:50:25.222+01:00 2023-01-02 15:50:25.222 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:43358 has been detected: AMQ229014: Did not receive data from /10.112.62.33:43358 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:51:27.224+01:00 2023-01-02 15:51:27.224 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:34982 has been detected: AMQ229014: Did not receive data from /10.112.62.33:34982 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:52:29.227+01:00 2023-01-02 15:52:29.227 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40942 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40942 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:53:31.229+01:00 2023-01-02 15:53:31.229 [92] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:37634 has been detected: AMQ229014: Did not receive data from /10.112.62.33:37634 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:54:33.232+01:00 2023-01-02 15:54:33.231 [3397] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:58302 has been detected: AMQ229014: Did not receive data from /10.112.62.33:58302 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:55:35.234+01:00 2023-01-02 15:55:35.234 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:59364 has been detected: AMQ229014: Did not receive data from /10.112.62.33:59364 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:56:37.236+01:00 2023-01-02 15:56:37.236 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40052 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40052 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:57:39.239+01:00 2023-01-02 15:57:39.239 [105] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38354 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38354 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:58:41.241+01:00 2023-01-02 15:58:41.241 [98] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:42590 has been detected: AMQ229014: Did not receive data from /10.112.62.33:42590 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:59:43.243+01:00 2023-01-02 15:59:43.243 [88] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:35532 has been detected: AMQ229014: Did not receive data from /10.112.62.33:35532 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
>
> What is strange is the consumer count on the upstream nodes, which is often
> far above any reasonable value – the following is a scrape of the
> artemis_consumer_count metric for a queue that is federated:
>
> You can see a strange increase from 0 to hundreds and then, some days later,
> a cleanup. It is as if new consumers are being created and the old ones are
> not being removed.
>
> Similarly, from time to time we observe an enormous number of sessions being
> maintained on the upstream nodes – e.g. 820 on a single connection.
>
> When you drill in, you can see that the majority of the sessions on this
> connection were created at the same time.
>
> The only way to get rid of this is then to manually close the connection
> from the console.
>
> Heap usage:
>
> When analysing the memory dump using MAT, the following is reported (the
> numbers after each entry are shallow heap / retained heap in bytes, plus the
> percentage of the heap where shown):
>
> Problem Suspect 1
>
> One instance of "org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl"
> loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788"
> occupies 65 125 032 (34,91%) bytes. The memory is accumulated in one instance
> of "java.util.HashMap$Node[]", loaded by "<system class loader>", which
> occupies 65 119 184 (34,90%) bytes.
>
> Keywords
> org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl
> org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788
> java.util.HashMap$Node[]
>
> java.util.HashMap$Node[256] @ 0xb6cfca48    1 040 / 65 119 184
> \ table  java.util.HashMap @ 0xb1bc46c8    48 / 65 119 248
>   \ map  java.util.HashSet @ 0xb1bc46b8    16 / 65 119 264
>     \ factories  org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620    136 / 65 125 032
>       + serverLocator  org.apache.activemq.artemis.core.server.federation.FederationConnection @ 0xb1bc45f0    48 / 48
>       | \ connection  org.apache.activemq.artemis.core.server.federation.FederationUpstream @ 0xb1bc4530    48 / 480
>       |   \ upstream  org.apache.activemq.artemis.core.server.federation.queue.FederatedQueue @ 0xb1bc43a0    56 / 7 392
>       |     \ [2]  java.lang.Object[4] @ 0xb16fe5b8    32 / 48
>       |       \ array  java.util.concurrent.CopyOnWriteArrayList @ 0xb16fe590    24 / 88
>       |         \ brokerPlugins  org.apache.activemq.artemis.core.config.impl.FileConfiguration @ 0xb16f7460    592 / 7 272
>       |           \ configuration  org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl @ 0xb16f7208    280 / 4 168
>       |             + server  org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl @ 0xb3387f60    96 / 1 952
>       |             | \ this$0  org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread @ 0xb33ce5f8  activemq-failure-check-thread Thread    128 / 336
>       |             + server  org.apache.activemq.artemis.core.server.impl.ServerStatus @ 0xb1bc8d38 »    24 / 1 184
>       |             \ Total: 2 entries
>       + serverLocator  org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd090a7c8 »
>
> org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620    136 / 65 125 032 / 34,91%
> \ java.util.HashSet @ 0xb1bc46b8    16 / 65 119 264 / 34,90%
>   \ java.util.HashMap @ 0xb1bc46c8    48 / 65 119 248 / 34,90%
>     \ java.util.HashMap$Node[256] @ 0xb6cfca48    1 040 / 65 119 184 / 34,90%
>       + java.util.HashMap$Node @ 0xb59106b8    32 / 236 464 / 0,13%
>       + java.util.HashMap$Node @ 0xb620b438    32 / 181 944 / 0,10%
>       + java.util.HashMap$Node @ 0xb3e5c6a0    32 / 169 344 / 0,09%
>       + java.util.HashMap$Node @ 0xb3bb0cd8    32 / 147 120 / 0,08%
>       + java.util.HashMap$Node @ 0xb5169f88    32 / 139 184 / 0,07%
>       + java.util.HashMap$Node @ 0xb4456400    32 / 131 296 / 0,07%
>       + java.util.HashMap$Node @ 0xb59e6668    32 / 126 616 / 0,07%
>       + java.util.HashMap$Node @ 0xb58818a0    32 / 121 504 / 0,07%
>       + java.util.HashMap$Node @ 0xb5839f40    32 / 120 560 / 0,06%
>       + java.util.HashMap$Node @ 0xb84fd158    32 / 120 496 / 0,06%
>       + java.util.HashMap$Node @ 0xb58104a8    32 / 118 000 / 0,06%
>       + java.util.HashMap$Node @ 0xb5947bd8    32 / 115 920 / 0,06%
>       + java.util.HashMap$Node @ 0xb84e3be8    32 / 115 024 / 0,06%
>       + java.util.HashMap$Node @ 0xb8488828    32 / 113 400 / 0,06%
>       + java.util.HashMap$Node @ 0xb3bc5258    32 / 110 704 / 0,06%
>       + java.util.HashMap$Node @ 0xb3e2cdc8    32 / 110 208 / 0,06%
>       + java.util.HashMap$Node @ 0xb57bda48    32 / 109 688 / 0,06%
>       + java.util.HashMap$Node @ 0xb57d2880    32 / 109 232 / 0,06%
>       + org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xb62518c8    192 / 106 944 / 0,06%
>       + org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd0b00a40    192 / 98 336 / 0,05%
>       \ Total: 20 entries    960 / 2 601 984 / 1,39%
>
> When the federation config is removed on the selected node, memory
> consumption on this node comes back to normal.
>
> Thanks
> Michal Balicki

--
Clebert Suconic
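On the "manually close the connection from the console" workaround mentioned above: a similar operation is exposed by the broker's management API (closing all connections from a given remote address), so it can be scripted as a stop-gap while the leak is investigated. The sketch below is only illustrative; it assumes a reference to the embedded ActiveMQServer is at hand, hard-codes the upstream address, and treats the symptom rather than the cause:

    import org.apache.activemq.artemis.api.core.management.ActiveMQServerControl;
    import org.apache.activemq.artemis.core.server.ActiveMQServer;

    /**
     * Rough sketch only: closes every connection coming from one remote address,
     * mirroring the manual "close connection" action in the console. The address
     * and the decision of when to call this are illustrative assumptions.
     */
    public final class StaleFederationConnectionReaper {

        private static final String UPSTREAM_BROKER_IP = "10.112.62.33"; // assumption

        private final ActiveMQServer server; // the embedded broker instance

        public StaleFederationConnectionReaper(ActiveMQServer server) {
            this.server = server;
        }

        public void closeUpstreamConnections() throws Exception {
            ActiveMQServerControl control = server.getActiveMQServerControl();
            boolean closed = control.closeConnectionsForAddress(UPSTREAM_BROKER_IP);
            if (closed) {
                // Federation re-establishes its link on the configured retry interval.
                System.out.println("Closed connections from " + UPSTREAM_BROKER_IP);
            }
        }
    }

Whether the ClientSessionFactoryImpl entries then disappear from the ServerLocatorImpl "factories" set seen in the MAT report is exactly what a follow-up heap dump would confirm.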