I would check a memory dump, or a kill -3 / jstack with the list of classes and instances, before anything else?
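If it helps, both can be captured with jcmd (e.g. "jcmd <pid> GC.heap_dump /tmp/heap.hprof" and "jcmd <pid> Thread.print") or from inside the embedded broker's JVM. The helper below is only a minimal sketch of the in-process variant; the class name and output path are illustrative placeholders, not anything from this thread:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;

    import com.sun.management.HotSpotDiagnosticMXBean;

    /** Sketch: capture a heap dump and a thread dump from inside the running JVM. */
    public class DumpHelper {

        // Writes an .hprof file containing live objects only (a GC runs first).
        // Fails if the target file already exists.
        static void dumpHeap(String path) throws Exception {
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            diag.dumpHeap(path, true);
        }

        // Prints per-thread stack information, roughly what kill -3 / jstack shows.
        static void dumpThreads() {
            for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
                System.out.print(info);
            }
        }

        public static void main(String[] args) throws Exception {
            dumpHeap("/tmp/artemis-heap.hprof"); // path is an illustrative assumption
            dumpThreads();
        }
    }

Comparing two dumps taken some hours apart should show whether the session factory / consumer objects keep accumulating.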
On Tue, Jan 24, 2023 at 4:20 AM Michal Balicki <michal.bali...@emlpayments.com> wrote:

> Hi,
>
> Just a follow-up on this topic – has anybody faced a similar issue, with
> the federation mechanism ending in OOM on intermittent TCP connection
> problems?
>
> Is there anything we can do to maintain the stability of our installation?
>
> Thanks
> Michal
>
> From: Michal Balicki
> Sent: Tuesday, 3 January 2023 14:43
> To: users@activemq.apache.org
> Subject: RE: Federation queues - issue with cleaning up resources?
>
> It looks like the images were not correctly added to the original message –
> attaching them now.
>
> Thanks
> Michal Balicki
>
> From: Michal Balicki
> Sent: Tuesday, 3 January 2023 10:22
> To: users@activemq.apache.org
> Subject: Federation queues - issue with cleaning up resources?
>
> Hi,
>
> In our installation we use Artemis 2.27.1 embedded in Spring Boot 2.7.6.
> Recently two separate Artemis clusters in different DCs were joined using
> federated queues (both downstream and upstream mode). Since then we have
> observed heap memory spikes from time to time. It looks like this could be
> related to improper handling of dead TCP connections caused by network
> issues.
>
> The following is a snippet of the federation queue configuration on the
> nodes where federation is set up:
>
> var federationUpstreamConfiguration = new FederationUpstreamConfiguration();
>
> federationUpstreamConfiguration.setName(
>         String.format("federation-upstream-config-for-%s", federationName));
>
> federationUpstreamConfiguration.getConnectionConfiguration()
>         .setShareConnection(true)
>         .setStaticConnectors(Collections.singletonList(connectorName))
>         .setCircuitBreakerTimeout(5000)
>         .setHA(false)
>         .setClientFailureCheckPeriod(ActiveMQDefaultConfiguration.getDefaultFederationFailureCheckPeriod())
>         .setConnectionTTL(ActiveMQDefaultConfiguration.getDefaultFederationConnectionTtl())
>         .setRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationRetryInterval())
>         .setRetryIntervalMultiplier(ActiveMQDefaultConfiguration.getDefaultFederationRetryIntervalMultiplier())
>         .setMaxRetryInterval(ActiveMQDefaultConfiguration.getDefaultFederationMaxRetryInterval())
>         .setInitialConnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationInitialConnectAttempts())
>         .setReconnectAttempts(ActiveMQDefaultConfiguration.getDefaultFederationReconnectAttempts())
>         .setCallTimeout(ActiveMQClient.DEFAULT_CALL_TIMEOUT)
>         .setCallFailoverTimeout(ActiveMQClient.DEFAULT_CALL_FAILOVER_TIMEOUT);
>
> federationUpstreamConfiguration.addPolicyRef(queuePolicyNameUpstream);
>
> This is what we observe in the logs:
>
> 2023-01-02T16:46:17.212+01:00 2023-01-02 15:46:17.212 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38732 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38732 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:47:19.215+01:00 2023-01-02 15:47:19.215 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:57708 has been detected: AMQ229014: Did not receive data from /10.112.62.33:57708 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:48:21.217+01:00 2023-01-02 15:48:21.217 [104] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56402 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56402 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:49:23.220+01:00 2023-01-02 15:49:23.220 [2466] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:56760 has been detected: AMQ229014: Did not receive data from /10.112.62.33:56760 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:50:25.222+01:00 2023-01-02 15:50:25.222 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:43358 has been detected: AMQ229014: Did not receive data from /10.112.62.33:43358 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:51:27.224+01:00 2023-01-02 15:51:27.224 [110] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:34982 has been detected: AMQ229014: Did not receive data from /10.112.62.33:34982 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:52:29.227+01:00 2023-01-02 15:52:29.227 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40942 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40942 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:53:31.229+01:00 2023-01-02 15:53:31.229 [92] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:37634 has been detected: AMQ229014: Did not receive data from /10.112.62.33:37634 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:54:33.232+01:00 2023-01-02 15:54:33.231 [3397] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:58302 has been detected: AMQ229014: Did not receive data from /10.112.62.33:58302 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:55:35.234+01:00 2023-01-02 15:55:35.234 [78] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:59364 has been detected: AMQ229014: Did not receive data from /10.112.62.33:59364 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:56:37.236+01:00 2023-01-02 15:56:37.236 [97] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:40052 has been detected: AMQ229014: Did not receive data from /10.112.62.33:40052 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:57:39.239+01:00 2023-01-02 15:57:39.239 [105] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:38354 has been detected: AMQ229014: Did not receive data from /10.112.62.33:38354 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:58:41.241+01:00 2023-01-02 15:58:41.241 [98] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:42590 has been detected: AMQ229014: Did not receive data from /10.112.62.33:42590 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 2023-01-02T16:59:43.243+01:00 2023-01-02 15:59:43.243 [88] WARN o.a.a.a.c.client - AMQ212037: Connection failure to /10.112.62.33:35532 has been detected: AMQ229014: Did not receive data from /10.112.62.33:35532 within the 60000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
>
> What is strange is the consumer count on the upstream nodes, which is often
> far above any reasonable value – the following is a scrape of the
> artemis_consumer_count metric for a queue that is federated:
>
> You can see a strange increase from 0 to hundreds and then, some days later,
> a cleanup. It is as if new consumers are being created and the old ones are
> not being removed.
>
> Similarly, from time to time we observe an enormous number of sessions being
> maintained on the upstream nodes – e.g. 820 on a single connection.
>
> When you drill in, you can see that the majority of the sessions on this
> connection were created at the same time.
>
> The only way to get rid of this is then to manually close the connection
> from the console.
>
> Heap usage:
>
> When analysing the memory dump using MAT, the following is reported (the
> numbers after each entry are shallow heap / retained heap in bytes, plus the
> percentage of the heap where shown):
>
> Problem Suspect 1
>
> One instance of "org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl"
> loaded by "org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788"
> occupies 65 125 032 (34,91%) bytes. The memory is accumulated in one instance
> of "java.util.HashMap$Node[]", loaded by "<system class loader>", which
> occupies 65 119 184 (34,90%) bytes.
>
> Keywords
> org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl
> org.springframework.boot.loader.LaunchedURLClassLoader @ 0xb02ba788
> java.util.HashMap$Node[]
>
> java.util.HashMap$Node[256] @ 0xb6cfca48    1 040 / 65 119 184
> \ table  java.util.HashMap @ 0xb1bc46c8    48 / 65 119 248
>   \ map  java.util.HashSet @ 0xb1bc46b8    16 / 65 119 264
>     \ factories  org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620    136 / 65 125 032
>       + serverLocator  org.apache.activemq.artemis.core.server.federation.FederationConnection @ 0xb1bc45f0    48 / 48
>       | \ connection  org.apache.activemq.artemis.core.server.federation.FederationUpstream @ 0xb1bc4530    48 / 480
>       |   \ upstream  org.apache.activemq.artemis.core.server.federation.queue.FederatedQueue @ 0xb1bc43a0    56 / 7 392
>       |     \ [2]  java.lang.Object[4] @ 0xb16fe5b8    32 / 48
>       |       \ array  java.util.concurrent.CopyOnWriteArrayList @ 0xb16fe590    24 / 88
>       |         \ brokerPlugins  org.apache.activemq.artemis.core.config.impl.FileConfiguration @ 0xb16f7460    592 / 7 272
>       |           \ configuration  org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl @ 0xb16f7208    280 / 4 168
>       |             + server  org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl @ 0xb3387f60    96 / 1 952
>       |             | \ this$0  org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread @ 0xb33ce5f8  activemq-failure-check-thread Thread    128 / 336
>       |             + server  org.apache.activemq.artemis.core.server.impl.ServerStatus @ 0xb1bc8d38 »    24 / 1 184
>       |             \ Total: 2 entries
>       + serverLocator  org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd090a7c8 »
>
> org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl @ 0xb1bc4620    136 / 65 125 032 / 34,91%
> \ java.util.HashSet @ 0xb1bc46b8    16 / 65 119 264 / 34,90%
>   \ java.util.HashMap @ 0xb1bc46c8    48 / 65 119 248 / 34,90%
>     \ java.util.HashMap$Node[256] @ 0xb6cfca48    1 040 / 65 119 184 / 34,90%
>       + java.util.HashMap$Node @ 0xb59106b8    32 / 236 464 / 0,13%
>       + java.util.HashMap$Node @ 0xb620b438    32 / 181 944 / 0,10%
>       + java.util.HashMap$Node @ 0xb3e5c6a0    32 / 169 344 / 0,09%
>       + java.util.HashMap$Node @ 0xb3bb0cd8    32 / 147 120 / 0,08%
>       + java.util.HashMap$Node @ 0xb5169f88    32 / 139 184 / 0,07%
>       + java.util.HashMap$Node @ 0xb4456400    32 / 131 296 / 0,07%
>       + java.util.HashMap$Node @ 0xb59e6668    32 / 126 616 / 0,07%
>       + java.util.HashMap$Node @ 0xb58818a0    32 / 121 504 / 0,07%
>       + java.util.HashMap$Node @ 0xb5839f40    32 / 120 560 / 0,06%
>       + java.util.HashMap$Node @ 0xb84fd158    32 / 120 496 / 0,06%
>       + java.util.HashMap$Node @ 0xb58104a8    32 / 118 000 / 0,06%
>       + java.util.HashMap$Node @ 0xb5947bd8    32 / 115 920 / 0,06%
>       + java.util.HashMap$Node @ 0xb84e3be8    32 / 115 024 / 0,06%
>       + java.util.HashMap$Node @ 0xb8488828    32 / 113 400 / 0,06%
>       + java.util.HashMap$Node @ 0xb3bc5258    32 / 110 704 / 0,06%
>       + java.util.HashMap$Node @ 0xb3e2cdc8    32 / 110 208 / 0,06%
>       + java.util.HashMap$Node @ 0xb57bda48    32 / 109 688 / 0,06%
>       + java.util.HashMap$Node @ 0xb57d2880    32 / 109 232 / 0,06%
>       + org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xb62518c8    192 / 106 944 / 0,06%
>       + org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl @ 0xd0b00a40    192 / 98 336 / 0,05%
>       \ Total: 20 entries    960 / 2 601 984 / 1,39%
>
> When the federation config is removed on the selected node, memory
> consumption on this node comes back to normal.
>
> Thanks
> Michal Balicki

--
Clebert Suconic
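On the "manually close the connection from the console" workaround mentioned above: a similar operation is exposed by the broker's management API (closing all connections from a given remote address), so it can be scripted as a stop-gap while the leak is investigated. The sketch below is only illustrative; it assumes a reference to the embedded ActiveMQServer is at hand, hard-codes the upstream address, and treats the symptom rather than the cause:

    import org.apache.activemq.artemis.api.core.management.ActiveMQServerControl;
    import org.apache.activemq.artemis.core.server.ActiveMQServer;

    /**
     * Rough sketch only: closes every connection coming from one remote address,
     * mirroring the manual "close connection" action in the console. The address
     * and the decision of when to call this are illustrative assumptions.
     */
    public final class StaleFederationConnectionReaper {

        private static final String UPSTREAM_BROKER_IP = "10.112.62.33"; // assumption

        private final ActiveMQServer server; // the embedded broker instance

        public StaleFederationConnectionReaper(ActiveMQServer server) {
            this.server = server;
        }

        public void closeUpstreamConnections() throws Exception {
            ActiveMQServerControl control = server.getActiveMQServerControl();
            boolean closed = control.closeConnectionsForAddress(UPSTREAM_BROKER_IP);
            if (closed) {
                // Federation re-establishes its link on the configured retry interval.
                System.out.println("Closed connections from " + UPSTREAM_BROKER_IP);
            }
        }
    }

Whether the ClientSessionFactoryImpl entries then disappear from the ServerLocatorImpl "factories" set seen in the MAT report is exactly what a follow-up heap dump would confirm.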