Since your parameters didn't change the behavior, you could try tuning the TCP settings (JDK and OS). I find that TCP stacks behave differently across operating systems.
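(If changing launch flags is awkward, the two JDK properties in the example below can also be set programmatically at startup; a minimal sketch, assuming it runs before the application opens any connections:)

    public class JdkNetworkTimeouts {
        public static void main(String[] args) {
            // Same values as the -D flags below; set them before the first connection is made.
            System.setProperty("sun.net.client.defaultConnectTimeout", "10000"); // milliseconds
            System.setProperty("sun.net.client.defaultReadTimeout", "20000");    // milliseconds

            // ... start the rest of the application here ...
        }
    }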
JDK example:

    -Dsun.net.client.defaultReadTimeout=20000
    -Dsun.net.client.defaultConnectTimeout=10000

OS example (Windows):

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters\TcpTimedWaitDelay                30 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\EnableDynamicBacklog               1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\MinimumDynamicBacklog              20 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\MaximumDynamicBacklog              1000 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\DynamicBacklogGrowthDelta          10 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\KeepAliveInterval                  1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID}\TcpNoDelay       1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID}\TcpAckFrequency  1 (DWORD)

    C:\> netsh int tcp set global autotuninglevel=disabled
    C:\> netsh int ipv4 set dynamicportrange tcp start=32767 num=32768

OS example (Linux; if using IPv6 the options will differ):

    # echo 'net.ipv4.tcp_synack_retries=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_syn_retries=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_window_scaling=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_timestamps=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_sack=0' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_reordering=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_fastopen=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_max_syn_backlog=1500' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_probes=5' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_time=1800' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_intvl=60' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_tw_reuse=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_tw_recycle=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.ip_local_port_range=32768 65535' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_fin_timeout=10' >> /etc/sysctl.conf

Hope this helps!

________________________________
From: John Lilley <john.lil...@redpointglobal.com.INVALID>
Sent: Wednesday, February 21, 2024 1:45 PM
To: users@activemq.apache.org <users@activemq.apache.org>
Cc: Lino Pereira <lino.pere...@redpointglobal.com>
Subject: HA failover: Nothing we try reduces client recovery below one minute

Greetings!

We are having a devil of a time trying to reduce the delay during a failover event. We've set our URL to:

    (tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true&reconnectAttempts=200&initialConnectAttempts=200&clientFailureCheckPeriod=10000&connectionTTL=10000&callTimeout=10000

But nothing seems to reduce the time it takes for a sender to detect the issue and fail over. With or without these parameters, it takes a full minute to recover after a failover. Are there other parameters to adjust? Something in the broker? Is there some internal retry loop or timer that waits longer that we can influence?

We are testing failover by killing one of the AMQ pods. This almost always succeeds without issue, except occasionally it doesn't.
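(For reference, a minimal sketch of the same failover parameters applied programmatically, assuming the javax-based Artemis JMS client and its ActiveMQConnectionFactory setters; the values simply mirror the URL above:)

    import javax.jms.Connection;
    import javax.jms.JMSException;

    import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

    public class FailoverClientConfig {
        public static void main(String[] args) throws JMSException {
            // Same live/backup pair and ha=true as in the URL above.
            ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
                    "(tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true");

            cf.setReconnectAttempts(200);            // reconnectAttempts
            cf.setInitialConnectAttempts(200);       // initialConnectAttempts
            cf.setClientFailureCheckPeriod(10_000);  // clientFailureCheckPeriod (ms)
            cf.setConnectionTTL(10_000);             // connectionTTL (ms)
            cf.setCallTimeout(10_000);               // callTimeout (ms) for blocking sends

            try (Connection connection = cf.createConnection()) {
                connection.start();
                // ... create sessions, producers and consumers as usual ...
            }
        }
    }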
The backup broker log shows that it assumes control very quickly:

    2024-02-16 22:42:56,523 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
    2024-02-16 22:42:56,533 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
    2024-02-16 22:42:58,074 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
    2024-02-16 22:42:58,254 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}

Our app logs show that we were attempting to send a message to the AMQ broker at the time of the failover:

    2024-02-16T22:42:55.981 [http-nio-9910-exec-9] JmsRpcClientChannel.prepareCall:84 [9v2zwvclclrc] INFO - REQUEST OUT: { "header": {"version":1,"type":"get_task_status_request","id":"9v2zwvclclrc","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"***REDACTED***"}, "body":{"id":"d1123865-ac47-4238-ab02-5f2324a43264","progress_start_index":0,"message_type":"get_task_status_request"} }

But then it takes 40 seconds for the "10000 ms" timeout to happen:

    2024-02-16T22:43:35.988 [http-nio-9910-exec-9] JmsProducerPool.send_:376 [9v2zwvclclrc] WARN - Error sending message, will retry
    javax.jms.JMSException: AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71

Our problem is really that this 40-second delay seems to push everything back so that recovery takes over a minute. Once we pass the one-minute mark, we start hitting several other timeouts, such as our own RPC timeout setting and the nginx ingress controller for K8s. How can we find out why this takes so long? We've set every timeout we can find for the AMQ client to 10 seconds. Is there some other setting we need to adjust?

Meanwhile the primary broker pod has returned, and the backup decides it is not needed:

    2024-02-16 22:43:16,115 INFO  [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.31.2 [10952195-b6ec-11ee-9c87-aa03cb64206a] stopped, uptime 16 minutes
    2024-02-16 22:43:16,115 INFO  [org.apache.activemq.artemis.core.server] AMQ221039: Restarting as Replicating backup server after live restart

But our app is waiting for its response on the reply-to queue, and gets this error nearly a minute later:

    2024-02-16T22:43:58.033 [Thread-6] JmsStaticConnectionPool.onException:78 [] ERROR - Receive error occurred.
    javax.jms.JMSException: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]

This is the stack trace at the time of the first timeout:

    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550)
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
    at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendFullMessage(ActiveMQSessionContext.java:588)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:305)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:277)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:147)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:129)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:483)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
    at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
    at net.redpoint.ipc.jms.JmsProducerPool.sendRequest(JmsProducerPool.java:301)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.sendRequest(JmsRpcClientChannel.java:228)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.invokeRaw(JmsRpcClientChannel.java:202)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.call(JmsRpcClientChannel.java:101)
    at net.redpoint.ipc.clients.RpcClientBase._sync(RpcClientBase.java:169)
    at net.redpoint.ipc.clients.RpcClientBase._rpc(RpcClientBase.java:237)
    at net.redpoint.rpdm.ipc.clients.TestHarnessClient.getTaskStatus(TestHarnessClient.java:229)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$25(TestHarnessHttpServer.java:390)
    at net.redpoint.ipc.SecurityControl.doAsNoThrow(SecurityControl.java:272)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$26(TestHarnessHttpServer.java:390)
    at net.redpoint.ipc.InstanceControl.doAsNoThrow(InstanceControl.java:84)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.getTaskStatus(TestHarnessHttpServer.java:390)
    at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:357)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.base/java.lang.Thread.run(Thread.java:833)
    Caused by: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71]
    ... 65 more
This is the stack trace at the second timeout message:

    at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:714)
    at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:735)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:868)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:856)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:802)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:566)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1407)
    at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:98)
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:212)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
    Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
    ... 7 more

John Lilley
Data Management Chief Architect, Redpoint Global Inc.
34 Washington Street, Suite 205, Wellesley Hills, MA 02481
M: +1 7209385761 | john.lil...@redpointglobal.com