Since your parameters didn't change the behavior, you could try tuning the TCP settings (JDK and OS). I find that TCP stacks behave differently across operating systems.
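(If changing launch flags is awkward, the two JDK properties in the example below can also be set programmatically at startup; a minimal sketch, assuming it runs before the application opens any connections:)

    public class JdkNetworkTimeouts {
        public static void main(String[] args) {
            // Same values as the -D flags below; set them before the first connection is made.
            System.setProperty("sun.net.client.defaultConnectTimeout", "10000"); // milliseconds
            System.setProperty("sun.net.client.defaultReadTimeout", "20000");    // milliseconds

            // ... start the rest of the application here ...
        }
    }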
JDK example:

    -Dsun.net.client.defaultReadTimeout=20000
    -Dsun.net.client.defaultConnectTimeout=10000

OS example (Windows):

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\TCPIP\Parameters\TcpTimedWaitDelay                30 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\EnableDynamicBacklog               1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\MinimumDynamicBacklog              20 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\MaximumDynamicBacklog              1000 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\DynamicBacklogGrowthDelta          10 (DWORD, decimal)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters\KeepAliveInterval                  1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID}\TcpNoDelay       1 (DWORD)
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID}\TcpAckFrequency  1 (DWORD)

    C:\> netsh int tcp set global autotuninglevel=disabled
    C:\> netsh int ipv4 set dynamicportrange tcp start=32767 num=32768

OS example (Linux; if using IPv6 the options will differ):

    # echo 'net.ipv4.tcp_synack_retries=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_syn_retries=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_window_scaling=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_timestamps=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_sack=0' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_reordering=3' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_fastopen=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_max_syn_backlog=1500' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_probes=5' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_time=1800' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_keepalive_intvl=60' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_tw_reuse=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_tw_recycle=1' >> /etc/sysctl.conf
    # echo 'net.ipv4.ip_local_port_range=32768 65535' >> /etc/sysctl.conf
    # echo 'net.ipv4.tcp_fin_timeout=10' >> /etc/sysctl.conf

Hope this helps!

________________________________
From: John Lilley <john.lil...@redpointglobal.com.INVALID>
Sent: Wednesday, February 21, 2024 1:45 PM
To: users@activemq.apache.org <users@activemq.apache.org>
Cc: Lino Pereira <lino.pere...@redpointglobal.com>
Subject: HA failover: Nothing we try reduces client recovery below one minute

Greetings!

We are having a devil of a time trying to reduce the delay during a failover event. We've set our URL to:

    (tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true&reconnectAttempts=200&initialConnectAttempts=200&clientFailureCheckPeriod=10000&connectionTTL=10000&callTimeout=10000

But nothing seems to reduce the time it takes for a sender to detect the issue and fail over. With or without these parameters, it takes a full minute to recover after a failover. Are there other parameters to adjust? Something in the broker? Is there some internal retry loop or timer that waits longer that we can influence?

We are testing failover by killing one of the AMQ pods. This almost always succeeds without issue, except occasionally it doesn't.
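(For reference, a minimal sketch of the same failover parameters applied programmatically, assuming the javax-based Artemis JMS client and its ActiveMQConnectionFactory setters; the values simply mirror the URL above:)

    import javax.jms.Connection;
    import javax.jms.JMSException;

    import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

    public class FailoverClientConfig {
        public static void main(String[] args) throws JMSException {
            // Same live/backup pair and ha=true as in the URL above.
            ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
                    "(tcp://dm-activemq-live-svc:61616,tcp://dm-activemq-backup-svc:61617)?ha=true");

            cf.setReconnectAttempts(200);            // reconnectAttempts
            cf.setInitialConnectAttempts(200);       // initialConnectAttempts
            cf.setClientFailureCheckPeriod(10_000);  // clientFailureCheckPeriod (ms)
            cf.setConnectionTTL(10_000);             // connectionTTL (ms)
            cf.setCallTimeout(10_000);               // callTimeout (ms) for blocking sends

            try (Connection connection = cf.createConnection()) {
                connection.start();
                // ... create sessions, producers and consumers as usual ...
            }
        }
    }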
The backup broker log shows that it assumes control very quickly:

    2024-02-16 22:42:56,523 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
    2024-02-16 22:42:56,533 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
    2024-02-16 22:42:58,074 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}
    2024-02-16 22:42:58,254 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER={"version":1,"type":"void","id":"jagbjffeu3of","api":"hpa_metrics","method":"get_expected_pod_counts","authorization":"Q979k6lzOu9KmqMA89GbvmZnIAMNpJJP/TEAAa7Yjpo="}, BODY={"message_type":"void"}

Our app logs show that we were attempting to send a message to the AMQ broker at the time of the failover:

    2024-02-16T22:42:55.981 [http-nio-9910-exec-9] JmsRpcClientChannel.prepareCall:84 [9v2zwvclclrc] INFO - REQUEST OUT: { "header": {"version":1,"type":"get_task_status_request","id":"9v2zwvclclrc","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"***REDACTED***"}, "body":{"id":"d1123865-ac47-4238-ab02-5f2324a43264","progress_start_index":0,"message_type":"get_task_status_request"} }

But then it takes 40 seconds for the "10000 ms" timeout to happen:

    2024-02-16T22:43:35.988 [http-nio-9910-exec-9] JmsProducerPool.send_:376 [9v2zwvclclrc] WARN - Error sending message, will retry
    javax.jms.JMSException: AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71

Our problem is really that this 40-second delay seems to push everything back so that recovery takes over a minute. Once we pass the one-minute mark, we start hitting several other timeouts, such as our own RPC timeout setting and the nginx ingress controller for K8s. How can we find out why this takes so long? We've set every timeout we can find for the AMQ client to 10 seconds. Is there some other setting we need to adjust?

Meanwhile the primary broker pod has returned, and the backup decides it is not needed:

    2024-02-16 22:43:16,115 INFO  [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.31.2 [10952195-b6ec-11ee-9c87-aa03cb64206a] stopped, uptime 16 minutes
    2024-02-16 22:43:16,115 INFO  [org.apache.activemq.artemis.core.server] AMQ221039: Restarting as Replicating backup server after live restart

But our app is waiting for its response on the reply-to queue, and gets this error nearly a minute later:

    2024-02-16T22:43:58.033 [Thread-6] JmsStaticConnectionPool.onException:78 [] ERROR - Receive error occurred.
    javax.jms.JMSException: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]

This is the stack trace at the time of the first timeout:

    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:550)
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
    at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendFullMessage(ActiveMQSessionContext.java:588)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:305)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:277)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:147)
    at org.apache.activemq.artemis.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:129)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:483)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
    at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
    at net.redpoint.ipc.jms.JmsProducerPool.sendRequest(JmsProducerPool.java:301)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.sendRequest(JmsRpcClientChannel.java:228)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.invokeRaw(JmsRpcClientChannel.java:202)
    at net.redpoint.ipc.jms.JmsRpcClientChannel.call(JmsRpcClientChannel.java:101)
    at net.redpoint.ipc.clients.RpcClientBase._sync(RpcClientBase.java:169)
    at net.redpoint.ipc.clients.RpcClientBase._rpc(RpcClientBase.java:237)
    at net.redpoint.rpdm.ipc.clients.TestHarnessClient.getTaskStatus(TestHarnessClient.java:229)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$25(TestHarnessHttpServer.java:390)
    at net.redpoint.ipc.SecurityControl.doAsNoThrow(SecurityControl.java:272)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.lambda$getTaskStatus$26(TestHarnessHttpServer.java:390)
    at net.redpoint.ipc.InstanceControl.doAsNoThrow(InstanceControl.java:84)
    at net.redpoint.rpdm.ipc.web_service_gateway.TestHarnessHttpServer.getTaskStatus(TestHarnessHttpServer.java:390)
    at jdk.internal.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:568)
    at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:52)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:146)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:189)
    at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:219)
    at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:93)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:478)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:400)
    at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:81)
    at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:261)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
    at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
    at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
    at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:265)
    at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:240)
    at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:697)
    at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:394)
    at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:357)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:311)
    at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:205)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:149)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:340)
    at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:391)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
    at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1744)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.base/java.lang.Thread.run(Thread.java:833)
    Caused by: ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT message=AMQ219014: Timed out after waiting 10000 ms for response when sending packet 71]
    ... 65 more
This is the stack trace at the second timeout message:

    at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:714)
    at org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:735)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:868)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:856)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:802)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:566)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1407)
    at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:98)
    at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:212)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
    Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
    ... 7 more

John Lilley
Data Management Chief Architect, Redpoint Global Inc.
34 Washington Street, Suite 205, Wellesley Hills, MA 02481
M: +1 7209385761 | john.lil...@redpointglobal.com