> Is this a bug in the AMQ JMS client?

At this point I don't believe it is a bug in the ActiveMQ Artemis core JMS
client.

I talked about AMQ219014 previously, but it bears repeating here. The
timeout is inherently ambiguous. The client can't reliably conclude from a
timeout like this that the broker has failed. It could be the result of a
network issue or a broker slow-down (e.g. a long GC pause). The broker may
have received the data but simply failed to send a response back within the
timeout, or it may not have received anything at all. How to respond to the
timeout is ultimately up to the application.
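
For example, the application could first check whether the failure really
was this ambiguous blocked-call timeout before deciding what to do. A
minimal sketch (the isAmbiguousTimeout helper is my own name, not an API,
and the unwrapping assumes the core ActiveMQException is attached as the
JMSException's cause):

  import javax.jms.JMSException;
  import org.apache.activemq.artemis.api.core.ActiveMQException;
  import org.apache.activemq.artemis.api.core.ActiveMQExceptionType;

  // True if the failure is the ambiguous AMQ219014-style timeout, i.e.
  // the broker may or may not have received the packet that was sent.
  static boolean isAmbiguousTimeout(JMSException e) {
    Throwable cause = e.getCause();
    return cause instanceof ActiveMQException
        && ((ActiveMQException) cause).getType()
           == ActiveMQExceptionType.CONNECTION_TIMEDOUT;
  }

Whether to resend (risking a duplicate) or fail fast (risking a lost
message) after such a timeout is a decision only the application can make.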

In this case the application retries the operation, and that retry itself
fails after 44 seconds due to a connection loss. I believe the connection
loss is driven by the default connection TTL of 60 seconds (i.e. 15 + 44 =
59, which is close enough), during which time the client never received any
data from the broker (i.e. no pings, etc.).

> When we encounter this error, should we attempt to close/destroy/recreate
> the Producer, Session, or Connection?

It's hard to say what you "should" do in this circumstance, but if this
delay is too long then you should probably either change your connection
URL (e.g. lower your clientFailureCheckPeriod & connectionTTL [1]) or
change the application itself (as you mentioned) so that it handles the
failure appropriately for your use-case.
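
For example, a connection URL with more aggressive failure detection might
look like the sketch below (the host, port, and the 5/10-second values are
placeholders, not recommendations):

  import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

  // clientFailureCheckPeriod must be lower than connectionTTL; with these
  // values a dead connection should be detected in roughly 10 seconds
  // rather than the default ~60.
  ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory(
      "tcp://broker-host:61616?clientFailureCheckPeriod=5000&connectionTTL=10000");

Lowering these values trades faster failure detection for more sensitivity
to transient network hiccups and long GC pauses.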


Justin

[1]
https://activemq.apache.org/components/artemis/documentation/latest/connection-ttl.html#detecting-failure-from-the-client

On Fri, Jan 26, 2024 at 12:51 PM John Lilley
<john.lil...@redpointglobal.com.invalid> wrote:

> Greetings,
>
>
>
> This is something of a follow-up to previous failover issue reports, but
> we’ve taken careful notes and logs, and hopefully we have enough
> information to diagnose what is happening.
>
>
>
> We are experiencing an error during AMQ broker failover from live to
> backup.  We are testing this using a load generator of our own devising:
> multiple threads perform RPC calls, posting requests to a named queue and
> receiving replies on a reply-to temporary queue.
>
>
>
> Both the JMS client and the broker are version 2.31.2.
>
>
>
> Our live (master) broker.xml:
> https://drive.google.com/file/d/10lDHv13AJXKOHZOLIdT7Cph8pt7VghEl/view?usp=sharing
>
> Our backup (slave) broker.xml:
> https://drive.google.com/file/d/10gNkpFSABskxaPODFI_1GV-DzMIj4tem/view?usp=sharing
>
>
>
> Oddly, the failover from backup to live never has an issue.
>
>
>
> A synopsis of the timeline:
>
>
>
> 2024-01-23T22:31:20.719 our RPC service attempts to send a reply message
> to the reply-to queue
>
>
>
> 2024-01-23T22:31:35.721 the call to Producer.send() fails after 15
> seconds: AMQ219014: Timed out after waiting 15000 ms for response when
> sending packet 45
>
>
>
> Our code delays for two seconds and attempts to call Producer.send() again
>
>
>
> Meanwhile, the backup AMQ broker has sensed failure and taken over, and is
> processing messages from **other** clients:
>
> 2024-01-23 22:29:58,245 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::name=backup is synchronized with live server, nodeID=10952195-b6ec-11ee-9c87-aa03cb64206a.
>
> 2024-01-23 22:29:58,252 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
>
> 2024-01-23 22:31:20,720 WARN  [org.apache.activemq.artemis.core.server] AMQ222295: There is a possible split brain on nodeID 10952195-b6ec-11ee-9c87-aa03cb64206a. Topology update ignored
>
> 2024-01-23 22:31:20,721 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
>
> 2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221084: Requested 0 quorum votes
>
> 2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221083: ignoring quorum vote as max cluster size is 1.
>
> 2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results.
>
> 2024-01-23 22:31:20,732 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to dm-activemq-live-svc/10.0.52.174:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
>
> 2024-01-23 22:31:20,733 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to dm-activemq-live-svc/10.0.52.174:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
>
> …
>
> 2024-01-23 22:31:21,450 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
>
> 2024-01-23 22:31:21,459 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
>
> 2024-01-23 22:31:21,816 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"get_task_status_request","id":"d70xpfwljofo","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"73DZU/fb1A2fFnKdzABPbLzAHVw7Z7VsfSLcQ7VqSBQ="}, BODY={"id":"909b23ae-578f-412d-9706-9f300adb9119","progress_start_index":0,"message_...
>
>
>
> But… our second Producer.send() attempt fails again after about 44 seconds:
>
> 2024-01-23T22:32:21.936 [Thread-2 (ActiveMQ-client-global-threads)] JmsProducerPool.send_:376 [j6ugszoiu1gl] WARN - Error sending message, will retry
> javax.jms.JMSException: AMQ219016: Connection failure detected. Unblocking a blocking call that will never get a response
>     at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:560)
>     at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
>     at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:434)
>     at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:808)
>     at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:390)
>     at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>     at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
>     at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
>     at net.redpoint.ipc.jms.JmsProducerPool.sendResponse(JmsProducerPool.java:319)
>     at net.redpoint.ipc.jms.JmsRpcServer$RpcReceiver.handleMessage(JmsRpcServer.java:225)
>     at net.redpoint.ipc.jms.JmsRpcServer$RpcReceiver.onMessage(JmsRpcServer.java:158)
>     at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>     at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:982)
>     at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1139)
>     at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
>     at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
>     at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>     at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: ActiveMQUnBlockedException[errorType=UNBLOCKED message=AMQ219016: Connection failure detected. Unblocking a blocking call that will never get a response]
>     ... 20 more
> Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
>     at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
>     ... 6 more
>
>
>
> We retry again, and this third attempt at Producer.send() does succeed, as
> seen in the backup broker’s log:
>
> 2024-01-23 22:32:23,937 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER={"version":1,"type":"testing_echo_response","id":"j6ugszoiu1gl","http_code":200}…
>
> 2024-01-23 22:32:23,937 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER={"version":1,"type":"testing_echo_response","id":"j6ugszoiu1gl","http_code":200}…
>
>
>
> Unfortunately, by this time a whole 63 seconds has gone by from the RPC
> caller’s point of view, and our RPC client timed out and gave up.
>
>
>
> It seems to us that the problem can be summarized as "Once the client gets
> the 'AMQ219014: Timed out after waiting 15000 ms' error, a retry attempt
> will fail again after 44 seconds".
>
>
>
> It is worth noting that, in our send-retry code, we do not attempt to
> destroy/recreate the Connection, Session, or Producer; we believe the
> client should take care of that for us.  It mostly does, except in this
> one case.  And even then it recovers eventually, but the 44-second delay
> is too long for us, and it is unclear where that 44-second delay even
> comes from.
>
>
>
> FYI our retry loop looks like:
>
> private static final int SEND_RETRIES = 3;
> private static final long SEND_RETRY_DELAY_MS = 2000;
> ...
> var producer = ticket.pi.getProducer();
> for (int retry = 0; retry < SEND_RETRIES; retry++) {
>   try {
>     producer.send(ticket.destination, jmsRequest,
>         producer.getDeliveryMode(), producer.getPriority(), ttlMs);
>     break;
>   } catch (javax.jms.JMSException ex) {
>     // Retry only when the failure code is known to be retryable and we
>     // still have attempts left; otherwise rethrow to the caller.
>     if (Arrays.stream(retryableCodes).anyMatch(code -> ex.getMessage().contains(code))
>         && retry + 1 < SEND_RETRIES) {
>       LOG.warn("Error sending message, will retry", ex);
>       Thread.sleep(SEND_RETRY_DELAY_MS);
>     } else {
>       throw ex;
>     }
>   }
> }
>
>
>
> Also see the thread dump generated at the 60-second mark, which is
> **after** the first retry fails but **before** the second retry fails (in
> other words, this is the thread dump of the JVM state when our second
> attempt at Producer.send() is pending):
>
>
> https://drive.google.com/file/d/10dIWqAL65zwWMEfN03WGzC_Ya1QayPGB/view?usp=sharing
>
>
>
> Our questions come down to two things:
>
>    - Is this a bug in the AMQ JMS client?
>    - When we encounter this error, should we attempt to
>    close/destroy/recreate the Producer, Session, or Connection?
>
>
>
> Please let me know if you can think of a workaround, or if there is more
> information we should capture.  This problem is readily reproducible.
>
>
>
> Thanks
>
> John
>
>
> John Lilley
>
> Data Management Chief Architect, Redpoint Global Inc.
>
> 34 Washington Street, Suite 205 Wellesley Hills, MA 02481
>
> M: +1 7209385761 | john.lil...@redpointglobal.com
>
>
