Greetings,

This is something of a follow-up to our previous failover issue reports, but this time we have
taken careful notes and logs, and hopefully there is enough information here to
diagnose what is happening.

We are experiencing an error during AMQ broker failover from live to backup.
We are testing this with a load generator of our own devising, in which
multiple threads perform RPC calls: each request is posted to a
named queue and its reply is returned on a reply-to temporary queue.
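
For reference, each load-generator thread follows the usual JMS request/reply pattern, roughly like this (a simplified sketch; the connection factory, queue name, payload, and timeout are illustrative placeholders, not our actual code):

Connection connection = connectionFactory.createConnection();    // javax.jms
connection.start();
Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
Queue requestQueue = session.createQueue("rpc.requests");         // named request queue
TemporaryQueue replyQueue = session.createTemporaryQueue();       // per-caller reply-to queue
MessageProducer producer = session.createProducer(requestQueue);
MessageConsumer replyConsumer = session.createConsumer(replyQueue);
TextMessage request = session.createTextMessage(requestJson);
request.setJMSReplyTo(replyQueue);
producer.send(request);
Message reply = replyConsumer.receive(60_000);                    // caller gives up after ~60 seconds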

Both the JMS client and the broker are version 2.31.2.

Our live (master) broker.xml: 
https://drive.google.com/file/d/10lDHv13AJXKOHZOLIdT7Cph8pt7VghEl/view?usp=sharing
Our backup (slave) broker.xml: 
https://drive.google.com/file/d/10gNkpFSABskxaPODFI_1GV-DzMIj4tem/view?usp=sharing

Oddly, the failover from backup to live never has an issue.

A synopsis of the timeline:

2024-01-23T22:31:20.719 our RPC service attempts to send a reply message to the 
reply-to queue

2024-01-23T22:31:35.721 the call to Producer.send() fails after 15 seconds: 
AMQ219014: Timed out after waiting 15000 ms for response when sending packet 45

Our code then delays for two seconds and calls Producer.send() again.

Meanwhile, the backup AMQ broker has sensed failure and taken over, and is 
processing messages from *other* clients:
2024-01-23 22:29:58,245 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::name=backup is synchronized with live server, nodeID=10952195-b6ec-11ee-9c87-aa03cb64206a.
2024-01-23 22:29:58,252 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
2024-01-23 22:31:20,720 WARN  [org.apache.activemq.artemis.core.server] AMQ222295: There is a possible split brain on nodeID 10952195-b6ec-11ee-9c87-aa03cb64206a. Topology update ignored
2024-01-23 22:31:20,721 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221084: Requested 0 quorum votes
2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221083: ignoring quorum vote as max cluster size is 1.
2024-01-23 22:31:20,723 INFO  [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results.
2024-01-23 22:31:20,732 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to dm-activemq-live-svc/10.0.52.174:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
2024-01-23 22:31:20,733 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to dm-activemq-live-svc/10.0.52.174:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
…
2024-01-23 22:31:21,450 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
2024-01-23 22:31:21,459 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started EPOLL Acceptor at 0.0.0.0:61617 for protocols [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
2024-01-23 22:31:21,816 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER= {"version":1,"type":"get_task_status_request","id":"d70xpfwljofo","api":"test_harness","method":"get_task_status","instance":"combined","authorization":"73DZU/fb1A2fFnKdzABPbLzAHVw7Z7VsfSLcQ7VqSBQ="}, BODY={"id":"909b23ae-578f-412d-9706-9f300adb9119","progress_start_index":0,"message_...

But… our second Producer.send() attempt fails again after about 44 seconds:
2024-01-23T22:32:21.936 [Thread-2 (ActiveMQ-client-global-threads)] JmsProducerPool.send_:376 [j6ugszoiu1gl] WARN - Error sending message, will retry javax.jms.JMSException: AMQ219016: Connection failure detected. Unblocking a blocking call that will never get a response
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:560)
    at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:452)
    at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:434)
    at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:808)
    at org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:390)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
    at org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:221)
    at net.redpoint.ipc.jms.JmsProducerPool.send_(JmsProducerPool.java:372)
    at net.redpoint.ipc.jms.JmsProducerPool.sendResponse(JmsProducerPool.java:319)
    at net.redpoint.ipc.jms.JmsRpcServer$RpcReceiver.handleMessage(JmsRpcServer.java:225)
    at net.redpoint.ipc.jms.JmsRpcServer$RpcReceiver.onMessage(JmsRpcServer.java:158)
    at org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
    at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:982)
    at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1139)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:57)
    at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:32)
    at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:68)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: ActiveMQUnBlockedException[errorType=UNBLOCKED message=AMQ219016: Connection failure detected. Unblocking a blocking call that will never get a response]
    ... 20 more
Caused by: ActiveMQDisconnectedException[errorType=DISCONNECTED message=AMQ219015: The connection was disconnected because of server shutdown]
    at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$CloseRunnable.run(ClientSessionFactoryImpl.java:1172)
    ... 6 more

We retry again, and this third attempt at Producer.send() does succeed, as seen 
in the backup broker’s log:
2024-01-23 22:32:23,937 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] SEND: HEADER= {"version":1,"type":"testing_echo_response","id":"j6ugszoiu1gl","http_code":200}…
2024-01-23 22:32:23,937 INFO  [net.redpoint.rpdm.artemis_logger.RpdmArtemisLogger] DELIVER: HEADER= {"version":1,"type":"testing_echo_response","id":"j6ugszoiu1gl","http_code":200}…

Unfortunately, by this time a full 63 seconds have gone by from the RPC
caller’s point of view, and our RPC client has timed out and given up.

It seems to us that the problem can be summarized as "Once the client gets the 
'AMQ219014: Timed out after waiting 15000 ms' error, an attempt at retry will 
fail again after 44 seconds".

It is worth noting that our send-retry code does not attempt to
destroy/recreate the Connection, Session, or Producer; we believe the client
should take care of that for us. It mostly does, except in this one case. Even
here it does recover eventually, but the 44-second delay is too long for us,
and it is unclear where that 44-second delay even comes from.

FYI our retry loop looks like:
private static final int SEND_RETRIES = 3;
private static final long SEND_RETRY_DELAY_MS = 2000;
...
var producer = ticket.pi.getProducer();
for (int retry = 0; retry < SEND_RETRIES; retry++) {
  try {
    producer.send(ticket.destination, jmsRequest, producer.getDeliveryMode(),
        producer.getPriority(), ttlMs);
    break;
  } catch (javax.jms.JMSException ex) {
    if (Arrays.stream(retryableCodes).anyMatch(code -> ex.getMessage().contains(code))
        && retry + 1 < SEND_RETRIES) {
      LOG.warn("Error sending message, will retry", ex);
      Thread.sleep(SEND_RETRY_DELAY_MS);
      continue;
    } else {
      throw ex;
    }
  }
}

Also see the thread dump generated at the 60-second mark, which is *after* the
first send attempt fails but *before* the second attempt fails (in other words,
this is a thread dump of the JVM state while our second Producer.send() attempt
is pending):
https://drive.google.com/file/d/10dIWqAL65zwWMEfN03WGzC_Ya1QayPGB/view?usp=sharing

Our questions come down to two things:

  *   Is this a bug in the AMQ JMS client?
  *   When we encounter this error, should we attempt to close/destroy/recreate 
the Producer, Session, or Connection? (If so, we imagine something like the sketch just below.)
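
If the answer to the second question is "yes", we imagine the recovery inside our catch block would look roughly like the following (a sketch only, not something we have implemented; the cleanup and rebuild details are our guesses):

} catch (javax.jms.JMSException ex) {
  // Guess: discard the possibly-wedged objects and rebuild from the shared Connection
  // (our producer pool would also need to be updated; omitted here).
  try { producer.close(); } catch (javax.jms.JMSException ignore) { }
  try { session.close(); } catch (javax.jms.JMSException ignore) { }
  session = connection.createSession(false, javax.jms.Session.AUTO_ACKNOWLEDGE);
  producer = session.createProducer(null);   // anonymous producer; destination is passed to send()
  producer.send(ticket.destination, jmsRequest, producer.getDeliveryMode(),
      producer.getPriority(), ttlMs);
}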

Please let me know if you can think of a workaround, or if there is more 
information we should capture.  This problem is readily reproducible.

Thanks
John





John Lilley

Data Management Chief Architect, Redpoint Global Inc.

34 Washington Street, Suite 205, Wellesley Hills, MA 02481

M: +1 720 938 5761 | john.lil...@redpointglobal.com

