Markus Meierhofer created ARTEMIS-3337:
------------------------------------------

             Summary: CORE Client sessions are closed when there are errors 
during failover
                 Key: ARTEMIS-3337
                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
             Project: ActiveMQ Artemis
          Issue Type: Bug
    Affects Versions: 2.17.0
         Environment: {noformat}
 {noformat}
            Reporter: Markus Meierhofer


h3. The issue

When a CORE client does failover to a new connection, it usually transfers all 
sessions from the old to the new connection after it's able to establish the 
new connection. When the session transfer to the new connection fails, it 
assumes the new connection is also broken and does the same steps again (get 
new connection, transfer all sessions) until the reconnectAttempts have been 
reached. But after this failed session transfer, there seems to be a bug that 
sessions can get closed.

 

One occurrence of the bug:

A client has lost connection due to bad Wifi and tried to reconnect. After 
reconnect, the sessions were transferred, but one of the sessions couldn't be 
transferred (Recognizable by the "failed to handle failover" error):
{noformat}
[ERROR 2021-05-27 00:24:08,175 l-threads)  
org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 32
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
Afterwards, although the failover was still ongoing, a producer tried to send a 
message, but got the error "Connection is destroyed". This error is only 
printed if a session's underlying ChannelImpl is closed:
{noformat}
[ERROR 2021-05-27 00:24:15,526 l-threads) 
n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
reply! (Request queue: robot_2001.robot_command, Reply queue: 
ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, 
id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
LhdOptions( z=0.6) timeout=1800), 
transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, 
orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), 
type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: 
RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
[ERROR 2021-05-27 00:24:15,526 l-threads) 
n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
javax.jms.JMSException: AMQ219010: Connection is destroyed
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
        at 
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
        at 
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
AMQ219010: Connection is destroyed
        ... 21 common frames omitted{noformat}
~12 seconds later, the connection "successfully" failovered:
{noformat}
[ERROR 2021-05-27 00:24:27,948 ad-6433287 
nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved 
exception!
javax.jms.JMSException: 
ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT 
message=AMQ219014: Timed out after waiting 4,000 ms for response when sending 
packet 45]
        at 
org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
        at 
org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: 
org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 45
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
        at 
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
        at 
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
        ... 6 common frames omitted
[ERROR 2021-05-27 00:24:27,948 ad-6433287 
nt.control.RestartAgentExceptionListener]: JMS exception error code is: 
FAILOVER{noformat}
But even after the failover, every time a message was tried to be sent on that 
session, it failed with "Connection is destroyed". Other sessions worked fine 
after the failover:
{noformat}
[ERROR 2021-05-27 00:24:28,511 l-threads) 
n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
reply! (Request queue: robot_2001.robot_command, Reply queue: 
ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, 
id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, 
timestamp=2021-05-26T22:24:28.508373Z),Reply to send: 
RobotCommandResponse(id=1487919, result=SUCCESS)):
[ERROR 2021-05-27 00:24:28,511 l-threads) 
n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
javax.jms.JMSException: AMQ219010: Connection is destroyed
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
        at 
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
        at 
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
        at 
org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
        at 
com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
        at 
com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
        at 
org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
        at 
org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at 
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at 
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at 
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
AMQ219010: Connection is destroyed
        ... 21 common frames omitted
{noformat}
h3. Cause of the issue

After reconnect, every session's "handleFailover" method is called to transfer 
the session to the new connection. But if handleFailover fails for one session, 
the reconnect immediately returns and tries again:

 
{noformat}
for (ClientSessionInternal session : sessionsToFailover) {
         if (!session.handleFailover(connection, cause)) {
            return false;
         }
}{noformat}
 

The issue here is that "handleFailover" is not executed for every session:
 * During handleFailover, "transferConnection" is called on the session's 
ChannelImpl, which removes the session's channel from the old connection and 
puts it onto the new one. If transferConnection is not executed for a session, 
this session's channel stays on the old connection.
 * During connection destroy, every channel set on that connection is closed 
via Channel.close
 * Now if one of the sessions wouldn't successfully execute handleFailover, the 
method would immediately return and not remove the remaining session's channels 
from the old connection. And as before reconnecting, the old connection (which 
still contains some session channels) is destroyed, those session's channels 
are also closed.

Even if handleFailover is afterwards successfully executed on those sessions, 
the "closed" flag is already set on their channels, so they will reject future 
sends.
h3. Possible solution

It needs to be ensured that all session's channels are transferred (=removed 
from the old connection) before the old connection is destroyed, even if an 
earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to