[ 
https://issues.apache.org/jira/browse/ARTEMIS-3337?focusedWorklogId=608364&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-608364
 ]

ASF GitHub Bot logged work on ARTEMIS-3337:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 08/Jun/21 09:45
            Start Date: 08/Jun/21 09:45
    Worklog Time Spent: 10m 
      Work Description: brusdev commented on pull request #3613:
URL: https://github.com/apache/activemq-artemis/pull/3613#issuecomment-856625762


   @meierhofer08 good catch. Why not just detaching the sessions from the old 
connections in preHandleFailover avoiding a useless call to handleFailover?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 608364)
    Time Spent: 20m  (was: 10m)

> CORE Client sessions are closed when there are errors during failover
> ---------------------------------------------------------------------
>
>                 Key: ARTEMIS-3337
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3337
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.17.0
>         Environment: {noformat}
>  {noformat}
>            Reporter: Markus Meierhofer
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> h3. The issue
> When a CORE client does failover to a new connection, it usually transfers 
> all sessions from the old to the new connection after it's able to establish 
> the new connection. When the session transfer to the new connection fails, it 
> assumes the new connection is also broken and does the same steps again (get 
> new connection, transfer all sessions) until the reconnectAttempts have been 
> reached. But after this failed session transfer, there seems to be a bug that 
> sessions can get closed.
>  
> One occurrence of the bug:
> A client has lost connection due to bad Wifi and tried to reconnect. After 
> reconnect, the sessions were transferred, but one of the sessions couldn't be 
> transferred (Recognizable by the "failed to handle failover" error):
> {noformat}
> [ERROR 2021-05-27 00:24:08,175 l-threads)  
> org.apache.activemq.artemis.core.client]: AMQ214003: Failed to handle failover
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 
> 32
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.reattachOnNewConnection(ActiveMQSessionContext.java:836)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.handleFailover(ClientSessionImpl.java:1393)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.reconnectSessions(ClientSessionFactoryImpl.java:806)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:638)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118){noformat}
> Afterwards, although the failover was still ongoing, a producer tried to send 
> a message, but got the error "Connection is destroyed". This error is only 
> printed if a session's underlying ChannelImpl is closed:
> {noformat}
> [ERROR 2021-05-27 00:24:15,526 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
> reply! (Request queue: robot_2001.robot_command, Reply queue: 
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
> RobotCommandRequest(info=SlotCommandInfo(super=CommandInfo(source=SYSTEM, 
> id=1486607), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
> LhdOptions( z=0.6) timeout=1800), 
> transferOptions=TransferOptions(SECONDARY_SLOT loadUnits=[LoadUnit(code=, 
> orderId=1486594, verify=false, empty=false, size=1)]), goalName=LMC_H1), 
> type=unload_at, timestamp=2021-05-26T22:24:03.450405Z),Reply to send: 
> RobotCommandResponse(id=1486607, result=REQUEST_TOO_OLD)):
> [ERROR 2021-05-27 00:24:15,526 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
> AMQ219010: Connection is destroyed
>         ... 21 common frames omitted{noformat}
> ~12 seconds later, the connection "successfully" failovered:
> {noformat}
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 
> nt.control.RestartAgentExceptionListener]: JMS exception listener retrieved 
> exception!
> javax.jms.JMSException: 
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT 
> message=AMQ219014: Timed out after waiting 4,000 ms for response when sending 
> packet 45]
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:750)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQConnection$JMSFailureListener.connectionFailed(ActiveMQConnection.java:771)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:736)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.callSessionFailureListeners(ClientSessionFactoryImpl.java:724)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.failoverOrReconnect(ClientSessionFactoryImpl.java:680)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.handleConnectionFailure(ClientSessionFactoryImpl.java:525)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl.access$600(ClientSessionFactoryImpl.java:74)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingFailureListener.connectionFailed(ClientSessionFactoryImpl.java:1282)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.callFailureListeners(AbstractRemotingConnection.java:78)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.fail(RemotingConnectionImpl.java:220)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.fail(AbstractRemotingConnection.java:221)
>         at 
> org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.lambda$asyncFail$0(AbstractRemotingConnection.java:228)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: 
> AMQ219014: Timed out after waiting 4,000 ms for response when sending packet 
> 45
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:530)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.queueQuery(ActiveMQSessionContext.java:359)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.queueQuery(ClientSessionImpl.java:780)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:409)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         ... 6 common frames omitted
> [ERROR 2021-05-27 00:24:27,948 ad-6433287 
> nt.control.RestartAgentExceptionListener]: JMS exception error code is: 
> FAILOVER{noformat}
> But even after the failover, every time a message was tried to be sent on 
> that session, it failed with "Connection is destroyed". Other sessions worked 
> fine after the failover:
> {noformat}
> [ERROR 2021-05-27 00:24:28,511 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: JMS Exception sending service call 
> reply! (Request queue: robot_2001.robot_command, Reply queue: 
> ActiveMQTemporaryQueue[e375e12c-65c9-4646-9212-2edcc8b8587c], Request: 
> RobotCommandRequest(info=GotoCommandInfo(super=CommandInfo(source=SYSTEM, 
> id=1487919), gotoOptions=GotoOptions(fineposDirection=FORWARD ec=true 
> LhdOptions( z=0.797) timeout=1800), goalName=WH_Conveyor_Box_Put), type=goto, 
> timestamp=2021-05-26T22:24:28.508373Z),Reply to send: 
> RobotCommandResponse(id=1487919, result=SUCCESS)):
> [ERROR 2021-05-27 00:24:28,511 l-threads) 
> n.bridge.impl.service.BasicServiceServer]: AMQ219010: Connection is destroyed
> javax.jms.JMSException: AMQ219010: Connection is destroyed
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:460)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
>         at 
> org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.addressQuery(ActiveMQSessionContext.java:410)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.addressQuery(ClientSessionImpl.java:791)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQSession.checkDestination(ActiveMQSession.java:388)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.doSendx(ActiveMQMessageProducer.java:406)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:220)
>         at 
> org.apache.activemq.artemis.jms.client.ActiveMQMessageProducer.send(ActiveMQMessageProducer.java:207)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.sendResponse(BasicServiceServer.java:128)
>         at 
> com.incubedit.communication.bridge.impl.service.ServiceServer.handleMessage(ServiceServer.java:33)
>         at 
> com.incubedit.communication.bridge.impl.service.BasicServiceServer.onNewMessage(BasicServiceServer.java:112)
>         at 
> org.apache.activemq.artemis.jms.client.JMSMessageListenerWrapper.onMessage(JMSMessageListenerWrapper.java:110)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.callOnMessage(ClientConsumerImpl.java:1030)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.access$400(ClientConsumerImpl.java:49)
>         at 
> org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl$Runner.run(ClientConsumerImpl.java:1153)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
>         at 
> org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
>         at 
> org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
> Caused by: 
> org.apache.activemq.artemis.api.core.ActiveMQNotConnectedException: 
> AMQ219010: Connection is destroyed
>         ... 21 common frames omitted
> {noformat}
> h3. Cause of the issue
> After reconnect, every session's "handleFailover" method is called to 
> transfer the session to the new connection. But if handleFailover fails for 
> one session, the reconnect immediately returns and tries again:
>  
> {noformat}
> for (ClientSessionInternal session : sessionsToFailover) {
>          if (!session.handleFailover(connection, cause)) {
>             return false;
>          }
> }{noformat}
>  
> The issue here is that "handleFailover" is not executed for every session:
>  * During handleFailover, "transferConnection" is called on the session's 
> ChannelImpl, which removes the session's channel from the old connection and 
> puts it onto the new one. If transferConnection is not executed for a 
> session, this session's channel stays on the old connection.
>  * During connection destroy, every channel set on that connection is closed 
> via Channel.close
>  * Now if one of the sessions wouldn't successfully execute handleFailover, 
> the method would immediately return and not remove the remaining session's 
> channels from the old connection. And as before reconnecting, the old 
> connection (which still contains some session channels) is destroyed, those 
> session's channels are also closed.
> Even if handleFailover is afterwards successfully executed on those sessions, 
> the "closed" flag is already set on their channels, so they will reject 
> future sends.
> h3. Possible solution
> It needs to be ensured that all session's channels are transferred (=removed 
> from the old connection) before the old connection is destroyed, even if an 
> earlier session failed to be transferred on the server-side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to