[ 
https://issues.apache.org/jira/browse/ARTEMIS-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016318#comment-16016318
 ] 

clebert suconic commented on ARTEMIS-1170:
------------------------------------------

Do you have your data clean when you do this? do you always start with clean 
data? or you have a bunch of stuff already there?

> HA Replication Failback Failure  - Sync Timed Out
> -------------------------------------------------
>
>                 Key: ARTEMIS-1170
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-1170
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.1.0
>            Reporter: Andrei Bercu
>            Assignee: clebert suconic
>
> We are running artemis in a master/slave setup (1 master, 1 slave) using 
> replication. We had artemis 2.0.0 and reported this issue: 
> [ARTEMIS-1161|https://issues.apache.org/jira/browse/ARTEMIS-1161] . Following 
> advice, we replaced artemis 2.0.0 with 2.1.0 and the failback still doesn't 
> work, failing in a different form though, in about 50% of our test 
> experiments.
> Now, during failback, master reports:
> {noformat}
> 07:11:22,962 INFO  [org.apache.activemq.artemis.integration.bootstrap] 
> AMQ101000: Starting ActiveMQ Artemis Server
> 07:11:22,975 INFO  [org.apache.activemq.artemis.core.server] AMQ221000: live 
> Message Broker is starting with configuration Broker Configuration 
> (clustered=true,journalDirectory=/var/lib/artemis/shadowbroker/data/journal,bindingsDirectory=/var/lib/artemis/shadowbroker/data/bindings,largeMessagesDirectory=/var/lib/artemis/shadowbroker/data/largemessages,pagingDirectory=/var/lib/artemis/shadowbroker/data/paging)
> 07:11:23,416 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There 
> were too many old replicated folders upon startup, removing 
> /var/lib/artemis/shadowbroker/data/bindings/oldreplica.313
> 07:11:23,421 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: 
> Moving data directory /var/lib/artemis/shadowbroker/data/bindings to 
> /var/lib/artemis/shadowbroker/data/bindings/oldreplica.315
> 07:11:23,423 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There 
> were too many old replicated folders upon startup, removing 
> /var/lib/artemis/shadowbroker/data/journal/oldreplica.313
> 07:11:23,426 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: 
> Moving data directory /var/lib/artemis/shadowbroker/data/journal to 
> /var/lib/artemis/shadowbroker/data/journal/oldreplica.315
> 07:11:23,427 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There 
> were too many old replicated folders upon startup, removing 
> /var/lib/artemis/shadowbroker/data/paging/oldreplica.252
> 07:11:23,427 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: 
> Moving data directory /var/lib/artemis/shadowbroker/data/paging to 
> /var/lib/artemis/shadowbroker/data/paging/oldreplica.254
> 07:11:23,437 INFO  [org.eclipse.jetty.util.log] Logging initialized @1968ms 
> to org.eclipse.jetty.util.log.Slf4jLog
> 07:11:23,448 INFO  [org.apache.activemq.artemis.core.server] AMQ221012: Using 
> AIO Journal
> 07:11:23,533 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-server]. Adding protocol support for: CORE
> 07:11:23,533 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-amqp-protocol]. Adding protocol support for: 
> AMQP
> 07:11:23,534 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-hornetq-protocol]. Adding protocol support 
> for: HORNETQ
> 07:11:23,534 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-mqtt-protocol]. Adding protocol support for: 
> MQTT
> 07:11:23,534 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-openwire-protocol]. Adding protocol support 
> for: OPENWIRE
> 07:11:23,535 INFO  [org.apache.activemq.artemis.core.server] AMQ221043: 
> Protocol module found: [artemis-stomp-protocol]. Adding protocol support for: 
> STOMP
> 07:11:23,591 INFO  [org.eclipse.jetty.server.Server] jetty-9.4.z-SNAPSHOT
> 07:11:23,643 INFO  [org.apache.activemq.artemis.core.server] AMQ221109: 
> Apache ActiveMQ Artemis Backup Server version 2.1.0 [null] started, waiting 
> live to fail before it gets active
> 07:11:23,702 INFO  [org.eclipse.jetty.webapp.StandardDescriptorProcessor] NO 
> JSP Support for /jolokia, did not find org.eclipse.jetty.jsp.JettyJspServlet
> 07:11:23,713 INFO  [org.eclipse.jetty.server.session] DefaultSessionIdManager 
> workerName=node0
> 07:11:23,714 INFO  [org.eclipse.jetty.server.session] No SessionScavenger 
> set, using defaults
> 07:11:23,715 INFO  [org.eclipse.jetty.server.session] Scavenging every 
> 660000ms
> 07:11:23,764 INFO  [org.eclipse.jetty.ContextHandler.jolokia] jolokia-agent: 
> No access restrictor found, access to any MBean is allowed
> 07:11:23,910 INFO  [org.eclipse.jetty.server.handler.ContextHandler] Started 
> o.e.j.w.WebAppContext@127a7a2e{/jolokia,file:///var/lib/artemis/shadowbroker/tmp/jetty-localhost-8161-jolokia.war-_jolokia-any-6088373165208569604.dir/webapp/,AVAILABLE}{/opt/artemis/web/jolokia.war}
> 07:11:23,911 INFO  [org.eclipse.jetty.server.handler.ContextHandler] Started 
> o.e.j.s.h.ContextHandler@14008db3{/,file:///opt/artemis/web/,AVAILABLE}
> 07:11:23,925 INFO  [org.eclipse.jetty.server.AbstractConnector] Started 
> ServerConnector@28d18df5{HTTP/1.1,[http/1.1]}{localhost:8161}
> 07:11:23,926 INFO  [org.eclipse.jetty.server.Server] Started @2459ms
> 07:11:23,927 INFO  [org.apache.activemq.artemis] AMQ241001: HTTP Server 
> started at http://localhost:8161
> 07:11:23,927 INFO  [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia 
> REST API available at http://localhost:8161/jolokia
> 07:11:24,745 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: 
> Backup server 
> ActiveMQServerImpl::serverUUID=47e70ff8-2b2c-11e7-84d3-525400ec0962 is 
> synchronized with live-server.
> 07:11:24,760 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: 
> backup announced
> 07:12:54,442 INFO  [org.apache.activemq.artemis.core.server] AMQ221037: 
> ActiveMQServerImpl::serverUUID=47e70ff8-2b2c-11e7-84d3-525400ec0962 to become 
> 'live'
> 07:12:54,556 WARN  [org.apache.activemq.artemis.core.server] AMQ222165: No 
> Dead Letter Address configured for queue test-load in AddressSettings
> 07:12:54,556 WARN  [org.apache.activemq.artemis.core.server] AMQ222166: No 
> Expiry Address configured for queue test-load in AddressSettings
> 07:12:54,910 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: 
> Server is now live
> 07:12:54,933 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: 
> Started EPOLL Acceptor at 0.0.0.0:61616 for protocols 
> [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
> 07:12:54,936 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: 
> Started EPOLL Acceptor at 0.0.0.0:61613 for protocols [STOMP]
> {noformat}
> \* Note the logs: _Backup server ActiveMQServerImpl::serverUUID=xxx is 
> synchronized with live-server_ / _backup announced_
> and slave reports:
> {noformat}
> 07:11:23,818 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: 
> Replication: sending 
> AIOSequentialFile:/var/lib/artemis/shadowbroker/data/journal/activemq-data-1759.amq
>  (size=10,485,760) to replica.
> 07:11:24,352 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: 
> Replication: sending NIOSequentialFile 
> /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1448.bindings 
> (size=1,048,576) to replica.
> 07:11:24,366 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: 
> Replication: sending NIOSequentialFile 
> /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1456.bindings 
> (size=1,048,576) to replica.
> 07:11:24,389 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: 
> Replication: sending NIOSequentialFile 
> /var/lib/artemis/shadowbroker/data/bindings/activemq-bindings-1447.bindings 
> (size=1,048,576) to replica.
> 07:12:54,399 WARN  
> [org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager]
>  AMQ119114: Replication synchronization process timed out after waiting 
> 90,000 milliseconds: java.lang.IllegalStateException: AMQ119114: Replication 
> synchronization process timed out after waiting 90,000 milliseconds
>         at 
> org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:607)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at 
> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:579)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at 
> org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:171)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92]
> 07:12:54,401 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: 
> Connection failure has been detected: AMQ119014: Did not receive data from 
> /172.23.12.5:37606 within the 60,000ms connection TTL. The connection will 
> now be closed. [code=CONNECTION_TIMEDOUT]
> 07:12:54,401 WARN  [org.apache.activemq.artemis.core.server] AMQ222092: 
> Connection to the backup node failed, removing replication now: 
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT 
> message=AMQ119014: Did not receive data from /172.23.12.5:37606 within the 
> 60,000ms connection TTL. The connection will now be closed.]
>         at 
> org.apache.activemq.artemis.core.remoting.server.impl.RemotingServiceImpl$FailureCheckAndFlushThread$2.run(RemotingServiceImpl.java:723)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at 
> org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:101)
>  [artemis-commons-2.1.0.jar:2.1.0]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [rt.jar:1.8.0_92]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [rt.jar:1.8.0_92]
>         at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92]
> 07:12:54,406 WARN  [org.apache.activemq.artemis.core.server] AMQ222013: Error 
> when trying to start replication: java.lang.IllegalStateException: AMQ119114: 
> Replication synchronization process timed out after waiting 90,000 
> milliseconds
>         at 
> org.apache.activemq.artemis.core.replication.ReplicationManager.sendSynchronizationDone(ReplicationManager.java:607)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at 
> org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:579)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at 
> org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:171)
>  [artemis-server-2.1.0.jar:2.1.0]
>         at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_92]
> 07:12:54,412 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: 
> Connection failure has been detected: AMQ119014: Did not receive data from 
> /172.20.98.64:45224 within the 60,000ms connection TTL. The connection will 
> now be closed. [code=CONNECTION_TIMEDOUT]
> 07:12:54,412 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8431e0ae-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,416 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8431e0ae-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,416 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8432a3ff-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,418 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8432a3ff-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,418 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 84338e60-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,421 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 84338e60-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,421 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8435b141-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,422 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8435b141-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,422 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8436e9c2-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,424 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8436e9c2-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,424 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8437fb33-3af1-11e7-ac67-0acf5bbeb077
> 07:12:54,425 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8437fb33-3af1-11e7-ac67-0acf5bbeb077
> 07:13:31,653 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8431e090-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,654 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8431e090-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,654 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8432a3e1-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,655 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8432a3e1-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,655 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 84336732-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,655 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 84336732-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,655 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 84340373-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,656 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 84340373-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,656 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 8435b124-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,656 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 8435b124-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,656 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: 
> Client connection failed, clearing up resources for session 
> 84367475-3af1-11e7-b91a-0acf5bbeb077
> 07:13:31,663 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: 
> Cleared up resources for session 84367475-3af1-11e7-b91a-0acf5bbeb077
> {noformat}
> \* Note the logs: _Replication synchronization process timed out after 
> waiting 90,000 milliseconds_ / _Did not receive data from /xxx:37606 within 
> the 60,000ms connection TTL_
> Same as before, master has the following ha-policy:
> {code:xml}
>   <ha-policy>
>      <replication>
>         <master>
>             <check-for-live-server>true</check-for-live-server>
>             
> <initial-replication-sync-timeout>90000</initial-replication-sync-timeout>
>         </master>
>      </replication>
>   </ha-policy>
> {code}
> and slave has:
> {code:xml}
>   <ha-policy>
>      <replication>
>         <slave>
>            <allow-failback>true</allow-failback>
>            
> <initial-replication-sync-timeout>90000</initial-replication-sync-timeout>
>         </slave>
>      </replication>
>   </ha-policy>
> {code}
> For running the tests, we send messages using the artemis producer - command:
> {noformat}
> ./artemis producer --url 
> "tcp://xxx:61616?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1"
>  --destination queue://test-load --message-count 60000 --user xxx --password 
> xxx --sleep 200 --message-size 4000 --threads 5
> {noformat}
> and consume messages with the artemis consumer - command:
> {noformat}
> ./artemis consumer --url 
> "tcp://xxx:61616?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1"
>  --destination queue://test-load --threads 5 --message-count 60000 --user xxx 
> --password xxx
> {noformat}
> Please let us know if we can provide some other info about this new issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Reply via email to