Issue: under unstable network conditions and/or network outages, ActiveMQ connection is not re-established following the error: Could not accept connection from tcp://hostname.domain.com:portNumber
Any settings that could help with recovering from this error would be much appreciated. ActiveMQ versions 5.13.0 (client) 5.13.1 (server) Setup: we have servers worldwide and network connectivity is not optimal everyday everywhere. Network outage are not unusual, sometimes for hours, and this error requires a restart of the broker as restarting the spoke does not help. ActiveMQ traffic is very irregular with long periods of inactivity and periods of high activity. Configuration Spoke: acts as a listener, but is the originator of the messages (no network connector definition in its activemq.xml) Broker: establishes connection to spokes, uses static bridge, duplex connection. Sample network connector: <networkConnector name="hostname.domain.com-STATS" uri="static:(ssl://hostname.domain.com:5902)?useExponentialBackOff=false&initialReconnectDelay=5000&maxReconnectDelay=5000&jms.watchTopicAdvisories=false" duplex="true" staticBridge="true"> <staticallyIncludedDestinations> <queue physicalName="tool.stats"/> <queue physicalName="user.access"/> <queue physicalName="user.session"/> <queue physicalName="user.service"/> <queue physicalName="tool.lightpole"/> <queue physicalName="system.config"/> </staticallyIncludedDestinations> </networkConnector> Test steps to re-create the error seen on production servers. In this test, there are no messages sent. This is just one scenario, there maybe others leading to the same situation, but following the steps below ensure the issue can be reproduced. For instance, the same result can be achieved without step 5, but the wait may be a lot longer as it is all a timing issue. 1. Start with an established healthy connection 2. On the spoke, using the linux network emulator, simulate an 80% packet loss (in the real world it would require be much less, but in a test environment with AMQ running in VMs in the same VM cluster and not to wait forever, this does the trick). The command used is tc qdisc [add/replace] dev apdaterName root netem loss 80% 3. Eventually this will cause AMQ to drop the consumer (i.e. when monitoring the status via the AMQ console, it shows 0 consumer) 4. While there is 0 consumer, ensure that both server and client show a TCP established connection (netstat -nat | grep portNumber shows ESTABLISHED) 5. Severe the network connection completely by simulating 100% packet loss (tc qdisc replace dev apdaterName root netem loss 100%) 6. Wait -it takes anything from 2-10 minutes- until the error Could not accept connection occurs on the spoke. At that point the TCP connection has been dropped on the spoke. 7. Re-enable the network (tc qdisc replace dev apdaterName root netem loss 0%), but the connection is never re-established Result: The exception associated with "Could not accept connection" will vary per test and can be anything from ssl handshake error, or no route to host, or EOFException... On the broker: there is no further attempt to reconnect and the TCP connection remains established. ActiveMQ must be restarted to re-establish a valid connection. Note: if the broker happens to drop the TCP connection BEFORE the spoke, then the connection is re-established successfully. Using the soTimeout/soWriteTimeout keepAlive transport options does not seem to affect the behavior Interestingly, the debug logging socket close occurred when shutting down AMQ in this scenario. Log snippet: activemq.log.66:2016-05-19 17:56:07,998 | DEBUG | No connection attempt made in time for ssl:///172.27.11.189:5902! Throwing InactivityIOException. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor ReadCheckTimer activemq.log.68:2016-05-19 17:55:37,964 | DEBUG | No message received since last read check for ssl:///172.27.11.189:5902. Throwing InactivityIOException. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor ReadCheckTimer activemq.log.68:2016-05-19 17:55:37,965 | DEBUG | Running WriteCheck[tcp://172.27.11.189:5902] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker activemq.log.68:2016-05-19 17:55:37,966 | DEBUG | Running ReadCheck[tcp://172.27.11.189:5902] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker activemq.log.68:2016-05-19 17:55:37,968 | WARN | Network connection between vm://localhost#112 and ssl:///172.27.11.189:5902 shutdown due to a remote error: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>30000) long: tcp://172.27.11.189:5902 | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ InactivityMonitor Worker activemq.log.68:2016-05-19 17:55:37,968 | DEBUG | The remote Exception was: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>30000) long: tcp://172.27.11.189:5902 | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ InactivityMonitor Worker activemq.log.68:org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>30000) long: tcp://172.27.11.189:5902 activemq.log.68:2016-05-19 17:55:37,972 | DEBUG | Unregistering MBean org.apache.activemq:brokerName=localhost,connector=networkConnectors,networkConnectorName=ECNICE_ILSTECHNOLOGY_COM-STATS,networkBridge=tcp_//172.27.11.189_5902,type=Broker | org.apache.activemq.broker.jmx.ManagementContext | ActiveMQ BrokerService[localhost] Task-4154 activemq.log.68:2016-05-19 17:55:37,973 | INFO | Establishing network connection from vm://localhost?async=false to ssl://172.27.11.189:5902 | org.apache.activemq.network.DiscoveryNetworkConnector | Simple Discovery Agent-1 activemq.log.68:org.apache.activemq.transport.InactivityIOException: Cannot send, channel has already failed: tcp://172.27.11.189:5902 activemq.log.68:2016-05-19 17:55:37,974 | DEBUG | Stopping transport ssl:///172.27.11.189:5902 | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ BrokerService[localhost] Task-4154 activemq.log.68:2016-05-19 17:55:37,994 | DEBUG | Closed socket 38aff18d[TLS_RSA_WITH_AES_128_CBC_SHA: Socket[addr=/172.27.11.189,port=5902,localport=49490]] | org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task-1 activemq.log.68:2016-05-19 17:55:37,998 | DEBUG | Sending: WireFormatInfo { version=11, properties={TcpNoDelayEnabled=true, SizePrefixDisabled=false, CacheSize=1024, StackTraceEnabled=true, CacheEnabled=true, TightEncodingEnabled=true, MaxFrameSize=9223372036854775807, Host=172.27.11.189, MaxInactivityDuration=30000, MaxInactivityDurationInitalDelay=10000}, magic=[A,c,t,i,v,e,M,Q]} | org.apache.activemq.transport.WireFormatNegotiator | Simple Discovery Agent-1 Thanks. -- View this message in context: http://activemq.2283324.n4.nabble.com/ActiveMQ-fails-to-reconnect-after-Could-not-accept-connection-error-tp4712192.html Sent from the ActiveMQ - User mailing list archive at Nabble.com.