ActiveMQ fails to reconnect after "Could not accept connection" error

gramonta Thu, 19 May 2016 17:33:30 -0700

Issue: under unstable network conditions and/or network outages, ActiveMQ
connection is not re-established following the error:
Could not accept connection from tcp://hostname.domain.com:portNumber


Any settings that could help with recovering from this error would be much
appreciated.

ActiveMQ versions 5.13.0 (client) 5.13.1 (server)

Setup: we have servers worldwide and network connectivity is not optimal
everyday everywhere. 
Network outage are not unusual, sometimes for hours, and this error requires
a restart of the broker as restarting the spoke does not help.
ActiveMQ traffic is very irregular with long periods of inactivity and
periods of high activity.

Configuration
Spoke: acts as a listener, but is the originator of the messages (no network
connector definition in its activemq.xml)
Broker: establishes connection to spokes, uses static bridge, duplex
connection.

Sample network connector:
<networkConnector name="hostname.domain.com-STATS"
uri="static:(ssl://hostname.domain.com:5902)?useExponentialBackOff=false&amp;initialReconnectDelay=5000&amp;maxReconnectDelay=5000&amp;jms.watchTopicAdvisories=false"
duplex="true" staticBridge="true">
        <staticallyIncludedDestinations>
          <queue physicalName="tool.stats"/>
          <queue physicalName="user.access"/>
          <queue physicalName="user.session"/>
          <queue physicalName="user.service"/>
          <queue physicalName="tool.lightpole"/>
          <queue physicalName="system.config"/>
        </staticallyIncludedDestinations>
</networkConnector>

Test steps to re-create the error seen on production servers.
In this test, there are no messages sent.
This is just one scenario, there maybe others leading to the same situation,
but following the steps below ensure the issue can be reproduced.
For instance, the same result can be achieved without step 5, but the wait
may be a lot longer as it is all a timing issue.

1. Start with an established healthy connection
2. On the spoke, using the linux network emulator, simulate an 80% packet
loss (in the real world it would require be much less, but in a test
environment with AMQ running in VMs in the same VM cluster and not to wait
forever, this does the trick).
   The command used is tc qdisc [add/replace] dev apdaterName root netem
loss 80%
3. Eventually this will cause AMQ to drop the consumer (i.e. when monitoring
the status via the AMQ console, it shows 0 consumer)
4. While there is 0 consumer, ensure that both server and client show a TCP
established connection (netstat -nat | grep portNumber shows ESTABLISHED)
5. Severe the network connection completely by simulating 100% packet loss
(tc qdisc replace dev apdaterName root netem loss 100%)
6. Wait -it takes anything from 2-10 minutes- until the error Could not
accept connection occurs on the spoke. At that point the TCP connection has
been dropped on the spoke.
7. Re-enable the network (tc qdisc replace dev apdaterName root netem loss
0%), but the connection is never re-established

Result:
The exception associated with "Could not accept connection" will vary per
test and can be anything from ssl handshake error, or no route to host, or
EOFException...
On the broker: there is no further attempt to reconnect and the TCP
connection remains established. ActiveMQ must be restarted to re-establish a
valid connection.

Note: if the broker happens to drop the TCP connection BEFORE the spoke,
then the connection is re-established successfully.

Using the soTimeout/soWriteTimeout keepAlive transport options does not seem
to affect the behavior
Interestingly, the debug logging socket close occurred when shutting down
AMQ in this scenario.

Log snippet:

activemq.log.66:2016-05-19 17:56:07,998 | DEBUG | No connection attempt made
in time for ssl:///172.27.11.189:5902! Throwing InactivityIOException. |
org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ
InactivityMonitor ReadCheckTimer
activemq.log.68:2016-05-19 17:55:37,964 | DEBUG | No message received since
last read check for ssl:///172.27.11.189:5902. Throwing
InactivityIOException. |
org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ
InactivityMonitor ReadCheckTimer
activemq.log.68:2016-05-19 17:55:37,965 | DEBUG | Running
WriteCheck[tcp://172.27.11.189:5902] |
org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ
InactivityMonitor Worker
activemq.log.68:2016-05-19 17:55:37,966 | DEBUG | Running
ReadCheck[tcp://172.27.11.189:5902] |
org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ
InactivityMonitor Worker
activemq.log.68:2016-05-19 17:55:37,968 | WARN  | Network connection between
vm://localhost#112 and ssl:///172.27.11.189:5902 shutdown due to a remote
error: org.apache.activemq.transport.InactivityIOException: Channel was
inactive for too (>30000) long: tcp://172.27.11.189:5902 |
org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ
InactivityMonitor Worker
activemq.log.68:2016-05-19 17:55:37,968 | DEBUG | The remote Exception was:
org.apache.activemq.transport.InactivityIOException: Channel was inactive
for too (>30000) long: tcp://172.27.11.189:5902 |
org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ
InactivityMonitor Worker
activemq.log.68:org.apache.activemq.transport.InactivityIOException: Channel
was inactive for too (>30000) long: tcp://172.27.11.189:5902
activemq.log.68:2016-05-19 17:55:37,972 | DEBUG | Unregistering MBean
org.apache.activemq:brokerName=localhost,connector=networkConnectors,networkConnectorName=ECNICE_ILSTECHNOLOGY_COM-STATS,networkBridge=tcp_//172.27.11.189_5902,type=Broker
| org.apache.activemq.broker.jmx.ManagementContext | ActiveMQ
BrokerService[localhost] Task-4154
activemq.log.68:2016-05-19 17:55:37,973 | INFO  | Establishing network
connection from vm://localhost?async=false to ssl://172.27.11.189:5902 |
org.apache.activemq.network.DiscoveryNetworkConnector | Simple Discovery
Agent-1
activemq.log.68:org.apache.activemq.transport.InactivityIOException: Cannot
send, channel has already failed: tcp://172.27.11.189:5902
activemq.log.68:2016-05-19 17:55:37,974 | DEBUG | Stopping transport
ssl:///172.27.11.189:5902 | org.apache.activemq.transport.tcp.TcpTransport |
ActiveMQ BrokerService[localhost] Task-4154
activemq.log.68:2016-05-19 17:55:37,994 | DEBUG | Closed socket
38aff18d[TLS_RSA_WITH_AES_128_CBC_SHA:
Socket[addr=/172.27.11.189,port=5902,localport=49490]] |
org.apache.activemq.transport.tcp.TcpTransport | ActiveMQ Task-1
activemq.log.68:2016-05-19 17:55:37,998 | DEBUG | Sending: WireFormatInfo {
version=11, properties={TcpNoDelayEnabled=true, SizePrefixDisabled=false,
CacheSize=1024, StackTraceEnabled=true, CacheEnabled=true,
TightEncodingEnabled=true, MaxFrameSize=9223372036854775807,
Host=172.27.11.189, MaxInactivityDuration=30000,
MaxInactivityDurationInitalDelay=10000}, magic=[A,c,t,i,v,e,M,Q]} |
org.apache.activemq.transport.WireFormatNegotiator | Simple Discovery
Agent-1

Thanks.






--
View this message in context: 
http://activemq.2283324.n4.nabble.com/ActiveMQ-fails-to-reconnect-after-Could-not-accept-connection-error-tp4712192.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.

ActiveMQ fails to reconnect after "Could not accept connection" error

Reply via email to