We are doing "break testing" on our clustered deployment and are running into
an issue with ActiveMQ being slow to notice a node that has been "cable
pulled". We are seeking advice on how best to configure the connections
between the brokers so that this failure is noticed quickly.
We have a producer box and two consumer boxes connected by network
connectors with the options below. Our producer is enqueuing messages to a
distributed queue that is serviced by both consumer boxes. Normally
traffic is round-robined between the two consumer boxes correctly. In all of
these tests we are producing messages at a constant rate.
Producer configuration:
<networkConnector name="producer-to-consumer1"
uri="static:tcp://consumer1:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>
<networkConnector name="producer-to-consumer2"
uri="static:tcp://consumer2:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>
Consumer configuration:
<networkConnector name="consumer-to-producer"
uri="static:tcp://producerbox:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000"
/>
If we do a "cable pull" on one of the consumer boxes, it can take many
minutes before the broker notices that the connection is down. We eventually
did get a failure from the WriteTimeoutFilter, but instead of arriving after
two seconds as expected, it didn't come for nearly 5 minutes. When it finally
tripped, all of the messages that had been enqueued for the bad consumer were
correctly resent to the good consumer, and all future traffic switched over
to the good consumer.
Below is the log from the producer side. We pulled the cable at 14:20:36; the
expected inactivity failure on the "reverse bridge" from the consumer broker
came a few seconds later, but our "forward bridge" didn't fail for around 5
minutes.
2015-02-20 14:20:41,798 | WARN | Transport Connection to:
tcp://10.22.19.206:44208 failed:
org.apache.activemq.transport.InactivityIOException: Channel was inactive for
too (>3000) long: tcp://10.22.19.206:44208 |
org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ
InactivityMonitor Worker
2015-02-20 14:25:15,276 | WARN | Forced write timeout
for:tcp://10.22.19.206:61616 | org.apache.activemq.transport.WriteTimeoutFilter
| WriteTimeoutFilter-Timeout-1
2015-02-20 14:25:15,278 | WARN | Caught an exception processing local command
| org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ
BrokerService[scaleha-gw2] Task-37
java.net.SocketException: Socket closed
	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
	at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.activemq.transport.tcp.TcpBufferedOutputStream.flush(TcpBufferedOutputStream.java:115)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.activemq.transport.tcp.TcpTransport.oneway(TcpTransport.java:176)
	at org.apache.activemq.transport.AbstractInactivityMonitor.doOnewaySend(AbstractInactivityMonitor.java:304)
	at org.apache.activemq.transport.AbstractInactivityMonitor.oneway(AbstractInactivityMonitor.java:286)
	at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
	at org.apache.activemq.transport.WireFormatNegotiator.oneway(WireFormatNegotiator.java:104)
	at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
	at org.apache.activemq.transport.WriteTimeoutFilter.oneway(WriteTimeoutFilter.java:73)
	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
	at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
	at org.apache.activemq.network.DemandForwardingBridgeSupport.serviceLocalCommand(DemandForwardingBridgeSupport.java:997)
	at org.apache.activemq.network.DemandForwardingBridgeSupport$2.onCommand(DemandForwardingBridgeSupport.java:206)
	at org.apache.activemq.transport.ResponseCorrelator.onCommand(ResponseCorrelator.java:116)
	at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
	at org.apache.activemq.transport.vm.VMTransport.doDispatch(VMTransport.java:138)
	at org.apache.activemq.transport.vm.VMTransport.dispatch(VMTransport.java:130)
	at org.apache.activemq.transport.vm.VMTransport.oneway(VMTransport.java:107)
	at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
	at org.apache.activemq.transport.ResponseCorrelator.oneway(ResponseCorrelator.java:60)
	at org.apache.activemq.broker.TransportConnection.dispatch(TransportConnection.java:1370)
	at org.apache.activemq.broker.TransportConnection.processDispatch(TransportConnection.java:889)
	at org.apache.activemq.broker.TransportConnection.iterate(TransportConnection.java:935)
	at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
	at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
I believe I know what is occurring but not how to fix it. The soWriteTimeout
filter isn't firing because each socket write actually succeeds and returns
quickly; the data is being queued into the Linux socket buffers, and the
write doesn't block until those buffers fill. We validated this by watching
netstat as we ran the test: the Send-Q slowly filled up as traffic flowed,
and a few seconds after it stopped growing the socket write timeout tripped.
Proto Recv-Q Send-Q Local Address       Foreign Address     State
tcp        0      0 10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  // normal
tcp        0   7653 10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  // early after cable pull
tcp        0  53400 10.22.19.205:39466  10.22.19.206:61616  ESTABLISHED  // two minutes after cable pull
tcp        0 129489 10.22.19.205:39466  10.22.19.206:61616  FIN_WAIT1    // final state after writeTimeout fires
I reran the same experiment without any messages flowing and detection took
15 minutes. I think this took much longer because far fewer bytes were being
written to the socket, so the buffer filled much more slowly.
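As an aside, and this is our reading of the kernel behavior rather than
something we have fully confirmed: even if nothing at the broker layer
notices, the kernel itself abandons a connection with unacknowledged data
after net.ipv4.tcp_retries2 retransmission attempts (default 15, which works
out to a timeout on the order of 15-30 minutes with exponential backoff).
Lowering it is a system-wide knob affecting every TCP socket on the box, so
the below is only a sketch of the tuning, not something we have committed to:

```shell
# Kernel retransmission limit governing how long an established connection
# with unacked data survives a dead peer (system-wide setting).
sysctl net.ipv4.tcp_retries2              # show the current value (default: 15)
sudo sysctl -w net.ipv4.tcp_retries2=8    # roughly the RFC 1122 minimum of ~100 s
```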
I was expecting the inactivity monitor timeout to fire to protect us from
this case, but I think it considers the connection active every time it
dispatches a message to that consumer, so that timer never fires either.
Normally this would be handled with an application-level timeout. Presumably
in this case we should be waiting for acknowledgement of receipt of the
message from the other broker. The stats show the messages as dispatched but
not dequeued while we keep the consumer box off the network. I was expecting
the inactivity timer to look for the successful completion of a message, not
just its send.
Is there a setting somewhere I am missing? It seems like this should be a
relatively common failure mode, maybe other people have enough data flowing
that they fill those buffers incredibly quickly? We have investigated using
keepAlive but in general those timeouts are still too slow for our purposes.
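To put a number on "too slow": with stock Linux settings, TCP keepalive does
not declare a peer dead until tcp_keepalive_time seconds of idle plus
tcp_keepalive_probes unanswered probes spaced tcp_keepalive_intvl seconds
apart. A quick back-of-the-envelope with the defaults:

```shell
# Worst-case dead-peer detection with default Linux TCP keepalive settings.
tcp_keepalive_time=7200    # seconds of idle before the first probe (default)
tcp_keepalive_probes=9     # unanswered probes before the peer is declared dead
tcp_keepalive_intvl=75     # seconds between probes
detect=$((tcp_keepalive_time + tcp_keepalive_probes * tcp_keepalive_intvl))
echo "${detect} seconds"   # 7875 seconds, i.e. about 2h11m
```

Keepalive probes also only run on an otherwise idle connection; once there is
unacknowledged data in flight the retransmission timer takes over instead, so
keepalive would not have helped our steady-traffic case anyway.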
Sam Hendley