James Baldassari created AVRO-1292:
--------------------------------------

             Summary: NettyTransceiver: Prevent client threads from blocking under certain connection failure scenarios
                 Key: AVRO-1292
                 URL: https://issues.apache.org/jira/browse/AVRO-1292
             Project: Avro
          Issue Type: Bug
          Components: java
    Affects Versions: 1.7.4
            Reporter: James Baldassari
            Assignee: James Baldassari


I've recently found a couple of different failure scenarios with 
NettyTransceiver that result in:
* Client threads blocking for long periods of time (uninterruptibly at that) 
while holding the {{stateLock}} write lock
* RPCs (either sync or async) never returning because a failure in sending the 
RPC was not propagated back up to the caller

The patch I'm going to submit will probably be a lot easier to understand, but 
I'll try to explain the main problems I found.  There is a single type of 
underlying connectivity issue that seems to trigger both of these problems in 
NettyTransceiver: a failure at the network layer causes all packets to be 
dropped somewhere between the RPC client and server.  You might think this 
would be a rare scenario, but it has happened several times in our production 
environment and usually occurs after the RPC server machine becomes 
unresponsive and needs to be physically rebooted.  The only way I've been able 
to reproduce this scenario for testing purposes has been to set up an iptables 
rule on the RPC server that simply drops all incoming packets from the client.  
For example, if the client's IP is 10.0.0.1 I would use the following iptables 
rule on the server to reproduce the failure:

{code}
iptables -t mangle -A INPUT --source 10.0.0.1 -j DROP
{code}

After looking through a lot of stack traces, I think I've identified two main 
problems:

*Problem 1:* NettyTransceiver calls 
{{ChannelFuture#awaitUninterruptibly(long)}} in a couple of places: 
{{getChannel()}} and {{disconnect(boolean,boolean,Throwable)}}.  Under the 
dropped-packet scenario I outlined above, the client thread ends up blocking 
uninterruptibly for the entire connection timeout duration while holding the 
{{stateLock}} write lock.  The stack trace for this situation looks like this:

{code}
"RPC Executor - 11 - 1363627762930" daemon prio=10 tid=0x00002aaad005f000 nid=0x56cf in Object.wait() [0x0000000049344000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:443)
        at org.jboss.netty.channel.DefaultChannelFuture.await0(DefaultChannelFuture.java:265)
        - locked <0x0000000703acfa00> (a org.jboss.netty.channel.DefaultChannelFuture)
        at org.jboss.netty.channel.DefaultChannelFuture.awaitUninterruptibly(DefaultChannelFuture.java:237)
        at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:248)
        at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:199)
        at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:148)
{code}

At a minimum it should be possible to interrupt these connection attempts.
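
To illustrate what "interruptible" means here, the following is a minimal JDK-only sketch, not NettyTransceiver's actual code and not necessarily what the patch does.  The class {{InterruptibleConnectSketch}}, the method names and the latch standing in for the connect future are all hypothetical.  In Netty 3 the interruptible counterpart of {{awaitUninterruptibly(long)}} is {{ChannelFuture#await(long)}}, which throws {{InterruptedException}}; the latch below plays that role.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: all names here are assumptions for illustration only.
class InterruptibleConnectSketch {
  private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
  // Stand-in for a connect future that never completes because packets are dropped.
  private final CountDownLatch connected = new CountDownLatch(1);

  /** Returns true if the connection completed before the timeout elapsed. */
  boolean awaitConnection(long timeoutMillis) throws InterruptedException {
    // lockInterruptibly() lets a thread queued behind the lock be interrupted too.
    stateLock.writeLock().lockInterruptibly();
    try {
      // Timed, interruptible wait: throws InterruptedException if interrupted.
      return connected.await(timeoutMillis, TimeUnit.MILLISECONDS);
    } finally {
      stateLock.writeLock().unlock();
    }
  }

  /** Demo: a client stuck in awaitConnection() is released by Thread.interrupt(). */
  static boolean interruptUnblocksCaller() {
    InterruptibleConnectSketch sketch = new InterruptibleConnectSketch();
    AtomicBoolean sawInterrupt = new AtomicBoolean(false);
    Thread client = new Thread(() -> {
      try {
        sketch.awaitConnection(60_000);
      } catch (InterruptedException e) {
        sawInterrupt.set(true);
      }
    });
    client.start();
    client.interrupt();
    try {
      client.join(5_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !client.isAlive() && sawInterrupt.get();
  }

  public static void main(String[] args) {
    System.out.println("interrupt released the client: " + interruptUnblocksCaller());
  }
}
```

With an uninterruptible wait in place of the latch's {{await}}, the client thread in this demo would stay blocked for the full 60 seconds regardless of the interrupt.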

*Problem 2:* When an error occurs writing to the Netty channel, the error is 
not passed back up the stack or callback chain (whether it's a sync or async 
RPC), so the client can end up waiting indefinitely for an RPC that will never 
return because an error occurred sending the Netty packet (i.e., it was never 
sent to the server in the first place).  This scenario might yield a stack 
trace like the following:

{code}
"main" prio=10 tid=0x00007f9400008800 nid=0x379b waiting on condition [0x00007f9406bc6000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007af677960> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
        at org.apache.avro.ipc.CallFuture.await(CallFuture.java:141)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:150)
        at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
        at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:88)
        at $Proxy9.send(Unknown Source)
{code}
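
The general shape of a fix for this kind of hang is to complete the caller's future with the error whenever the write fails.  The sketch below shows that idea using only the JDK; it is not the Avro or Netty code.  {{WriteFailureSketch}}, {{PendingCall}} and {{send}} are hypothetical names, the latch-based future is only loosely modelled on the {{CallFuture}} seen in the trace, and a {{CompletableFuture}} stands in for the future returned by the channel write.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: all names here are assumptions for illustration only.
class WriteFailureSketch {

  /** Minimal latch-based future that a caller blocks on until the RPC finishes. */
  static final class PendingCall {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile String result;
    private volatile Throwable error;

    void handleResult(String r) { result = r; latch.countDown(); }
    void handleError(Throwable t) { error = t; latch.countDown(); }

    String get(long timeout, TimeUnit unit) throws Exception {
      if (!latch.await(timeout, unit)) {
        throw new TimeoutException("no response within timeout");
      }
      if (error != null) {
        throw new IOException("RPC failed", error);
      }
      return result;
    }
  }

  /** Registers the call and wires write failures back to the waiting caller. */
  static PendingCall send(CompletableFuture<Void> writeFuture) {
    PendingCall call = new PendingCall();
    writeFuture.whenComplete((ignored, cause) -> {
      if (cause != null) {
        // Without this step the latch is never counted down and the caller hangs.
        call.handleError(cause);
      }
    });
    return call;
  }

  /** Demo: a failed write surfaces as an exception instead of an endless wait. */
  static String demo() {
    CompletableFuture<Void> write = new CompletableFuture<>();
    PendingCall call = send(write);
    write.completeExceptionally(new IOException("connection timed out"));
    try {
      return call.get(1, TimeUnit.SECONDS);
    } catch (Exception e) {
      Throwable cause = e.getCause();
      return "failed: " + (cause != null ? cause.getMessage() : e.toString());
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The same wiring covers both call styles: a synchronous caller blocked in {{get}} is woken with the error, and an asynchronous callback would receive it through {{handleError}}.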

It's difficult to provide a unit test for these issues because a connection 
refused error alone will not trigger them.  The only way I've been able to 
reliably reproduce them is by setting the iptables rule I mentioned above.  
Hopefully a code review will be sufficient, but if necessary I can try to find 
a way to create a unit test.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
