[ 
https://issues.apache.org/jira/browse/AVRO-1407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doug Cutting updated AVRO-1407:
-------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I committed this.  Thanks, Gareth!

> NettyTransceiver can cause an infinite loop when slow to connect
> ----------------------------------------------------------------
>
>                 Key: AVRO-1407
>                 URL: https://issues.apache.org/jira/browse/AVRO-1407
>             Project: Avro
>          Issue Type: Bug
>          Components: java
>    Affects Versions: 1.7.5, 1.7.6
>            Reporter: Gareth Davis
>            Assignee: Gareth Davis
>             Fix For: 1.7.8
>
>         Attachments: AVRO-1407-1.patch, AVRO-1407-2.patch, 
> AVRO-1407-testcase.patch
>
>
> When a new {{NettyTransceiver}} is created, it forces the channel to be 
> allocated and connected to the remote host. It waits up to connectTimeout 
> ms on the [connect channel 
> future|https://github.com/apache/avro/blob/1579ab1ac95731630af58fc303a07c9bf28541d6/lang/java/ipc/src/main/java/org/apache/avro/ipc/NettyTransceiver.java#L271].
> This is obviously a good thing. The catch is that when the connect is 
> unsuccessful, i.e. {{!channelFuture.isSuccess()}}, an exception is thrown and 
> the call to the constructor fails with an {{IOException}}, but this can leave 
> an active channel associated with the {{ChannelFactory}}.
> The problem is that a Netty {{NioClientSocketChannelFactory}} will not 
> shut down if there are active channels still around, and if you have supplied 
> the {{ChannelFactory}} to the {{NettyTransceiver}} then you will not be able 
> to shut it down by calling {{ChannelFactory.releaseExternalResources()}}, as 
> the [Flume Avro RPC client 
> does|https://github.com/apache/flume/blob/b8cf789b8509b1e5be05dd0b0b16c5d9af9698ae/flume-ng-sdk/src/main/java/org/apache/flume/api/NettyAvroRpcClient.java#L158].
> In order to recreate this you need a very laggy network, where the connect 
> attempt takes longer than the connect timeout but does actually work. This 
> is very hard to organise in a test case, although I do have a test setup 
> using Vagrant VMs that recreates this every time, using the Flume RPC client 
> and server.
> The following stack trace is from a production system. The thread won't 
> ever recover until the channel is disconnected (by forcing a disconnect at 
> the remote host) or the JVM is restarted.
> {noformat:title=Production stack trace}
> "TLOG-0" daemon prio=10 tid=0x00007f581c7be800 nid=0x39a1 waiting on condition [0x00007f57ef9f2000]
>   java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   parking to wait for <0x00000007218b16e0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
>   at java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1253)
>   at org.jboss.netty.util.internal.ExecutorUtil.terminate(ExecutorUtil.java:103)
>   at org.jboss.netty.channel.socket.nio.AbstractNioWorkerPool.releaseExternalResources(AbstractNioWorkerPool.java:80)
>   at org.jboss.netty.channel.socket.nio.NioClientSocketChannelFactory.releaseExternalResources(NioClientSocketChannelFactory.java:181)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:142)
>   at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:101)
>   at org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:564)
>   locked <0x00000006c30ae7b0> (a org.apache.flume.api.NettyAvroRpcClient)
>   at org.apache.flume.api.RpcClientFactory.getInstance(RpcClientFactory.java:88)
>   at org.apache.flume.api.LoadBalancingRpcClient.createClient(LoadBalancingRpcClient.java:214)
>   at org.apache.flume.api.LoadBalancingRpcClient.getClient(LoadBalancingRpcClient.java:205)
>   locked <0x00000006a97b18e8> (a org.apache.flume.api.LoadBalancingRpcClient)
>   at org.apache.flume.api.LoadBalancingRpcClient.appendBatch(LoadBalancingRpcClient.java:95)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:45)
>   at com.ean.platform.components.tlog.client.service.AvroRpcEventRouter$1.call(AvroRpcEventRouter.java:43)
> {noformat}
> The solution is very simple, and a patch should be along in a moment.
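
For readers without the patches to hand, the failure mode described above can be modelled with nothing but the JDK. The sketch below is not the attached patch and does not use Netty or Avro: `FakeChannel`, `terminate` and `connectTimesOut` are hypothetical names invented for illustration. It assumes, as the stack trace suggests, that releasing the factory amounts to repeatedly shutting down a worker pool and waiting for it to terminate, and that a worker with a live channel does not exit on interrupt. The real loop is unbounded; here it is capped at a few attempts so the demo finishes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LeakedChannelSketch {

    /** Hypothetical stand-in for a channel that is still live after the connect timeout. */
    static final class FakeChannel {
        private final CountDownLatch started = new CountDownLatch(1);
        private final CountDownLatch closed = new CountDownLatch(1);

        void close() {
            closed.countDown();
        }

        /** Worker body: services the channel until it is closed, ignoring interrupts. */
        void serviceUntilClosed() {
            started.countDown();
            while (closed.getCount() > 0) {
                try {
                    closed.await();
                } catch (InterruptedException ignored) {
                    // Assumption of this model: a worker with a live channel
                    // does not exit just because it was interrupted.
                }
            }
        }
    }

    /** Bounded model of a "shutdownNow, awaitTermination, repeat" release loop. */
    static boolean terminate(ExecutorService workers, int attempts) throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            workers.shutdownNow();
            if (workers.awaitTermination(50, TimeUnit.MILLISECONDS)) {
                return true;
            }
        }
        return false; // the unbounded original would still be looping here
    }

    /**
     * Models a constructor whose connect attempt missed the timeout, followed by
     * the caller trying to release the worker pool. Returns true if the release
     * completed within the bounded number of attempts.
     */
    static boolean connectTimesOut(boolean closeOnFailure) throws InterruptedException {
        ExecutorService workers = Executors.newFixedThreadPool(1);
        FakeChannel channel = new FakeChannel();
        workers.execute(channel::serviceUntilClosed);
        channel.started.await(); // make sure the worker really owns the channel
        try {
            if (closeOnFailure) {
                channel.close(); // clean up before reporting the failed connect
            }
            return terminate(workers, 5);
        } finally {
            channel.close(); // demo-only cleanup so the JVM can exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("channel left open: released=" + connectTimesOut(false));
        System.out.println("channel closed:    released=" + connectTimesOut(true));
    }
}
```

With the channel left open, the bounded release loop gives up and reports `false`; closing the channel before the failure is reported lets the pool terminate on the first attempt. Whether the real fix closes the channel, cancels the connect future, or does something else is not stated in this message, so treat the `closeOnFailure` branch as one plausible shape only.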



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
