[ 
https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-13204:
------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 3.11.x)
                       (was: 2.2.x)
                       (was: 2.1.x)
                   4.0
                   3.11.0
                   2.2.9
                   2.1.17
           Status: Resolved  (was: Ready to Commit)

Committed as sha {{a6237bf65a95d654b7e702e81fd0d353460d0c89}} to 2.1, 2.2, 3.0, 
3.11, and trunk. Thanks!

> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 2.1.17, 2.2.9, 3.0.11, 3.11.0, 4.0
>
>
> We found threads leaking from OutboundTcpConnection to machines which are not 
> part of the cluster but are still in Gossip for some reason. There are two 
> issues here; this JIRA covers the second one, which is the more important. 
> 1) The first issue is that Gossip has information about machines that are no 
> longer in the ring because they have been replaced. This causes Cassandra to 
> try to connect to those machines, but due to internode auth it won't be able 
> to connect to them at the socket level.
> 2) The second issue is a race between creating a connection and closing a 
> connection, which is triggered by the gossip bug explained above. Let me try 
> to explain it using the code.
> In OutboundTcpConnection, we call closeSocket(true), which sets 
> isStopped=true and also puts a close sentinel into the queue to make the 
> thread exit. On the ack connection, Gossip tries to send a message, which 
> calls connect(), which blocks for 10 seconds (the RPC timeout). It blocks 
> because Cassandra might not be running there or internode auth will not let 
> it connect. If Gossip calls closeSocket during these 10 seconds, it puts the 
> close sentinel into the queue. When we return from the connect method after 
> 10 seconds, we clear the backlog queue, which discards the sentinel, so the 
> thread never sees it and leaks.
> Evidence from the heap dump of the affected machine, which is leaking threads: 
> 1. Only the ack connection is leaking, not the command connection, which is 
> not used by Gossip. 
> 2. On the threads which have already leaked, we see the thread blocked on the 
> backlog queue, with isStopped=true and an empty backlog queue. 
> 3. A running thread was blocked in connect, waiting for the timeout (10 
> seconds), and its backlog queue contained the close sentinel. Once connect 
> returns false, we will clear the backlog and this thread will have leaked.
> Interesting bits from jstack: 
> 1282 threads named "MessagingService-Outgoing-/<IP-Address>"
> Thread which is about to leak:
> "MessagingService-Outgoing-/<IP Address>" 
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.Net.connect0(Native Method)
>       at sun.nio.ch.Net.connect(Net.java:454)
>       at sun.nio.ch.Net.connect(Net.java:446)
>       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
>       at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>       at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
>       at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)
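
For readers who want to see the race from the quoted description in isolation, here is a minimal sketch. It is a simplified model, not the real OutboundTcpConnection code and not the committed patch: the class OutboundRaceSketch, the two latches, the string sentinel, and the recheckStopFlag guard are all made up for illustration. The latches stand in for the 10 second connect timeout so the interleaving is deterministic. The guard is only one plausible way to avoid losing the sentinel; see the commit for what was actually changed.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified, hypothetical model of the race; names mirror the report only loosely.
class OutboundRaceSketch {
    static final String CLOSE_SENTINEL = "CLOSE_SENTINEL";

    final BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
    final CountDownLatch connecting = new CountDownLatch(1); // worker has entered connect()
    final CountDownLatch timedOut = new CountDownLatch(1);   // stands in for the 10 s timeout
    final boolean recheckStopFlag;                           // hypothetical guard, not the real fix
    volatile boolean isStopped;

    OutboundRaceSketch(boolean recheckStopFlag) {
        this.recheckStopFlag = recheckStopFlag;
    }

    // Models closeSocket(true): set the flag and enqueue the close sentinel.
    void closeSocket() {
        isStopped = true;
        backlog.add(CLOSE_SENTINEL);
    }

    // Models a connect() that blocks until the timeout and then fails.
    boolean connect() throws InterruptedException {
        connecting.countDown();
        timedOut.await();
        return false;
    }

    // Models the worker loop: take a message, exit on the sentinel, else try to connect.
    void run() {
        try {
            while (true) {
                String message = backlog.take();
                if (CLOSE_SENTINEL.equals(message)) return;
                if (!connect()) {
                    if (recheckStopFlag && isStopped) return; // guard: honor a close requested meanwhile
                    backlog.clear(); // also discards a sentinel queued during connect()
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if the worker is still alive one second after close was requested.
    static boolean workerLeaks(boolean recheckStopFlag) {
        OutboundRaceSketch conn = new OutboundRaceSketch(recheckStopFlag);
        Thread worker = new Thread(conn::run, "MessagingService-Outgoing-sketch");
        worker.setDaemon(true); // let the JVM exit even when the worker leaks
        worker.start();
        try {
            conn.backlog.add("gossip message"); // triggers connect()
            conn.connecting.await();            // worker is now blocked in connect()
            conn.closeSocket();                 // sentinel is queued while connect() blocks
            conn.timedOut.countDown();          // connect() fails and returns false
            worker.join(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("without guard, leaked: " + workerLeaks(false));
        System.out.println("with guard, leaked: " + workerLeaks(true));
    }
}
```

Without the guard the worker ends up parked in LinkedBlockingQueue.take() with isStopped=true and an empty backlog, which is the same state as the "Thread already leaked" trace above. With the guard it exits as soon as connect() returns.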



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
