[ https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Brown updated CASSANDRA-13204:
------------------------------------
       Resolution: Fixed
    Fix Version/s:     (was: 3.11.x)
                       (was: 2.2.x)
                       (was: 2.1.x)
                   4.0
                   3.11.0
                   2.2.9
                   2.1.17
           Status: Resolved  (was: Ready to Commit)

committed as sha {{a6237bf65a95d654b7e702e81fd0d353460d0c89}} to 2.1, 2.2, 3.0, 3.11, and trunk. Thanks!

> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 2.1.17, 2.2.9, 3.0.11, 3.11.0, 4.0
>
>
> We found threads leaking from OutboundTcpConnection to machines which are not
> part of the cluster but are still in Gossip for some reason. There are two
> issues here; this JIRA covers the second one, which is the most important.
> 1) The first issue is that Gossip has information about machines that are not
> in the ring because they have been replaced. This causes Cassandra to try to
> connect to those machines, but due to internode auth it won't be able to
> connect to them at the socket level.
> 2) The second issue is a race between creating a connection and closing a
> connection, which is triggered by the gossip bug explained above. Let me try
> to explain it using the code.
> In OutboundTcpConnection, we call closeSocket(true), which sets
> isStopped=true and also puts a close sentinel into the queue to exit the
> thread. On the ack connection, Gossip tries to send a message, which calls
> connect(), which blocks for 10 seconds (the RPC timeout). It blocks because
> Cassandra might not be running there or because internode auth will not let
> it connect. If Gossip calls closeSocket during these 10 seconds, the close
> sentinel is put into the queue. When we return from the connect method after
> 10 seconds, we clear the backlog queue, which discards the close sentinel and
> causes this thread to leak.
> Proof from the heap dump of the affected machine which is leaking threads:
> 1. Only the ack connection is leaking, not the command connection, which is
> not used by Gossip.
> 2. We see a thread blocked on the backlog queue, with isStopped=true and an
> empty backlog queue. This is the state of the threads which have already
> leaked.
> 3. A running thread was blocked in connect waiting for the timeout (10
> seconds), and we see that its backlog queue contains the close sentinel. Once
> connect returns false, we will clear the backlog and this thread will have
> leaked.
> Interesting bits from jstack:
> 1282 threads named "MessagingService-Outgoing-/<IP-Address>"
> Thread which is about to leak:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.Net.connect0(Native Method)
>       at sun.nio.ch.Net.connect(Net.java:454)
>       at sun.nio.ch.Net.connect(Net.java:446)
>       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
>       at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>       at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
>       at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)
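
The race described in this issue can be illustrated with a small, single-threaded model. This is only a sketch, not the Cassandra code and not necessarily the committed fix: the class BacklogRaceSketch, its fields, and the recheckStopped flag are hypothetical names, connect() is simulated to always fail, and the worker uses a timed poll() in place of take() so that the "parked forever" state can be observed as a return value.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the backlog/close-sentinel race; not Cassandra's classes.
final class BacklogRaceSketch {
    private static final Object CLOSE_SENTINEL = new Object();

    private final BlockingQueue<Object> backlog = new LinkedBlockingQueue<>();
    private volatile boolean isStopped = false;
    private final boolean recheckStopped; // assumption: one possible mitigation

    BacklogRaceSketch(boolean recheckStopped) {
        this.recheckStopped = recheckStopped;
    }

    // Mirrors the description: mark stopped, then enqueue the close sentinel.
    void closeSocket() {
        isStopped = true;
        backlog.add(CLOSE_SENTINEL);
    }

    // Simulated connect: runs the hook "while blocked", then reports failure.
    private boolean connect(Runnable whileBlocked) {
        whileBlocked.run();
        return false;
    }

    // Returns true if the worker exits, false if it would park on an empty queue.
    boolean runWorker(Runnable whileBlocked) throws InterruptedException {
        while (true) {
            // The real worker blocks in take(); a timed poll stands in for it here.
            Object msg = backlog.poll(100, TimeUnit.MILLISECONDS);
            if (msg == null) {
                return false; // nothing will ever arrive: the thread has leaked
            }
            if (msg == CLOSE_SENTINEL) {
                return true; // normal shutdown path
            }
            if (!connect(whileBlocked)) {
                if (recheckStopped && isStopped) {
                    return true; // mitigation: honor a close that raced with connect
                }
                backlog.clear(); // drops pending messages AND the close sentinel
            }
        }
    }

    // Replays the sequence from the report: a gossip message triggers connect,
    // and closeSocket is called while connect is still blocked.
    static boolean workerExits(boolean recheckStopped) {
        BacklogRaceSketch conn = new BacklogRaceSketch(recheckStopped);
        conn.backlog.add("gossip message");
        try {
            return conn.runWorker(conn::closeSocket);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("exits without recheck: " + workerExits(false));
        System.out.println("exits with recheck:    " + workerExits(true));
    }
}
```

In this model, workerExits(false) ends with isStopped=true and an empty backlog, the same state the heap dump shows for the leaked threads, while workerExits(true) leaves the loop because it checks isStopped before clearing the backlog.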