[ 
https://issues.apache.org/jira/browse/CASSANDRA-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860572#comment-15860572
 ] 

Ariel Weisberg edited comment on CASSANDRA-13204 at 2/10/17 3:07 AM:
---------------------------------------------------------------------

* [Did you finish this 
comment?|https://github.com/apache/cassandra/compare/trunk...jasobrown:13204-trunk?expand=1#diff-c7ef124561c4cde1c906f28ad3883a88R184]
* --[This would cause a concurrent modification exception with the 
iterator|https://github.com/apache/cassandra/compare/cassandra-2.1...jasobrown:13204-2.1?expand=1#diff-c7ef124561c4cde1c906f28ad3883a88R225]--
 Never mind, I forgot about the jump.
* [It's not clear to me that you need to move the 
backlog.clear()?|https://github.com/apache/cassandra/compare/cassandra-2.1...jasobrown:13204-2.1?expand=1#diff-c7ef124561c4cde1c906f28ad3883a88L164]

I think I understand the issue. A failed connection clobbers the sentinel with 
backlog.clear(). You fixed the clobbering by relegating the sentinel to just a 
tool to wake up the thread. The flag now controls the loop, and if a connection 
fails the break gets out to where the loop condition is checked.
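
To check my understanding, here is a minimal single-threaded sketch of the two 
loop designs. It is not the real OutboundTcpConnection code: the class Conn, 
the methods sentinelLoopExits and flagLoopExits, and the CLOSE_SENTINEL object 
are all hypothetical names, and poll() stands in for the blocking take() so the 
sketch cannot hang.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SentinelRaceSketch {
    // Hypothetical stand-in for the close sentinel message.
    static final Object CLOSE_SENTINEL = new Object();

    /** Simplified connection state; an assumption, not the real class. */
    static class Conn {
        final BlockingQueue<Object> backlog = new LinkedBlockingQueue<>();
        volatile boolean isStopped = false;

        // Mirrors the described closeSocket(true): set the flag, enqueue the sentinel.
        void close() {
            isStopped = true;
            backlog.add(CLOSE_SENTINEL);
        }
    }

    /**
     * Sentinel-controlled loop, one iteration after a failed connect.
     * Returns true if the thread would exit, false if it would go back
     * to waiting on an empty queue (the leak).
     */
    static boolean sentinelLoopExits(Conn c) {
        // connect() returned false, so the backlog is dropped, including
        // any sentinel that close() enqueued while connect() was blocked.
        c.backlog.clear();
        // poll() stands in for take() so this sketch does not block.
        Object next = c.backlog.poll();
        return next == CLOSE_SENTINEL;
    }

    /**
     * Flag-controlled loop, one iteration after a failed connect.
     * The sentinel only wakes the thread; the flag decides whether to exit.
     */
    static boolean flagLoopExits(Conn c) {
        c.backlog.clear();
        // After the break, the loop condition (!isStopped) is re-checked.
        return c.isStopped;
    }

    public static void main(String[] args) {
        Conn a = new Conn();
        a.close(); // arrives while connect() is still blocked
        System.out.println("sentinel-controlled loop exits: " + sentinelLoopExits(a));

        Conn b = new Conn();
        b.close();
        System.out.println("flag-controlled loop exits: " + flagLoopExits(b));
    }
}
```

In the first case the sentinel is gone by the time the loop looks for it, so 
the thread would park on the empty queue forever. In the second case the 
cleared backlog no longer matters because the flag survives the clear.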


> Thread Leak in OutboundTcpConnection
> ------------------------------------
>
>                 Key: CASSANDRA-13204
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13204
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: sankalp kohli
>            Assignee: Jason Brown
>             Fix For: 3.0.11, 2.1.x, 2.2.x, 3.11.x
>
>
> We found threads leaking from OutboundTcpConnection to machines which are not 
> part of the cluster but are still in Gossip for some reason. There are two 
> issues here; this JIRA covers the second one, which is the more important. 
> 1) The first issue is that Gossip has information about machines that have 
> been replaced and are no longer in the ring. This causes Cassandra to try to 
> connect to those machines, but due to internode auth it won't be able to 
> connect to them at the socket level. 
> 2) The second issue is a race between creating a connection and closing a 
> connection, which is triggered by the gossip bug explained above. Let me try 
> to explain it using the code. 
> In OutboundTcpConnection, we call closeSocket(true), which sets 
> isStopped=true and also puts a close sentinel into the queue to exit the 
> thread. On the ack connection, Gossip tries to send a message, which calls 
> connect(), which blocks for 10 seconds (the RPC timeout). It blocks because 
> Cassandra might not be running there or internode auth will not let it 
> connect. If Gossip calls closeSocket during these 10 seconds, it puts the 
> close sentinel into the queue. When we return from the connect method after 
> 10 seconds, we clear the backlog queue, which discards the sentinel and 
> causes this thread to leak. 
> Evidence from the heap dump of the affected machine which is leaking threads: 
> 1. Only the ack connection is leaking, not the command connection, which is 
> not used by Gossip. 
> 2. We see threads blocked on the backlog queue with isStopped=true and an 
> empty backlog queue. These are the threads which have already leaked. 
> 3. A running thread was blocked in connect waiting for the timeout (10 
> seconds), and its backlog queue contained the close sentinel. Once connect 
> returns false, we will clear the backlog and this thread will have leaked. 
> Interesting bits from jstack: 
> 1282 threads named "MessagingService-Outgoing-/<IP-Address>"
> Thread which is about to leak:
> "MessagingService-Outgoing-/<IP Address>" 
>    java.lang.Thread.State: RUNNABLE
>       at sun.nio.ch.Net.connect0(Native Method)
>       at sun.nio.ch.Net.connect(Net.java:454)
>       at sun.nio.ch.Net.connect(Net.java:446)
>       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       - locked <> (a java.lang.Object)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:137)
>       at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:119)
>       at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:381)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:217)
> Thread already leaked:
> "MessagingService-Outgoing-/<IP Address>"
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>       at org.apache.cassandra.utils.CoalescingStrategies$DisabledCoalescingStrategy.coalesceInternal(CoalescingStrategies.java:482)
>       at org.apache.cassandra.utils.CoalescingStrategies$CoalescingStrategy.coalesce(CoalescingStrategies.java:213)
>       at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:190)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
