[ https://issues.apache.org/jira/browse/CASSANDRA-5154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams resolved CASSANDRA-5154.
-----------------------------------------

    Resolution: Cannot Reproduce

I'm convinced there's a clock problem here.

> Gossip sends removed node which causes restarted nodes to constantly create
> new threads
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5154
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5154
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.7
>         Environment: centos 6, JVM 1.6.0_37
>            Reporter: Mariusz Gronczewski
>            Assignee: Brandon Williams
>            Priority: Minor
>
> Our Cassandra cluster had 14 nodes, but it was mostly idle, so about 2 weeks
> ago we removed 3 of them (via standard decommission) and moved tokens to
> balance load.
> No node was restarted at that time, but last week, after restarting 2 of
> them, we observed that both spawn threads (WRITE-/1.2.3.4, where 1.2.3.4 is
> the IP of one of the removed nodes) until they hit the limit (which is 800 on
> our system), and then Cassandra dies. Nodes that were not restarted do not do
> that. There are no outgoing connections to those dead nodes.
> I noticed the dead nodes are still in nodetool gossipinfo on non-restarted
> nodes but not on restarted ones, so it seems they are not properly removed
> from gossip.
> Would a rolling restart fix this, or is a full cluster stop-start required?
> trace from hanging threads:
> {code}
> "WRITE-/1.2.3.4" daemon prio=10 tid=0x00007f5fe8194000 nid=0x2fb2 waiting on condition [0x00007f6020de0000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000007536a1160> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>         at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:104)
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira