[ https://issues.apache.org/jira/browse/CASSANDRA-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams updated CASSANDRA-2072:
----------------------------------------

    Description:
Occasionally when decommissioning a node, there is a race condition that occurs where another node will never remove the token and thus propagate it again with a state of down. With CASSANDRA-1900 we can solve this, but it shouldn't occur in the first place.

Given nodes A, B, and C, if you decommission B it will stream to A and C. When complete, B will decommission and receive this stacktrace:

ERROR 00:02:40,282 Fatal exception in thread Thread[Thread-5,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:387)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91

At this point A will show it is removing B's token, but C will not and instead its failure detector will report that B is dead, and nodetool ring on C shows B in a leaving/down state. In another gossip round, C will propagate this state back to A.

  was:
Occasionally when decommissioning a node, there is a race condition that occurs where another node will never remove the token and thus propagate it again with a state of down. With CASSANDRA-1900 we can solve this, but it shouldn't occur in the first place.

Given nodes A, B, and C, if you decommission B it will stream to A and C.
When complete, B will decommission and receive this stacktrace:

ERROR 00:02:40,282 Fatal exception in thread Thread[Thread-5,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:62)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:387)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91

At this point A will show it is removing B's token, but C will not and instead it's failure detector will report that B is dead, and nodetool ring on C shows A in a leaving/down state. In another gossip round, C will propagate this state back to A.


> Race condition during decommission
> ----------------------------------
>
>                 Key: CASSANDRA-2072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2072
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.7.0
>            Reporter: Brandon Williams
>            Priority: Minor
>
> Occasionally when decommissioning a node, there is a race condition that
> occurs where another node will never remove the token and thus propagate it
> again with a state of down. With CASSANDRA-1900 we can solve this, but it
> shouldn't occur in the first place.
> Given nodes A, B, and C, if you decommission B it will stream to A and C.
> When complete, B will decommission and receive this stacktrace:
> ERROR 00:02:40,282 Fatal exception in thread Thread[Thread-5,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:62)
>         at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>         at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:387)
>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91
> At this point A will show it is removing B's token, but C will not and
> instead its failure detector will report that B is dead, and nodetool ring on
> C shows B in a leaving/down state. In another gossip round, C will propagate
> this state back to A.
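
For readers unfamiliar with the exception in the stacktrace above, here is a minimal sketch of how a java.util.concurrent executor behaves when a task arrives after shutdown. It is not Cassandra code: the class name RejectedAfterShutdown and the method rejectedAfterShutdown are hypothetical, and it relies on the JDK's default rejection policy rather than the custom handler in DebuggableThreadPoolExecutor that produced the "ThreadPoolExecutor has shut down" message. It only illustrates the general mechanism, on the assumption that a late inbound message on B is handed to a pool that has already been stopped.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {
    // Returns true if a task handed to the pool after shutdown() is rejected.
    public static boolean rejectedAfterShutdown() {
        // Hypothetical stand-in for a message-handling stage on the leaving node.
        ExecutorService stage = Executors.newFixedThreadPool(1);

        // Stand-in for the node stopping its stages at the end of decommission.
        stage.shutdown();

        try {
            // Stand-in for a message that arrives after the pool was stopped.
            stage.execute(() -> { });
            return false;
        } catch (RejectedExecutionException e) {
            // The JDK's default policy rejects tasks submitted after shutdown.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + rejectedAfterShutdown());
    }
}
```

In this sketch the submitting thread catches the exception. In the stacktrace above it surfaced as a fatal exception in the receiving thread instead.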