NPE in AntiEntropyService$RepairSession.completed() ---------------------------------------------------
Key: CASSANDRA-3548 URL: https://issues.apache.org/jira/browse/CASSANDRA-3548 Project: Cassandra Issue Type: Bug Components: Core Affects Versions: 1.0.1 Environment: Free BSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0 Reporter: Aaron Morton Assignee: Aaron Morton Priority: Minor This may be related to CASSANDRA-3519 (cluster it was observed on is still 1.0.1), however i think there is still a race condition. Observed on a 2 DC cluster, during a repair that spanned the DC's. {noformat} INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602 ranges to /10.6.130.70 (to be streamed with /10.37.114.10) ... INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main] java.lang.NullPointerException at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712) at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912) at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186) at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) {noformat} One of the nodes involved in the repair session failed, e.g. (Not sure if this is from the same repair session as the streaming task above, but it illustrates the issue) {noformat} ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error java.io.IOException: Endpoint /10.29.60.10 died at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725) at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762) at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192) at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559) at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62) at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 232) StreamOutSession /10.29.60.10 failed because {} died or was restarted/removed ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip error java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782) at java.util.ArrayList$Itr.next(ArrayList.java:754) at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190) at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559) at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62) at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) {noformat} When a node is marked as failed AntiEntropyService.RepairSession.forceShutdown() clears the activejobs map. But the jobs to other nodes will continue, and will eventually call completed(). RepairSession.terminated should stop completed() from checking the map, but there is a race between the map been cleared and if there is an error in finally block it wont be set. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira