[ https://issues.apache.org/jira/browse/CASSANDRA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161107#comment-13161107 ]
Aaron Morton commented on CASSANDRA-3548:
-----------------------------------------

Much nicer. I'm not able to test in place on the cluster, but +1. Thanks.

> NPE in AntiEntropyService$RepairSession.completed()
> ---------------------------------------------------
>
> Key: CASSANDRA-3548
> URL: https://issues.apache.org/jira/browse/CASSANDRA-3548
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Affects Versions: 1.0.1
> Environment: FreeBSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server VM/1.6.0
> Reporter: Aaron Morton
> Assignee: Aaron Morton
> Priority: Minor
> Attachments: 0001-3548.patch, 3548-v2.patch
>
>
> This may be related to CASSANDRA-3519 (the cluster it was observed on is still 1.0.1); however, I think there is still a race condition.
> Observed on a 2 DC cluster, during a repair that spanned the DCs.
> {noformat}
> INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602 ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
> ...
> INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded
> ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main]
> java.lang.NullPointerException
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
>     at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
>     at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
>     at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> {noformat}
> One of the nodes involved in the repair session failed, e.g. (Not sure if this is from the same repair session as the streaming task above, but it illustrates the issue)
> {noformat}
> ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error
> java.io.IOException: Endpoint /10.29.60.10 died
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
>     at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
>     at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
>     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>     at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>     at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 232) StreamOutSession /10.29.60.10 failed because {} died or was restarted/removed
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip error
> java.util.ConcurrentModificationException
>     at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782)
>     at java.util.ArrayList$Itr.next(ArrayList.java:754)
>     at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190)
>     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>     at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>     at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>     at java.lang.Thread.run(Thread.java:679)
> {noformat}
> When a node is marked as failed, AntiEntropyService.RepairSession.forceShutdown() clears the activejobs map. But the jobs to other nodes will continue, and will eventually call completed().
> RepairSession.terminated should stop completed() from checking the map, but there is a race with the map being cleared, and if there is an error in the finally block it won't be set.
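
For illustration only, here is a minimal sketch of one way to make a late completed() call tolerate a concurrent forceShutdown(). It is not taken from either attached patch and is not the Cassandra code: the class RepairSessionSketch, the addJob() helper, and the use of String job names are all hypothetical. The idea is to set the terminated flag before clearing the map, and to treat a missing map entry as "already shut down" instead of dereferencing it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a repair session; names are assumptions, not Cassandra's API.
final class RepairSessionSketch {
    private final Map<String, Object> activeJobs = new ConcurrentHashMap<>();
    private final AtomicBoolean terminated = new AtomicBoolean(false);

    // Register a job under a (hypothetical) string name.
    void addJob(String name) {
        activeJobs.put(name, new Object());
    }

    // Set the flag first, then clear, so a late callback sees the flag
    // no later than it sees the emptied map.
    void forceShutdown() {
        terminated.set(true);
        activeJobs.clear();
    }

    // Returns true only if the job was still tracked and is now finished.
    boolean completed(String name) {
        if (terminated.get()) {
            return false; // session already shut down; ignore the late callback
        }
        Object job = activeJobs.remove(name);
        // A null here means the map was cleared between the flag check and
        // the lookup; report "not completed" instead of throwing an NPE.
        return job != null;
    }
}
```

The null check is the part that matters for the window described above: even if the flag check passes just before the map is cleared, the lookup returns null and the callback exits quietly.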