[ 
https://issues.apache.org/jira/browse/CASSANDRA-3548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161107#comment-13161107
 ] 

Aaron Morton commented on CASSANDRA-3548:
-----------------------------------------

Much nicer. I'm not able to test in place on the cluster but +1.

thanks.
                
> NPE in AntiEntropyService$RepairSession.completed()
> ---------------------------------------------------
>
>                 Key: CASSANDRA-3548
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3548
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.1
>         Environment: Free BSD 8.2, JVM vendor/version: OpenJDK 64-Bit Server 
> VM/1.6.0
>            Reporter: Aaron Morton
>            Assignee: Aaron Morton
>            Priority: Minor
>         Attachments: 0001-3548.patch, 3548-v2.patch
>
>
> This may be related to CASSANDRA-3519 (the cluster it was observed on is 
> still on 1.0.1); however, I think there is still a race condition.
> Observed on a 2 DC cluster, during a repair that spanned the DCs.
> {noformat}
> INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java 
> (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding 
> streaming repair of 8602 
> ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
> ...
>  INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java 
> (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task 
> succeeded
> ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 
> AbstractCassandraDaemon.java (line 133) Fatal exception in thread 
> Thread[AntiEntropyStage:66,5,main]
> java.lang.NullPointerException
>         at 
> org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
>         at 
> org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
>         at 
> org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
>         at 
> org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
>         at 
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> {noformat}
> One of the nodes involved in a repair session failed. For example (I'm not 
> sure this is from the same repair session as the streaming task above, but 
> it illustrates the issue):
> {noformat}
> ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java 
> (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed 
> with the following error
> java.io.IOException: Endpoint /10.29.60.10 died
>         at 
> org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
>         at 
> org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
>         at 
> org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at 
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,507 StreamOutSession.java (line 
> 232) StreamOutSession /10.29.60.10 failed because {} died or was 
> restarted/removed
> ERROR [GossipTasks:1] 2011-11-28 19:39:52,571 Gossiper.java (line 172) Gossip 
> error
> java.util.ConcurrentModificationException
>         at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:782)
>         at java.util.ArrayList$Itr.next(ArrayList.java:754)
>         at 
> org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:190)
>         at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
>         at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
>         at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at 
> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:679)
> {noformat}
> When a node is marked as failed, 
> AntiEntropyService.RepairSession.forceShutdown() clears the activejobs map. 
> But the jobs to other nodes will continue, and will eventually call 
> completed(). 
> RepairSession.terminated should stop completed() from checking the map, but 
> there is a race between the map being cleared and terminated being set, and 
> if there is an error in the finally block it won't be set.
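
The race described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class RepairSessionSketch, its String-keyed job map, and the Runnable callbacks are hypothetical stand-ins, not the actual AntiEntropyService code and not the attached patch. It shows one defensive pattern: set the flag before clearing the map, and have completed() tolerate a missing entry instead of dereferencing it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the repair session described in the report;
// not the actual AntiEntropyService code and not the attached patch.
class RepairSessionSketch {
    private final Map<String, Runnable> activeJobs = new ConcurrentHashMap<>();
    private volatile boolean terminated = false;

    void addJob(String name, Runnable onCompleted) {
        activeJobs.put(name, onCompleted);
    }

    // Called when a participating node is convicted as dead.
    void forceShutdown() {
        terminated = true;   // publish the flag before touching the map
        activeJobs.clear();
    }

    // Called when a job finishes, possibly after forceShutdown() has run.
    // Returns true only if the job was still registered and its callback ran.
    boolean completed(String name) {
        if (terminated)
            return false;
        Runnable job = activeJobs.remove(name);
        if (job == null)     // e.g. map cleared between the flag check and remove()
            return false;
        job.run();
        return true;
    }

    public static void main(String[] args) {
        RepairSessionSketch session = new RepairSessionSketch();
        session.addJob("cf1", () -> {});
        session.addJob("cf2", () -> {});
        System.out.println(session.completed("cf1"));
        session.forceShutdown();
        System.out.println(session.completed("cf2"));
    }
}
```

The null check is what actually prevents the NPE here; the flag only avoids needless work. Relying on the flag alone leaves a window in which a late completed() call sees an already-cleared map.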
