[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099403#comment-14099403 ]
Jeremiah Jordan commented on CASSANDRA-7560:
--------------------------------------------

[~yukim] a cluster running with the patch hit the following error:
{noformat}
ERROR [RepairJobTask:1] 2014-08-15 20:16:46,807 RepairJob.java (line 117) Error while snapshot
java.lang.RuntimeException: Could not create snapshot at localhost-grid/10.96.100.22
        at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
        at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:344)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
{noformat}

And then the repair still hung. Should this patch have caused the repair to correctly error out in this case?

> 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Vladimir Avram
>            Assignee: Yuki Morishita
>             Fix For: 2.0.10
>
>         Attachments: 0001-backport-CASSANDRA-6747.patch,
> cassandra_daemon.log, cassandra_daemon_rep1.log, cassandra_daemon_rep2.log,
> nodetool_command.log
>
>
> Running {{nodetool repair -pr}} will sometimes hang on one of the resulting
> AntiEntropySessions.
> The system logs will show the repair command starting
> {noformat}
> INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) Starting repair command #1, repairing 256 ranges for keyspace x
> {noformat}
> You can then see a few AntiEntropySessions completing with:
> {noformat}
> INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed successfully
> {noformat}
> Finally we reach an AntiEntropySession at some point that hangs just before
> requesting the merkle trees for the next column family in line for repair. So
> we first see the previous CF being finished and the whole repair session
> hangs here with no visible progress or errors on this or any of the related
> nodes.
> {noformat}
> INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced
> {noformat}
> Notes:
> * Single DC 6 node cluster with an average load of 86 GB per node.
> * This appears to be random; it does not always happen on the same CF or on
> the same session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)