[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

Jeremiah Jordan (JIRA) Fri, 15 Aug 2014 17:40:53 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099403#comment-14099403
 ]


Jeremiah Jordan commented on CASSANDRA-7560:
--------------------------------------------

[~yukim] running with the patch had a cluster get the following error:

{noformat}
ERROR [RepairJobTask:1] 2014-08-15 20:16:46,807 RepairJob.java (line 117) Error 
while snapshot 
java.lang.RuntimeException: Could not create snapshot at 
localhost-grid/10.96.100.22 
at 
org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
 
at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:344) 
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
at java.util.concurrent.FutureTask.run(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
at java.lang.Thread.run(Unknown Source) 
{noformat}

And then the repair still hung.  Should this patch have caused the repair to 
correctly error out in this case?

> 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Vladimir Avram
>            Assignee: Yuki Morishita
>             Fix For: 2.0.10
>
>         Attachments: 0001-backport-CASSANDRA-6747.patch, 
> cassandra_daemon.log, cassandra_daemon_rep1.log, cassandra_daemon_rep2.log, 
> nodetool_command.log
>
>
> Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
> AntiEntropySessions.
> The system logs will show the repair command starting
> {noformat}
>  INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
> Starting repair command #1, repairing 256 ranges for keyspace x
> {noformat}
> You can then see a few AntiEntropySessions completing with:
> {noformat}
> INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
> 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
> successfully
> {noformat}
> Finally we reach an AntiEntropySession at some point that hangs just before 
> requesting the merkle trees for the next column family in line for repair. So 
> we first see the previous CF being finished and the whole repair sessions 
> hangs here with no visible progress or errors on this or any of the related 
> nodes.
> {noformat}
> INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 
> 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully 
> synced
> {noformat}
> Notes:
> * Single DC 6 node cluster with an average load of 86 GB per node.
> * This appears to be random; it does not always happen on the same CF or on 
> the same session.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession

Reply via email to