[ 
https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073333#comment-14073333
 ] 

Yuki Morishita commented on CASSANDRA-7560:
-------------------------------------------

> From the jstack logs, it looks like repair session on coordinator node is 
> waiting for validations (merkle trees), but none of the logs show 
> ValidationExecutor running.
By default, repair takes a snapshot before validating, so it is possible that 
snapshotting is taking longer on the replica node.

One possible 'hang' point is a snapshot timeout. The coordinator waits for the 
snapshot response for "rpc_timeout" milliseconds, and after that the response 
handler can be removed, in which case a snapshot response that arrives later 
has nothing left to receive it.
This is addressed in CASSANDRA-6747 and fixed for 2.1.0.
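
To make the timeout scenario concrete, here is a minimal Python sketch of the 
idea. It is not Cassandra code: CallbackRegistry, simulate and the message id 
"snapshot-1" are hypothetical names, and the clock is simulated with plain 
integers. It only assumes what is described above, namely that a handler is 
dropped once the timeout passes and that a later response then goes nowhere.

```python
class CallbackRegistry:
    """Hypothetical stand-in for a coordinator's table of pending response handlers."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self._pending = {}  # message id -> (deadline in ms, callback)

    def register(self, msg_id, callback, now_ms):
        # The handler is only kept until now_ms + timeout_ms.
        self._pending[msg_id] = (now_ms + self.timeout_ms, callback)

    def expire(self, now_ms):
        """Drop handlers whose deadline has passed and return their ids."""
        expired = [m for m, (deadline, _) in self._pending.items() if deadline <= now_ms]
        for m in expired:
            del self._pending[m]
        return expired

    def deliver(self, msg_id, now_ms):
        """Deliver a response; return False if no handler is left to receive it."""
        self.expire(now_ms)
        entry = self._pending.pop(msg_id, None)
        if entry is None:
            return False  # response arrived after the handler was removed
        entry[1]()
        return True


def simulate(timeout_ms, snapshot_duration_ms):
    """Return True if the session would move on to validation, False if it is left waiting."""
    registry = CallbackRegistry(timeout_ms)
    state = {"snapshot_acked": False}
    registry.register("snapshot-1", lambda: state.update(snapshot_acked=True), now_ms=0)
    # The replica answers only once its snapshot has finished.
    registry.deliver("snapshot-1", now_ms=snapshot_duration_ms)
    return state["snapshot_acked"]


if __name__ == "__main__":
    print(simulate(timeout_ms=10000, snapshot_duration_ms=4000))
    print(simulate(timeout_ms=10000, snapshot_duration_ms=15000))
    print(simulate(timeout_ms=30000, snapshot_duration_ms=15000))
```

In the second call the snapshot takes longer than the timeout, so the 
acknowledgement is dropped and nothing ever moves the session forward. The 
third call uses a longer timeout with the same snapshot duration and the 
acknowledgement gets through.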

You can try temporarily setting rpc_timeout to a longer value and see whether 
that solves the problem.

> 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-7560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Vladimir Avram
>         Attachments: cassandra_daemon.log, cassandra_daemon_rep1.log, 
> cassandra_daemon_rep2.log, nodetool_command.log
>
>
> Running {{nodetool repair -pr}} will sometimes hang on one of the resulting 
> AntiEntropySessions.
> The system logs will show the repair command starting
> {noformat}
>  INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) 
> Starting repair command #1, repairing 256 ranges for keyspace x
> {noformat}
> You can then see a few AntiEntropySessions completing with:
> {noformat}
> INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 
> 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed 
> successfully
> {noformat}
> Finally we reach an AntiEntropySession at some point that hangs just before 
> requesting the merkle trees for the next column family in line for repair. So 
> we first see the previous CF being finished and the whole repair sessions 
> hangs here with no visible progress or errors on this or any of the related 
> nodes.
> {noformat}
> INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 
> 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully 
> synced
> {noformat}
> Notes:
> * Single DC 6 node cluster with an average load of 86 GB per node.
> * This appears to be random; it does not always happen on the same CF or on 
> the same session.
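
As a rough aid for spotting the stuck session described above, the following 
Python sketch scans log lines for "[repair #<id>]" tags and lists the sessions 
that never logged "session completed successfully". The function name 
unfinished_sessions is made up for this example, and the sample lines are the 
ones quoted in this issue, joined back onto single lines. Note that a session 
that is still running normally will also be listed, so this only narrows down 
where to look.

```python
import re

# Matches the session tag seen in the quoted log lines, e.g. "[repair #eefb3c30-...]".
SESSION_RE = re.compile(r"\[repair #([0-9a-f-]+)\]")


def unfinished_sessions(lines):
    """Return repair session ids, in order of first appearance, with no completion line."""
    seen = []
    completed = set()
    for line in lines:
        match = SESSION_RE.search(line)
        if match is None:
            continue
        session_id = match.group(1)
        if session_id not in seen:
            seen.append(session_id)
        if "session completed successfully" in line:
            completed.add(session_id)
    return [s for s in seen if s not in completed]


SAMPLE = [
    " INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569) "
    "Starting repair command #1, repairing 256 ranges for keyspace x",
    " INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line 282) "
    "[repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed successfully",
    " INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line 221) "
    "[repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully synced",
]

if __name__ == "__main__":
    for session_id in unfinished_sessions(SAMPLE):
        print(session_id)
```

To run it against a real log, pass an open file object instead of SAMPLE, 
since the function accepts any iterable of lines.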



--
This message was sent by Atlassian JIRA
(v6.2#6252)
