[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289326#comment-14289326 ]

Yuki Morishita commented on CASSANDRA-7560:
---

For 2.0, yes, I believe.

> 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
> --
>
>                 Key: CASSANDRA-7560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Vladimir Avram
>            Assignee: Yuki Morishita
>             Fix For: 2.0.10
>
>         Attachments: 0001-backport-CASSANDRA-6747.patch,
> 0001-partial-backport-3569.patch, cassandra_daemon.log,
> cassandra_daemon_rep1.log, cassandra_daemon_rep2.log, nodetool_command.log
>
>
> Running {{nodetool repair -pr}} will sometimes hang on one of the resulting
> AntiEntropySessions.
> The system logs will show the repair command starting
> {noformat}
> INFO [Thread-3079] 2014-07-15 02:22:56,514 StorageService.java (line 2569)
> Starting repair command #1, repairing 256 ranges for keyspace x
> {noformat}
> You can then see a few AntiEntropySessions completing with:
> {noformat}
> INFO [AntiEntropySessions:2] 2014-07-15 02:28:12,766 RepairSession.java (line
> 282) [repair #eefb3c30-0bc6-11e4-83f7-a378978d0c49] session completed
> successfully
> {noformat}
> Finally we reach an AntiEntropySession at some point that hangs just before
> requesting the merkle trees for the next column family in line for repair. So
> we first see the previous CF being finished and the whole repair sessions
> hangs here with no visible progress or errors on this or any of the related
> nodes.
> {noformat}
> INFO [AntiEntropyStage:1] 2014-07-15 02:38:20,325 RepairSession.java (line
> 221) [repair #8f85c1b0-0bc8-11e4-83f7-a378978d0c49] previous_cf is fully
> synced
> {noformat}
> Notes:
> * Single DC 6 node cluster with an average load of 86 GB per node.
> * This appears to be random; it does not always happen on the same CF or on
> the same session.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289291#comment-14289291 ]

Nick Bailey commented on CASSANDRA-7560:

Is this a bug only for snapshot repair?

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100945#comment-14100945 ]

Joshua McKenzie commented on CASSANDRA-7560:

+1 on the 3569 backport.

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100820#comment-14100820 ]

Yuki Morishita commented on CASSANDRA-7560:
---

[~jjordan] right.

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100731#comment-14100731 ]

Jeremiah Jordan commented on CASSANDRA-7560:

[~yukim] Ah, so if you run with -par you can avoid that problem?

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100653#comment-14100653 ]

Yuki Morishita commented on CASSANDRA-7560:
---

[~jjordan] Looks like I need to backport one more patch from 2.1.0 to prevent the hang: CASSANDRA-3569.
It introduces RepairJobEventListener to handle snapshot failure.
I am not sure yet whether I can backport it to 2.0.

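To illustrate why a listener for snapshot failure matters here, the following is a minimal sketch and not the code from CASSANDRA-3569. Every name in it (SnapshotFailureSketch, JobFailureListener, Session, the 10.0.0.1 endpoint) is hypothetical. It only shows the general idea: if a snapshot failure is merely logged, whoever is waiting on the session never finds out; if the failure is forwarded to the session, the wait ends with an error.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical names throughout; this is NOT the RepairJobEventListener from CASSANDRA-3569. */
class SnapshotFailureSketch {

    /** Assumed callback shape: invoked when a replica could not create its snapshot. */
    interface JobFailureListener {
        void failedSnapshot(String endpoint);
    }

    /** A stand-in for a repair session that somebody is blocked on. */
    static class Session implements JobFailureListener {
        private final CountDownLatch finished = new CountDownLatch(1);
        private final AtomicReference<String> error = new AtomicReference<>();

        @Override
        public void failedSnapshot(String endpoint) {
            // Record the failure and release the waiter.
            error.set("Could not create snapshot at " + endpoint);
            finished.countDown();
        }

        /** Waits a bounded time so the sketch itself cannot hang. */
        String await(long millis) {
            try {
                if (!finished.await(millis, TimeUnit.MILLISECONDS)) {
                    return "still waiting";
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "interrupted";
            }
            return error.get() == null ? "completed" : "failed: " + error.get();
        }
    }

    /** Simulates one snapshot failure, either forwarded to the session or only logged. */
    static String run(boolean notifySession) {
        Session session = new Session();
        JobFailureListener listener = notifySession
                ? session
                : endpoint -> System.err.println("Error while snapshot at " + endpoint);
        listener.failedSnapshot("10.0.0.1"); // made-up endpoint
        return session.await(200);
    }

    public static void main(String[] args) {
        System.out.println("logged only:       " + run(false));
        System.out.println("session notified:  " + run(true));
    }
}
```

In the "logged only" case the session is never released, which is the shape of the hang discussed in this thread; the real patch may differ in every detail.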
[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099403#comment-14099403 ]

Jeremiah Jordan commented on CASSANDRA-7560:

[~yukim] Running with the patch, a cluster got the following error:
{noformat}
ERROR [RepairJobTask:1] 2014-08-15 20:16:46,807 RepairJob.java (line 117) Error while snapshot
java.lang.RuntimeException: Could not create snapshot at localhost-grid/10.96.100.22
    at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:81)
    at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:344)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
{noformat}
And then the repair still hung. Should this patch have caused the repair to correctly error out in this case?

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074685#comment-14074685 ]

Vladimir Avram commented on CASSANDRA-7560:
---

Thanks, I will try the patch out on 2.0.7

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14074314#comment-14074314 ]

Marcus Eriksson commented on CASSANDRA-7560:

In general, LGTM. Nit: it would be nice to have some javadoc on the failureCallback param to sendRR(..) and on sendRRWithFailure(..).
Btw, I think "MessageIn.isFailureCallback()" is a bit confusing; would it make sense to rename that to something like "doCallbackOnFailure()"?

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1407#comment-1407 ]

Yuki Morishita commented on CASSANDRA-7560:
---

From the jstack logs, it looks like the repair session on the coordinator node is waiting for validations (merkle trees), but none of the logs show ValidationExecutor running.
By default, repair takes a snapshot before validating, so it is possible that snapshotting is taking longer on a replica node.
One possible 'hang' point is the snapshot timeout. The coordinator waits for the snapshot response for "rpc_timeout" milliseconds, and after that the response handler can be removed. This is addressed in CASSANDRA-6747 and fixed for 2.1.0.
You can try temporarily setting rpc_timeout longer and see if that solves the problem.

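The timeout behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not Cassandra's MessagingService: the class ExpiringCallbacks, its Callback interface, and the 50 ms timeout (standing in for "rpc_timeout") are all made up for the example. It contrasts a handler that is silently dropped when it expires with one that is told about the expiry.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; not Cassandra's MessagingService. */
class ExpiringCallbacks {

    /** Assumed callback shape for the sketch. */
    interface Callback {
        void onResponse(String reply);
        void onFailure(String reason);
    }

    private final Map<Integer, Callback> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final boolean notifyOnExpiry;

    ExpiringCallbacks(boolean notifyOnExpiry) {
        this.notifyOnExpiry = notifyOnExpiry;
    }

    /** Registers a response handler that is removed once the timeout elapses. */
    void register(int id, Callback cb, long timeoutMillis) {
        pending.put(id, cb);
        timer.schedule(() -> {
            Callback expired = pending.remove(id);
            // Without this notification the waiter is never told that nobody is listening anymore.
            if (expired != null && notifyOnExpiry) {
                expired.onFailure("timed out after " + timeoutMillis + " ms");
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns false when the handler has already expired, i.e. a late reply is dropped. */
    boolean deliver(int id, String reply) {
        Callback cb = pending.remove(id);
        if (cb == null) {
            return false;
        }
        cb.onResponse(reply);
        return true;
    }

    void shutdown() {
        timer.shutdownNow();
    }

    /** Requests a "snapshot" whose reply never arrives and reports how the waiter ended up. */
    static String simulate(boolean notifyOnExpiry) {
        ExpiringCallbacks callbacks = new ExpiringCallbacks(notifyOnExpiry);
        CompletableFuture<String> snapshotDone = new CompletableFuture<>();
        callbacks.register(1, new Callback() {
            @Override
            public void onResponse(String reply) {
                snapshotDone.complete(reply);
            }

            @Override
            public void onFailure(String reason) {
                snapshotDone.completeExceptionally(new RuntimeException(reason));
            }
        }, 50);
        try {
            // A bounded wait keeps the sketch from hanging; an unbounded get() would block forever.
            snapshotDone.get(500, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed";
        } catch (TimeoutException e) {
            return "still waiting";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            callbacks.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("silent expiry:     " + simulate(false));
        System.out.println("failure on expiry: " + simulate(true));
    }
}
```

With silent expiry the waiter is left blocked even though the handler is gone, which matches the hang point suggested in the comment; raising the timeout only makes that window less likely, while a failure notification lets the caller error out.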
[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067331#comment-14067331 ]

Yuki Morishita commented on CASSANDRA-7560:
---

Thanks. Can you also attach jstack(s) from the replicas?
I want to check whether validation compaction is still running.

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14067194#comment-14067194 ]

Vladimir Avram commented on CASSANDRA-7560:
---

Here is what 'nodetool tpstats' looks like
{noformat}
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         2         2      125806824         0                 0
RequestResponseStage              0         0      355784492         0                 0
MutationStage                    32       766      333060443         0                 0
ReadRepairStage                   0         0        4972365         0                 0
ReplicateOnWriteStage             0         0       47863116         0                 0
GossipStage                       0         0        1110849         0                 0
AntiEntropyStage                  0         0           2384         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0          31508         0                 0
MemtablePostFlusher               0         0          21543         0                 0
FlushWriter                       0         0          20196         0                10
MiscStage                         0         0           1049         0                 0
PendingRangeCalculator            0         0              6         0                 0
commitlog_archiver                0         0              0         0                 0
AntiEntropySessions               1         1              3         0                 0
InternalResponseStage             0         0            146         0                 0
HintedHandoff                     0         0            150         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ_REPAIR                  6
PAGED_RANGE                  0
BINARY                       0
READ                        62
MUTATION                  2377
_TRACE                       0
REQUEST_RESPONSE            46
COUNTER_MUTATION           347
{noformat}
netstats
{noformat}
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 4395496
Mismatch (Blocking): 49764
Mismatch (Background): 3505
Pool Name                    Active   Pending      Completed
Commands                        n/a         0      355976985
Responses                       n/a         0      407590806
{noformat}

[jira] [Commented] (CASSANDRA-7560) 'nodetool repair -pr' leads to indefinitely hanging AntiEntropySession
[ https://issues.apache.org/jira/browse/CASSANDRA-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066996#comment-14066996 ]

Yuki Morishita commented on CASSANDRA-7560:
---

What does your nodetool netstats/tpstats output look like when the repair is hanging?
Can you provide a jstack from the repairing nodes?