[ 
https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-6097:
----------------------------------------

    Priority: Trivial  (was: Major)
    
> nodetool repair randomly hangs.
> -------------------------------
>
>                 Key: CASSANDRA-6097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DataStax AMI
>            Reporter: J.B. Langston
>            Priority: Trivial
>
> nodetool repair randomly hangs. This is not the same issue where repair hangs 
> if a stream is disrupted. This can be reproduced on a single-node cluster 
> where no streaming takes place, so I think this may be a JMX connection or 
> timeout issue. Thread dumps show that nodetool is waiting on a JMX response 
> and there are no repair-related threads running in Cassandra. Nodetool main 
> thread waiting for JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at java.lang.Object.wait(Object.java:485)
>       at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>       - locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
>       at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
>       at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
>       at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
> When nodetool hangs, it does not print out the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
> However, Cassandra logs that repair in system.log:
> 1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
> This suggests that the repair command was received by Cassandra but the 
> connection then failed and nodetool didn't receive a response.
> Obviously, running repair on a single-node cluster is pointless but it's the 
> easiest way to demonstrate this problem. The customer who reported this has 
> also seen the issue on his real multi-node cluster.
> Steps to reproduce:
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3 
> (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the 
> same version, and subsequent attempts to reproduce it on the AMI were 
> unsuccessful. The customer says he is able to reliably reproduce it on 
> his Mac using DSE 3.1.3 and occasionally on his real cluster. 
> 1) Deploy an AMI using the DataStax AMI at 
> https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.
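
Because the hang in step 4 is random, the loop in step 3 can sit unattended for a long time. The sketch below is one possible way to make the loop stop by itself when a run appears stuck. It is not part of the original report: it assumes GNU coreutils `timeout` is installed (it is not on a stock Mac), and the names `REPAIR_CMD`, `LIMIT_SECONDS`, `MAX_RUNS` and `run_until_hang` are hypothetical. The 600-second limit is an arbitrary guess; a legitimately slow repair would also trip it.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not from the original report: rerun the repair command
# until one run exceeds a time limit, then stop so the hang can be inspected.
REPAIR_CMD=${REPAIR_CMD:-"nodetool repair -pr test"}  # command from step 3
LIMIT_SECONDS=${LIMIT_SECONDS:-600}                   # assumed limit, tune as needed
MAX_RUNS=${MAX_RUNS:-1000}                            # upper bound on attempts

run_until_hang() {
    local run=0 status
    while [ "$run" -lt "$MAX_RUNS" ]; do
        run=$((run + 1))
        status=0
        # REPAIR_CMD is left unquoted on purpose so it splits into words.
        # GNU timeout exits with status 124 when it had to stop the command.
        timeout "$LIMIT_SECONDS" $REPAIR_CMD || status=$?
        if [ "$status" -eq 124 ]; then
            echo "run $run exceeded ${LIMIT_SECONDS}s; treating it as a hang"
            return 0
        fi
    done
    echo "no hang observed in $MAX_RUNS runs"
    return 1
}
```

Calling `run_until_hang` after sourcing the file starts the loop. Note that `timeout` terminates the stuck nodetool process once the limit is reached, so a thread dump of nodetool like the one above would have to be taken before the limit expires, or the limit raised accordingly.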

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
