[ https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis updated CASSANDRA-6097:
--------------------------------------
    Priority: Minor  (was: Trivial)

> nodetool repair randomly hangs.
> -------------------------------
>
>                 Key: CASSANDRA-6097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DataStax AMI
>            Reporter: J.B. Langston
>            Priority: Minor
>         Attachments: dse.stack, nodetool.stack
>
>
> nodetool repair randomly hangs. This is not the same issue where repair hangs
> if a stream is disrupted. This can be reproduced on a single-node cluster
> where no streaming takes place, so I think this may be a JMX connection or
> timeout issue. Thread dumps show that nodetool is waiting on a JMX response
> and there are no repair-related threads running in Cassandra.
>
> Nodetool main thread waiting for JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(Native Method)
>       - waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at java.lang.Object.wait(Object.java:485)
>       at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
>       - locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
>       at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
>       at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
>       at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
>       at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
>
> When nodetool hangs, it does not print the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
>
> However, Cassandra does log that repair in system.log:
> 1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
>
> This suggests that the repair command was received by Cassandra but the
> connection then failed and nodetool didn't receive a response.
>
> Obviously, running repair on a single-node cluster is pointless, but it's the
> easiest way to demonstrate this problem. The customer who reported this has
> also seen the issue on his real multi-node cluster.
>
> Steps to reproduce:
>
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3
> (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the
> same version, and subsequent attempts to reproduce it on the AMI were
> unsuccessful. The customer says he is able to reproduce it reliably on
> his Mac using DSE 3.1.3 and occasionally on his real cluster.
>
> 1) Launch an instance from the DataStax AMI at
> https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace:
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.
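
Because the hang is random, the loop in step 3 blocks forever on the first hung run, with nothing to mark which attempt failed. A minimal sketch of a variant that stops at the first hang is shown here. It is not part of the original report, and it rests on several assumptions: GNU coreutils `timeout` is installed (it exits with status 124 when the deadline expires), the helper names `run_with_deadline` and `repair_until_hang` are made up for illustration, and the 600-second default is an arbitrary guess at "longer than any healthy repair of an empty keyspace".

```shell
#!/usr/bin/env bash
# Sketch only: bound each repair run so a hang ends the loop instead of blocking it.
# Assumes GNU coreutils `timeout`; helper names and the 600 s default are illustrative.

# run_with_deadline SECONDS CMD [ARGS...]
# Returns CMD's exit status, or 124 if `timeout` had to terminate it.
run_with_deadline() {
    local secs=$1
    shift
    timeout "$secs" "$@"
}

# repair_until_hang KEYSPACE [SECONDS]
# Repeats the repair from step 3 until one run exceeds the deadline.
repair_until_hang() {
    local keyspace=$1 secs=${2:-600} attempt=0 status
    while true; do
        attempt=$((attempt + 1))
        run_with_deadline "$secs" nodetool repair -pr "$keyspace"
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "attempt $attempt: repair exceeded ${secs}s, treating as a hang" >&2
            return 124
        fi
        # 125-127 mean timeout/nodetool could not be run at all; stop rather than spin.
        if [ "$status" -ge 125 ] && [ "$status" -le 127 ]; then
            echo "attempt $attempt: could not run nodetool (status $status)" >&2
            return "$status"
        fi
    done
}
```

After sourcing the file, `repair_until_hang test` would mirror step 3. Note that `timeout` terminates the hung nodetool process, so a thread dump of nodetool itself (like the one quoted above) would have to be taken before the deadline fires; the Cassandra-side state and system.log can still be inspected afterwards.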