We're testing expanding a 4-node cluster into an 8-node cluster, and we
keep running into issues with the repair process near the end.

We're bringing the new nodes into the cluster one by one, moving tokens to
rebalance for an 8-node configuration, running nodetool cleanup on each node
after every token move, and then increasing the replication factor to 5. This
all works without issue, and the cluster appears healthy in the 8-node
configuration with a replication factor of 5.
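(For reference, we compute the target tokens as the standard evenly spaced
layout. A minimal sketch, assuming the default RandomPartitioner, whose token
space runs from 0 to 2**127; the node count of 8 is ours, and these are the
values we pass to `nodetool move`:)

```python
def balanced_tokens(n):
    """Evenly spaced initial tokens for an n-node ring under
    RandomPartitioner (token space 0 .. 2**127). Node i gets
    token i * 2**127 / n."""
    return [i * (2 ** 127) // n for i in range(n)]

for i, t in enumerate(balanced_tokens(8)):
    print(f"node {i}: {t}")
```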

However, when we then run nodetool repair on the nodes, it will at some
point stall, even when being run on one of the new nodes.

It doesn't appear to stall while it's performing a compaction or
transferring CF data. We've monitored compactionstats and netstats closely,
and things always stall just after a repair command is started, i.e.:

[2013-10-02 23:19:39,254] Starting repair command #9, repairing 5 ranges
for keyspace ourkeyspace

The last message from AntiEntropyService is usually something to the effect
of:

<190>Oct  3 00:01:02 myhost.com 1970947950 [AntiEntropySessions:24] INFO
 org.apache.cassandra.service.AntiEntropyService  - [repair
#9b17d310-2bbd-11e3-0000-e06ec6c436ff] session completed successfully

... and then the next repair session never starts. There's nothing in the
logs that looks related.

Where this occurs appears arbitrary. If I run repair on individual CFs
within ourkeyspace, some succeed and some fail, but if we start over and do
the 4-node to 8-node expansion again, things fail at a different place.

Any advice on what to look at next?

Thanks,

Dave
