I've recently been running into a variety of tricky-to-diagnose problems
that could be summarized as "bootstrap & related tasks fail without
extra hacky sleep time".
Here is a sample (edited) log file from bootstrapping a node that
captures the general dynamics:
http://pastebin.com/yeN9USLt
This build has been modified (from 1.2.10) to sleep 4*RING_DELAY in
StorageService.bootstrap(). A few notes:
* At 30s nodes are still flapping UP and DOWN
* Handshaking is still going strong at 90s
* Things do stabilize; they don't flap indefinitely
* Bootstrap succeeds once it starts. In this particular cluster, a
build with the default RING_DELAY (30s) fails every time.
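
For concreteness, the arithmetic behind the modified sleep can be sketched as below. This is a standalone illustration, not the actual patch to StorageService.bootstrap(). The class and method names are hypothetical, and the property name cassandra.ring_delay_ms is an assumption about how the delay can be overridden at startup; verify it against your own build before relying on it.

```java
// Standalone sketch; RingDelaySketch and its methods are hypothetical names.
final class RingDelaySketch {
    // Default from the post: a stock build uses a 30s RING_DELAY.
    static final int DEFAULT_RING_DELAY_MS = 30 * 1000;

    // Assumed name of the JVM system property that overrides the delay.
    static final String RING_DELAY_PROPERTY = "cassandra.ring_delay_ms";

    // Returns the override if the property is set, otherwise the default.
    static int ringDelayMs() {
        String override = System.getProperty(RING_DELAY_PROPERTY);
        if (override == null) {
            return DEFAULT_RING_DELAY_MS;
        }
        return Integer.parseInt(override.trim());
    }

    // Total time to sleep before bootstrapping, as a multiple of the delay.
    static long bootstrapSleepMs(int multiplier) {
        return (long) multiplier * ringDelayMs();
    }
}
```

With the default delay, bootstrapSleepMs(4) is 120000 ms (two minutes), which is long enough to outlast the handshaking still seen at 90s in the log above.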
Ping times, TCP retransmit, and other general network stuff look fine.
There are several tickets (some from me) that reference what look to me
like similar, or at least correlated, issues:
* CASSANDRA-4288 : prevent thrift server from starting before gossip
has settled
* CASSANDRA-5815 : NPE from migration manager
* CASSANDRA-5915 : node flapping prevents replace_node from succeeding
consistently
* CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node
- "unable to fetch range"
* CASSANDRA-6127 : vnodes don't scale to hundreds of nodes
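
The idea of waiting until gossip "has settled", as opposed to sleeping for a fixed multiple of RING_DELAY, can be sketched as a polling loop that waits for an observed value to stop changing. This is only an illustration of the general technique, not how any of the tickets above implement it. The class name, the method, and the liveEndpointCount supplier are all hypothetical; a real check would have to read whatever state the gossiper actually exposes.

```java
import java.util.function.IntSupplier;

// Standalone sketch; GossipSettleSketch and its parameters are hypothetical.
final class GossipSettleSketch {
    /**
     * Polls liveEndpointCount until it has returned the same value for
     * requiredStablePolls consecutive polls. Gives up after maxPolls polls.
     * Returns true if the value settled, false on timeout or interruption.
     */
    static boolean waitUntilSettled(IntSupplier liveEndpointCount,
                                    int requiredStablePolls,
                                    int maxPolls,
                                    long pollIntervalMs) {
        int last = liveEndpointCount.getAsInt();
        int stablePolls = 0;
        for (int i = 0; i < maxPolls; i++) {
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and report "not settled".
                Thread.currentThread().interrupt();
                return false;
            }
            int current = liveEndpointCount.getAsInt();
            if (current == last) {
                stablePolls++;
                if (stablePolls >= requiredStablePolls) {
                    return true;
                }
            } else {
                // The view changed (e.g. a node flapped): start counting again.
                stablePolls = 0;
                last = current;
            }
        }
        return false;
    }
}
```

The upper bound on polls matters: in a cluster that never stops flapping, an unbounded wait would hang startup, so the caller still needs to decide what to do on timeout.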
I suspect that a combination of factors is causing gossip to take longer
to stabilize:
* vnodes
* (cross country or greater) multi-dc
* bigger than a test cluster (> 50 nodes)
* reconnecting snitch
What are other people seeing in their clusters? Does anyone routinely
change RING_DELAY (Google finds precious few references)?