I've recently been running into a variety of tricky-to-diagnose problems that could be summarized as "bootstrap & related tasks fail without extra hacky sleep time".

Here is a sample (edited) log file from bootstrapping a node that captures the general dynamics: http://pastebin.com/yeN9USLt . This build has been modified (from 1.2.10) to sleep 4*RING_DELAY in StorageService.bootstrap(). A few notes:
 * At 30s nodes are still flapping UP and DOWN
 * handshaking is still going strong at 90s
 * Things do stabilize; they don't flap indefinitely
 * Bootstrap succeeds once it starts. In this particular cluster, a stock build with the default RING_DELAY (30s) fails every time.
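
For anyone who wants to picture the arithmetic behind the hack, here is a minimal sketch, not the actual patch. It assumes the stock 30s default and the cassandra.ring_delay_ms system property that, as far as I know, StorageService reads to override RING_DELAY; the class and method names are made up for illustration.

```java
// Illustrative only: class and method names are hypothetical, not Cassandra code.
class RingDelaySketch {
    // Stock default RING_DELAY: 30 seconds.
    static final int DEFAULT_RING_DELAY_MS = 30 * 1000;

    // Read the delay from -Dcassandra.ring_delay_ms if set, else use the default.
    static int ringDelayMillis() {
        String override = System.getProperty("cassandra.ring_delay_ms");
        return override != null ? Integer.parseInt(override) : DEFAULT_RING_DELAY_MS;
    }

    // The modified build sleeps a multiple of RING_DELAY before bootstrapping.
    static long bootstrapSleepMillis(int ringDelayMs, int multiplier) {
        return (long) ringDelayMs * multiplier;
    }

    public static void main(String[] args) {
        int ringDelay = ringDelayMillis();
        System.out.println("RING_DELAY = " + ringDelay + " ms, 4x sleep = "
                + bootstrapSleepMillis(ringDelay, 4) + " ms");
    }
}
```

With the default, 4*RING_DELAY works out to 120s, which lines up with handshaking still being busy at 90s in the log above.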

Ping times, TCP retransmit, and other general network stuff look fine. There are several different tickets (some from me) that reference what seemed to me to be possibly similar or at least correlated issues:
 * CASSANDRA-4288 : prevent thrift server from starting before gossip has settled
 * CASSANDRA-5815 : NPE from migration manager
 * CASSANDRA-5915 : node flapping prevents replace_node from succeeding consistently
 * CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node - "unable to fetch range"
 * CASSANDRA-6127 : vnodes don't scale to hundreds of nodes
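
One alternative to a fixed sleep, in the spirit of "wait until gossip has settled" from CASSANDRA-4288, would be to poll until the view of the ring stops changing. The sketch below is only my assumption of what that could look like, not the fix from any of these tickets: liveEndpoints is a hypothetical callback returning the number of endpoints currently seen as UP, and the thresholds are arbitrary.

```java
import java.util.function.IntSupplier;

// Illustrative only: not Cassandra code; liveEndpoints is a hypothetical hook.
class GossipSettleSketch {
    // Returns true once the live-endpoint count has been unchanged for
    // requiredStablePolls consecutive polls; false if maxPolls is exhausted
    // (or the thread is interrupted) before that happens.
    static boolean waitForStableCount(IntSupplier liveEndpoints,
                                      int requiredStablePolls,
                                      int maxPolls,
                                      long pollIntervalMs) {
        int last = liveEndpoints.getAsInt();
        int stable = 0;
        for (int poll = 0; poll < maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            int current = liveEndpoints.getAsInt();
            if (current == last) {
                stable++;
                if (stable >= requiredStablePolls) {
                    return true;
                }
            } else {
                // The view changed (e.g. a node flapped): start counting again.
                stable = 0;
                last = current;
            }
        }
        return false;
    }
}
```

A count alone would miss one node going DOWN while another comes UP in the same interval, so a real check would probably want to compare the endpoint set itself.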

I suspect that a combination of factors is causing gossip to take longer to stabilize:
 * vnodes
 * (cross country or greater) multi-dc
 * bigger than a test cluster (> 50 nodes)
 * reconnecting snitch

What are other people seeing in their clusters? Does anyone routinely change RING_DELAY (Google finds precious few references)?
