gossip settling and bootstrap problems

Chris Burroughs Mon, 07 Oct 2013 17:45:46 -0700

I've been running into a variety of tricky to diagnose problems recently that could be summarized as "bootstrap & related tasks fail without extra hacky sleep time".

This is a sample edited log file for bootstrapping a node that captures the general dynamics: http://pastebin.com/yeN9USLt This build has been modified (from 1.2.10) to sleep 4*RING_DELAY in StorageService.bootstrap(). A few notes:

 * At 30s nodes are still flapping UP and DOWN
 * handshaking is still going strong at 90s
 * Things do stabilize; they don't flap indefinitely
 * Bootstrap succeeds once it starts. In this particular cluster a default RING_DELAY/build (30s) fails every time.
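
Since things do eventually stabilize, one alternative to a fixed sleep would be to wait until the view of the ring stops changing. Below is a minimal sketch of that idea, in Python for brevity. It is not Cassandra code and not what any of the tickets implement: `wait_for_settled` and `poll_view` are hypothetical names, and `poll_view` is assumed to return something comparable, such as the set of endpoints currently seen as UP.

```python
import time


def wait_for_settled(poll_view, interval_s=1.0, required_stable_polls=3,
                     timeout_s=120.0, sleep=time.sleep, clock=time.monotonic):
    """Wait until poll_view() returns the same value on
    required_stable_polls consecutive polls.

    Returns True once the view has settled, False if timeout_s elapses first.
    poll_view is a hypothetical callback supplied by the caller.
    """
    deadline = clock() + timeout_s
    last_view = poll_view()
    stable = 1  # the first poll counts as one observation of last_view
    while stable < required_stable_polls:
        if clock() >= deadline:
            return False  # still flapping when we ran out of time
        sleep(interval_s)
        view = poll_view()
        if view == last_view:
            stable += 1
        else:
            # The view changed (a node flapped), so start counting again.
            last_view = view
            stable = 1
    return True
```

The timeout keeps the wait bounded (here 120s, matching 4 * 30s) in case nodes never stop flapping, so the worst case is no worse than the fixed sleep.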

Ping times, TCP retransmit, and other general network stuff look fine. There are several different tickets (some from me) that reference what seemed to me to be possibly similar or at least correlated issues:

 * CASSANDRA-4288 : prevent thrift server from starting before gossip has settled
 * CASSANDRA-5815 : NPE from migration manager
 * CASSANDRA-5915 : node flapping prevents replace_node from succeeding consistently
 * CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node - "unable to fetch range"
 * CASSANDRA-6127 : vnodes don't scale to hundreds of nodes

I suspect that a combination of factors is causing gossip to take longer to stabilize:

 * vnodes
 * (cross country or greater) multi-dc
 * bigger than a test cluster (> 50 nodes)
 * reconnecting snitch

What are other people seeing in their clusters? Does anyone routinely change RING_DELAY (Google finds precious few references)?
