Looking at the output of "nodetool netstats", I see that the bootstrapping node is pulling from only two of the nine nodes currently in the data center. That surprises me: I'd think the vnodes it pulls from would be randomly spread across the existing nodes. We're using Cassandra 2.0.11 with 256 vnodes per node.
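To make that expectation concrete, here is a rough Python sketch. It is a toy model, not Cassandra code: it assumes random 64-bit tokens, looks only at primary range ownership, and ignores replication and however Cassandra actually chooses its streaming sources. The function name simulate_sources and its parameters are made up for illustration.

```python
import bisect
import random


def simulate_sources(existing_nodes=9, vnodes=256, seed=0):
    """Return the set of existing nodes a new node would take ranges from.

    Toy model (an assumption, not Cassandra's real logic): every token is
    random, and a token owns the range (previous token, token].
    """
    rng = random.Random(seed)

    # Build the ring: each existing node gets `vnodes` random tokens.
    ring = sorted(
        (rng.getrandbits(64), node)
        for node in range(existing_nodes)
        for _ in range(vnodes)
    )
    tokens = [token for token, _ in ring]

    sources = set()
    for _ in range(vnodes):
        new_token = rng.getrandbits(64)
        # The new token splits the range owned by the next token clockwise,
        # so that token's node is the one giving up data. Wrap around the
        # ring when the new token is larger than every existing token.
        idx = bisect.bisect_left(tokens, new_token) % len(tokens)
        sources.add(ring[idx][1])
    return sources


if __name__ == "__main__":
    sources = simulate_sources()
    print(f"new node would take ranges from {len(sources)} of 9 nodes")
```

Under these simplified assumptions the new node's 256 tokens land in ranges belonging to essentially every existing node, which is why seeing only two sources in netstats looks odd to me.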
I also notice that while bootstrapping, the node is quite busy doing compactions. There are over 1000 pending compactions on the new node, and it hasn't finished bootstrapping yet. I'd think those would be unnecessary, since the other nodes in the data center have zero pending compactions. Perhaps the compactions explain why running "du -hs /var/lib/cassandra/data" on the new node shows more disk space usage than on the old nodes. Is it reasonable to run "nodetool disableautocompaction" on the bootstrapping node? Should that be the default?

If I start bootstrapping one node, it's not yet in the cluster, but it decides which token ranges it owns and requests streams for that data. If I then try to bootstrap a SECOND node concurrently, it will take over ownership of some token ranges from the first node. Will the first node then adjust what data it streams? It seems to me the Cassandra server needs to keep track of both the OLD token ranges and vnodes and the NEW ones. I'm not convinced that running two bootstraps concurrently (starting the second one after a delay of several minutes) is safe.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866
C: (206) 819-5965
F: (646) 443-2333
dona...@audiencescience.com