[ https://issues.apache.org/jira/browse/CASSANDRA-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486218#comment-17486218 ]
Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15155: ------------------------------------------------------- [~cowardlydragon] just curious, were able to narrow down this issue to see why it happened? > Bootstrapping node finishes 'successfully' before schema synced, data not > streamed > ---------------------------------------------------------------------------------- > > Key: CASSANDRA-15155 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15155 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Membership, Consistency/Bootstrap and > Decommission > Reporter: Constance Eustace > Priority: Normal > Attachments: debug.log.zip > > > Bootstrap of a new node to expand an existing cluster is completing the > bootstrapping "successfully", joining the cluster as UN in nodetool status, > but has no data and no active streams. Writes and reads start being served. > Environment: AWS, IPV6, three datacenters, asia / europe / us > Version: 2.2.13 > We have previously scaled the europe and us datacenters from 5 nodes to 25 > nodes (one node at a time) without incident. > Asia (tokyo) is a different story. We have seen multiple failure scenarios, > but the most troubling is a string of attempted node bootstrappings where the > bootstrap completes and the node joins the ring, but there is no data. > We were able to expand Asia by four nodes by increasing ring delay to 100 > seconds, but that has not worked anymore. > Attached Log: Our autoscaling + Ansible initial setup creates the node, but > the ansible has not run yet, so the autostarted cassandra fails to load, but > it has no security group yet so it did not communicate with any other node. > That is the 15:15:XX series log messages at the very top. > Then 15:20:XX series messages begin after ansible has completed setup of the > box, and the data dirs and commit log dirs have been scrubbed. > This same process ran for EU and US expansions without incident. > From what I can tell of the log (DEBUG was enabled): > Ring information collection begins, so some sort of gossip/cluster > communication is healthy: > INFO [main] 2019-06-12 15:20:05,468 StorageService.java:1142 - JOINING: > waiting for ring information > But almost all of those checks output: > DEBUG [GossipStage:1] 2019-06-12 15:20:07,673 MigrationManager.java:96 - Not > pulling schema because versions match or shouldPullSchemaFrom returned false > Which seems weird, as we shall see soon. After all the nodes have reported in > a similar way, most not pulling because of the above message, and a couple > that were interpreted as DOWN, it then does: > INFO [main] 2019-06-12 15:21:45,486 StorageService.java:1142 - JOINING: > schema complete, ready to bootstrap > INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: > waiting for pending range calculation > INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: > calculation complete, ready to bootstrap > We then get a huge number of > org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find > cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404 > INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: > Starting to bootstrap... > INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream > #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap > INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap > completed! for the tokens [-7314981925085449175, ...<tokens>... > 5499447097629838103] > Here are the log messages for MIgrationManager for schema gossiping: > DEBUG [main] 2019-06-12 15:20:05,423 MigrationManager.java:493 - Gossiping my > schema version 59adb24e-f3cd-3e02-97f0-5b395827453f > DEBUG [MigrationStage:1] 2019-06-12 15:23:25,694 MigrationManager.java:493 - > Gossiping my schema version 3d1a9d9e-1120-37ae-abe0-e064cd147a99 > DEBUG [MigrationStage:1] 2019-06-12 15:23:25,775 MigrationManager.java:493 - > Gossiping my schema version 0bf74f5a-4b39-3525-b217-e9ccf7a1b6cb > DEBUG [MigrationStage:1] 2019-06-12 15:23:25,905 MigrationManager.java:493 - > Gossiping my schema version b145475a-02dc-370c-8af7-a9aba2d61362 > DEBUG [InternalResponseStage:12] 2019-06-12 15:24:26,445 > MigrationManager.java:493 - Gossiping my schema version > 9c2ed14a-8db5-39b3-af48-6cdb5463c772 > the schema version ending in -6cdb5463c772 is the proper version in the other > nodes per gossipinfo. But as can be seen, the bootstrap completion message > (15:23:25,526) is logged before four or five intermediate schema versions are > created, which seem to be due to system_distributed and other keyspaces being > created. > The Bootstrap completed! message comes from a Future in StorageService line > 1199 for cassandra 2.2.13 source: > {code:java} > ListenableFuture<StreamState> bootstrapStream = > bootstrapper.bootstrap(streamStateStore, !replacing && useStrictConsistency); > // handles token update Futures.addCallback(bootstrapStream, new > FutureCallback<StreamState>() { @Override public void onSuccess(StreamState > streamState) { isBootstrapMode = false; logger.info("Bootstrap completed! for > the tokens {}", tokens); } > {code} > That future seems to complete a bit too quickly. Perhaps the schema did not > synchronize quickly enough and therefore a list of migration/bootstrap tasks > not created in time for the bootstrapper future to handle. We suspect the AWS > asia datacenter networking/WAN links are very slow, or IPV6 is introducing > other delays. > But the code should not allow a bootstrap to silently/falsely complete > "successfully". > Smells like a race condition. -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org