[ 
https://issues.apache.org/jira/browse/CASSANDRA-15155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17486218#comment-17486218
 ] 

Jai Bheemsen Rao Dhanwada commented on CASSANDRA-15155:
-------------------------------------------------------

[~cowardlydragon] just curious, were able to narrow down this issue to see why 
it happened?

> Bootstrapping node finishes 'successfully' before schema synced, data not 
> streamed
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15155
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15155
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Membership, Consistency/Bootstrap and 
> Decommission
>            Reporter: Constance Eustace
>            Priority: Normal
>         Attachments: debug.log.zip
>
>
> Bootstrap of a new node to expand an existing cluster is completing the 
> bootstrapping "successfully", joining the cluster as UN in nodetool status, 
> but has no data and no active streams. Writes and reads start being served. 
> Environment: AWS, IPV6, three datacenters, asia / europe / us
> Version: 2.2.13
> We have previously scaled the europe and us datacenters from 5 nodes to 25 
> nodes (one node at a time) without incident.
> Asia (tokyo) is a different story. We have seen multiple failure scenarios, 
> but the most troubling is a string of attempted node bootstrappings where the 
> bootstrap completes and the node joins the ring, but there is no data. 
> We were able to expand Asia by four nodes by increasing ring delay to 100 
> seconds, but that has not worked anymore. 
> Attached Log: Our autoscaling + Ansible initial setup creates the node, but 
> the ansible has not run yet, so the autostarted cassandra fails to load, but 
> it has no security group yet so it did not communicate with any other node. 
> That is the 15:15:XX series log messages at the very top.
> Then 15:20:XX series messages begin after ansible has completed setup of the 
> box, and the data dirs and commit log dirs have been scrubbed. 
> This same process ran for EU and US expansions without incident. 
> From what I can tell of the log (DEBUG was enabled):
> Ring information collection begins, so some sort of gossip/cluster 
> communication is healthy:
> INFO [main] 2019-06-12 15:20:05,468 StorageService.java:1142 - JOINING: 
> waiting for ring information
> But almost all of those checks output:
> DEBUG [GossipStage:1] 2019-06-12 15:20:07,673 MigrationManager.java:96 - Not 
> pulling schema because versions match or shouldPullSchemaFrom returned false
> Which seems weird, as we shall see soon. After all the nodes have reported in 
> a similar way, most not pulling because of the above message, and a couple 
> that were interpreted as DOWN, it then does:
> INFO [main] 2019-06-12 15:21:45,486 StorageService.java:1142 - JOINING: 
> schema complete, ready to bootstrap
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: 
> waiting for pending range calculation
> INFO [main] 2019-06-12 15:21:45,487 StorageService.java:1142 - JOINING: 
> calculation complete, ready to bootstrap
> We then get a huge number of
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find 
> cfId=dd5d7fa0-1e50-11e9-a62d-0d41c97b2404
> INFO [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING: 
> Starting to bootstrap...
> INFO [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream 
> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for Bootstrap
> INFO [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap 
> completed! for the tokens [-7314981925085449175, ...<tokens>... 
> 5499447097629838103]
> Here are the log messages for MIgrationManager for schema gossiping:
> DEBUG [main] 2019-06-12 15:20:05,423 MigrationManager.java:493 - Gossiping my 
> schema version 59adb24e-f3cd-3e02-97f0-5b395827453f
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,694 MigrationManager.java:493 - 
> Gossiping my schema version 3d1a9d9e-1120-37ae-abe0-e064cd147a99
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,775 MigrationManager.java:493 - 
> Gossiping my schema version 0bf74f5a-4b39-3525-b217-e9ccf7a1b6cb
> DEBUG [MigrationStage:1] 2019-06-12 15:23:25,905 MigrationManager.java:493 - 
> Gossiping my schema version b145475a-02dc-370c-8af7-a9aba2d61362
> DEBUG [InternalResponseStage:12] 2019-06-12 15:24:26,445 
> MigrationManager.java:493 - Gossiping my schema version 
> 9c2ed14a-8db5-39b3-af48-6cdb5463c772
> the schema version ending in -6cdb5463c772 is the proper version in the other 
> nodes per gossipinfo. But as can be seen, the bootstrap completion message 
> (15:23:25,526) is logged before four or five intermediate schema versions are 
> created, which seem to be due to system_distributed and other keyspaces being 
> created.
> The Bootstrap completed! message comes from a Future in StorageService line 
> 1199 for cassandra 2.2.13 source:
> {code:java}
> ListenableFuture<StreamState> bootstrapStream = 
> bootstrapper.bootstrap(streamStateStore, !replacing && useStrictConsistency); 
> // handles token update Futures.addCallback(bootstrapStream, new 
> FutureCallback<StreamState>() { @Override public void onSuccess(StreamState 
> streamState) { isBootstrapMode = false; logger.info("Bootstrap completed! for 
> the tokens {}", tokens); }
> {code}
> That future seems to complete a bit too quickly. Perhaps the schema did not 
> synchronize quickly enough and therefore a list of migration/bootstrap tasks 
> not created in time for the bootstrapper future to handle. We suspect the AWS 
> asia datacenter networking/WAN links are very slow, or IPV6 is introducing 
> other delays. 
> But the code should not allow a bootstrap to silently/falsely complete 
> "successfully". 
> Smells like a race condition. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to