[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163023#comment-17163023 ]
Blake Eggleston commented on CASSANDRA-15158:
---------------------------------------------

I've reworked this a bit more [here|https://github.com/bdeggleston/cassandra/tree/15158-coordinator]. It's now pretty self-contained, has some fairly granular unit tests, and fixes a few functional things. Can you take a look, [~stefan.miklosovic], and let me know what you think? [~aleksey], can you also take a look / review?

The basic idea is that we now track which schema versions exist in gossip and which endpoints are reporting them, then block bootstrap until we've received a schema for each version. It also tracks how many outstanding migration requests we have per version, so we don't send out thousands.

Also, what are your opinions on this going into 3.x? I'd lean towards putting it in, since it eliminates a scenario where data loss could occur and will shave a few hours off adding/replacing nodes in large clusters. On the other hand, since it reduces the number of migration requests sent out on bootstrap, any bugs in determining whether we've received sufficient schema data could make data loss _more_ likely.

> Wait for schema agreement rather than in flight schema requests when bootstrapping
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Schema
>            Reporter: Vincent White
>            Assignee: Ben Bromhead
>            Priority: Normal
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, when a node is bootstrapping we use a set of latches
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of
> in-flight schema pull requests, and we don't proceed with
> bootstrapping/streaming until all the latches are released (or we time out
> waiting for each one).
> One issue with this is that if we have a large schema, or the retrieval of
> the schema from the other nodes is unexpectedly slow, then we have no
> explicit check in place to ensure we have actually received a schema before
> we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the
> node to wait longer on each latch, there are cases where this doesn't help,
> because the callbacks for the schema pull requests have expired off the
> messaging service's callback map
> (org.apache.cassandra.net.MessagingService#callbacks) after
> request_timeout_in_ms (default 10 seconds), before the other nodes were able
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the
> rest of the live nodes before proceeding with bootstrapping. It also adds a
> check to prevent the new node from flooding existing nodes with simultaneous
> schema pull requests, as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters
> from getting stuck for extended amounts of time as they wait
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the
> timed-out callbacks.
>
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
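The sketch below shows, under stated assumptions, the bookkeeping described in the comment above. It tracks which schema versions gossip reports and which endpoints report each one. It caps outstanding pull requests per version and only allows bootstrap once a schema has been received for every reported version. The names `SchemaVersionTracker`, `reportVersion`, `endpointsFor`, `tryStartRequest`, `finishRequest` and `readyToBootstrap` are hypothetical. They are not the API in the linked branch or in Cassandra itself. Endpoints are plain strings rather than Cassandra's address types, and the per-version cap is an assumed constructor parameter.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch, not the code in the linked branch: bookkeeping a
// bootstrapping node could use to decide when it has enough schema data.
final class SchemaVersionTracker {
    // Assumed cap on concurrent schema pull requests per version, so a large
    // cluster does not trigger thousands of requests for the same schema.
    private final int maxInflightPerVersion;
    // Schema version -> endpoints currently advertising it in gossip.
    private final Map<UUID, Set<String>> endpointsByVersion = new HashMap<>();
    // Schema version -> number of outstanding pull requests for it.
    private final Map<UUID, Integer> inflight = new HashMap<>();
    // Versions for which a schema has actually been received.
    private final Set<UUID> received = new HashSet<>();

    SchemaVersionTracker(int maxInflightPerVersion) {
        this.maxInflightPerVersion = maxInflightPerVersion;
    }

    // Record the version an endpoint reports, dropping its previous version.
    public synchronized void reportVersion(String endpoint, UUID version) {
        for (Set<String> endpoints : endpointsByVersion.values()) {
            endpoints.remove(endpoint);
        }
        endpointsByVersion.computeIfAbsent(version, v -> new HashSet<>()).add(endpoint);
        // Forget versions that no endpoint reports any more.
        endpointsByVersion.values().removeIf(Set::isEmpty);
    }

    // Endpoints a pull request for this version could be sent to.
    public synchronized Set<String> endpointsFor(UUID version) {
        return Collections.unmodifiableSet(
                new HashSet<>(endpointsByVersion.getOrDefault(version, Collections.emptySet())));
    }

    // Returns true, and counts the request, if another pull for this version
    // may be sent now; false if it was already received or the cap is reached.
    public synchronized boolean tryStartRequest(UUID version) {
        if (received.contains(version)) {
            return false;
        }
        int outstanding = inflight.getOrDefault(version, 0);
        if (outstanding >= maxInflightPerVersion) {
            return false;
        }
        inflight.put(version, outstanding + 1);
        return true;
    }

    // Called when a pull completes or times out. A timed-out request just
    // frees its slot; nothing is left behind for bootstrap to wait on.
    public synchronized void finishRequest(UUID version, boolean schemaReceived) {
        inflight.computeIfPresent(version, (v, n) -> n > 1 ? n - 1 : null);
        if (schemaReceived) {
            received.add(version);
        }
    }

    // Bootstrap may proceed only once a schema has been received for every
    // version currently reported in gossip.
    public synchronized boolean readyToBootstrap() {
        return !endpointsByVersion.isEmpty()
                && received.containsAll(endpointsByVersion.keySet());
    }
}
```

In this shape, a request whose callback expires (for example after `request_timeout_in_ms`) is reported through `finishRequest(version, false)`. That frees a slot for a retry, so the node is never left waiting out `migration_task_wait_in_seconds` on an orphaned latch. The readiness check is driven by received schema data rather than by counting in-flight requests. It also becomes false again if gossip starts reporting a version that has not been fetched yet.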