[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216816#comment-17216816 ]
Aleksey Yeschenko edited comment on CASSANDRA-15158 at 10/30/20, 4:07 PM:
--------------------------------------------------------------------------

Left a small comment on the 3.0 branch. Also, the following nits for {{MigrationCoordinator}}:
1. A bunch of unused imports
2. {{shouldApplySchemaFrom()}} has an unused argument
3. {{requestQueue}} could be an {{ArrayDeque}} instead of a {{LinkedList}} - should set a good example for anyone randomly reading this code, even if it's not critical to do the right thing in this context

EDIT: LGTM, +1, ship it


was (Author: iamaleksey):
Left a small comment on the 3.0 branch. Also, the following nits for {{MigrationCoordinator}}:
1. A bunch of unused imports
2. {{shouldApplySchemaFrom()}} has an unused argument
3. {{requestQueue}} could be an {{ArrayDeque}} instead of a {{LinkedList}} - should set a good example for anyone randomly reading this code, even if it's not critical to do the right thing in this context

> Wait for schema agreement rather than in flight schema requests when bootstrapping
> ----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15158
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Schema
>            Reporter: Vincent White
>            Assignee: Blake Eggleston
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0-beta
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of
> in-flight schema pull requests, and we don't proceed with
> bootstrapping/streaming until all the latches are released (or we time out
> waiting for each one).
> One issue with this is that if we have a large schema, or the retrieval of
> the schema from the other nodes is unexpectedly slow, we have no explicit
> check in place to ensure we have actually received a schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the
> node to wait on each latch longer, there are cases where this doesn't help
> because the callbacks for the schema pull requests have expired off the
> messaging service's callback map
> (org.apache.cassandra.net.MessagingService#callbacks) after
> request_timeout_in_ms (default 10 seconds), before the other nodes were able
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the
> rest of the live nodes before proceeding with bootstrapping. It also adds a
> check to prevent the new node from flooding existing nodes with simultaneous
> schema pull requests, as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters
> from getting stuck for extended periods as they wait
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the
> timed-out callbacks.
>
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
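
For anyone reading this thread later, the two ideas in the description (wait for schema agreement with the live nodes, and cap simultaneous schema pull requests) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the patch: the class and method names ({{SchemaAgreementSketch}}, {{inAgreement}}, {{awaitAgreement}}, {{PullRequestLimiter}}) are hypothetical, schema versions are assumed to be UUIDs, and the local and peer versions are assumed to come from caller-provided suppliers rather than from gossip. The pending queue uses an {{ArrayDeque}}, in the spirit of nit 3 above.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Queue;
import java.util.UUID;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from the Cassandra code base.
final class SchemaAgreementSketch
{
    // True when every live peer reports the same schema version as the local node.
    // An empty peer collection counts as agreement (there is nobody to disagree with).
    static boolean inAgreement(UUID localVersion, Collection<UUID> liveVersions)
    {
        if (localVersion == null)
            return false; // no schema received yet
        for (UUID peer : liveVersions)
            if (!localVersion.equals(peer))
                return false;
        return true;
    }

    // Polls the two suppliers until the versions agree or the timeout elapses.
    // Returns true on agreement, false on timeout.
    static boolean awaitAgreement(Supplier<UUID> local,
                                  Supplier<Collection<UUID>> live,
                                  long timeoutMillis,
                                  long pollMillis) throws InterruptedException
    {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true)
        {
            if (inAgreement(local.get(), live.get()))
                return true;
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0)
                return false;
            long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
            Thread.sleep(Math.min(pollMillis, remainingMillis));
        }
    }

    // Caps the number of schema pulls in flight; extra requests wait in a FIFO queue.
    static final class PullRequestLimiter<T>
    {
        private final int maxInFlight;
        private int inFlight;
        private final Queue<T> pending = new ArrayDeque<>();

        PullRequestLimiter(int maxInFlight)
        {
            if (maxInFlight < 1)
                throw new IllegalArgumentException("maxInFlight must be >= 1");
            this.maxInFlight = maxInFlight;
        }

        // Returns true if the caller may send the request now; otherwise it is queued.
        synchronized boolean submit(T endpoint)
        {
            if (inFlight < maxInFlight)
            {
                inFlight++;
                return true;
            }
            pending.add(endpoint);
            return false;
        }

        // Call when a request finishes (response or timeout). Returns the next queued
        // endpoint, which inherits the freed slot, or null if nothing is waiting.
        synchronized T complete()
        {
            if (inFlight == 0)
                throw new IllegalStateException("no request in flight");
            T next = pending.poll();
            if (next == null)
                inFlight--;
            return next;
        }
    }
}
```

The point of the sketch is that bootstrap is gated on an explicit "do we actually have the same schema as everyone else" check with a single deadline, rather than on per-request latches that can be orphaned when a callback expires. A caller would invoke {{awaitAgreement}} before starting to stream and decide what to do when it returns false.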