[ https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061094#comment-16061094 ]
Michael Fong commented on CASSANDRA-13569:
------------------------------------------

Hi, [~spo...@gmail.com]

I agree with you that even a ScheduledExecutor on MigrationTask would fail in rare cases. In CASSANDRA-11748, we patched our own v2.0 source code with a similar idea that limits schema pulls to once per endpoint. However, we later observed a corner case: when two nodes with different schema versions boot up at the same time, one node may run slightly (a few seconds) ahead of the other. The first node requests a schema pull, and the request fails because the other node has not yet finished initialization.

The v2.0 and 3.x code bases differ substantially, and I do not know whether this corner case still persists. Here is the problematic code snippet for your reference.

{code:java}
if (epState == null) {
{code}
would probably not prevent this. In your patch, if the ScheduledFuture reports that it is done, things could get much messier, since the schema migration would never happen.

Sincerely,

Michael Fong

> Schedule schema pulls just once per endpoint
> --------------------------------------------
>
>                 Key: CASSANDRA-13569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Distributed Metadata
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to
> schedule execution of {{MigrationTask}}, but only after using a
> {{MIGRATION_DELAY_IN_MS = 60000}} delay (for reasons unclear to me).
> Meanwhile, as long as the migration task hasn't been executed, we'll continue
> to have schema mismatches reported by gossip and will have corresponding
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the
> mentioned delay.
> Some local testing shows that dozens of tasks for the same
> endpoint will eventually be executed, causing the same stormy behavior
> for that very endpoint.
> My proposal would be to simply not schedule new tasks for the same endpoint
> while we still have pending tasks waiting for execution after
> {{MIGRATION_DELAY_IN_MS}}.
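
For illustration only, the "one pending pull per endpoint" idea could be sketched roughly as below. This is not the attached patch and not Cassandra's API: the class {{SchemaPullScheduler}}, its methods, and the use of a plain {{String}} for the endpoint are all hypothetical names chosen for the example. The sketch clears the bookkeeping entry once the task has run, even if the pull fails, so a later gossip mismatch can schedule a fresh pull. That is one possible way to avoid the "future is done, so migration never happens again" situation mentioned in the comment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not Cassandra code): at most one pending pull per endpoint. */
final class SchemaPullScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    // Endpoint -> task that is still waiting for its delay or currently running.
    private final ConcurrentMap<String, ScheduledFuture<?>> pending =
            new ConcurrentHashMap<>();
    private final long delayMs;

    SchemaPullScheduler(long delayMs) {
        this.delayMs = delayMs;
    }

    /** Returns true if a new pull was scheduled, false if one is already pending. */
    boolean maybeSchedule(String endpoint, Runnable pull) {
        boolean[] scheduled = {false};
        // computeIfAbsent makes "check and schedule" atomic per endpoint.
        pending.computeIfAbsent(endpoint, ep -> {
            scheduled[0] = true;
            return executor.schedule(() -> {
                try {
                    pull.run();
                } finally {
                    // Always clear the entry, so a failed pull can be retried
                    // the next time gossip reports a mismatch.
                    pending.remove(ep);
                }
            }, delayMs, TimeUnit.MILLISECONDS);
        });
        return scheduled[0];
    }

    /** Number of endpoints that currently have a pending or running pull. */
    int pendingCount() {
        return pending.size();
    }

    /** Cancels outstanding tasks and stops the worker thread. */
    void shutdown() {
        executor.shutdownNow();
    }
}
```

With this shape, repeated {{maybeSchedule}} calls for the same endpoint during the delay window are dropped, while a call made after the task has finished schedules a new pull. Whether a pull that failed because the peer was still initializing should be retried immediately or only on the next gossip round is a design choice this sketch leaves to the caller.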