[ https://issues.apache.org/jira/browse/CASSANDRA-13569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061094#comment-16061094 ]

Michael Fong commented on CASSANDRA-13569:
------------------------------------------

Hi, [~spo...@gmail.com]

I agree with you that even using a ScheduledExecutor for MigrationTask would
fail in rare cases.

In CASSANDRA-11748, we patched our own v2.0 source code with a similar idea
that limits schema pulls to once per endpoint. However, we later observed a
corner case in which two nodes with different schema versions boot up at the
same time, with one node running slightly (a few seconds) ahead of the other.
The first node requests a schema pull, which fails because the other node has
not yet finished initialization.

There is a huge difference between the v2.0 and 3.x code bases, and I do not
know whether this corner case still persists. Here is the problematic code
snippet for your reference.
{code:java}
if (epState == null) {
    // ...
}
{code}
This check would probably not prevent the problem. In your patch, if the
ScheduledFuture reports that it is done, things could get much messier, since
the schema migration would then never happen.
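The idea discussed in this ticket (do not schedule another pull while one for the same endpoint is still pending) can be sketched in plain Java as below. This is a minimal illustration under stated assumptions, not the CASSANDRA-13569 patch: `SchemaPullScheduler`, `maybeSchedulePull` and the `pullSchema` callback are hypothetical names, endpoints are modeled as strings rather than `InetAddress`, and `delayMillis` stands in for `MIGRATION_DELAY_IN_MS`. What it shows is that only a future that is not yet done should suppress a new pull; a completed one, including one whose pull failed because the peer was still initializing, gets replaced.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Hypothetical sketch (not Cassandra's MigrationManager API): keep at most one
 * pending schema pull per endpoint. Endpoints are plain strings here; the real
 * code would key on InetAddress.
 */
final class SchemaPullScheduler {
    private final ScheduledExecutorService executor;
    private final long delayMillis;            // stands in for MIGRATION_DELAY_IN_MS
    private final Consumer<String> pullSchema; // hypothetical callback doing the actual pull
    private final ConcurrentMap<String, ScheduledFuture<?>> pulls = new ConcurrentHashMap<>();

    SchemaPullScheduler(ScheduledExecutorService executor, long delayMillis,
                        Consumer<String> pullSchema) {
        this.executor = executor;
        this.delayMillis = delayMillis;
        this.pullSchema = pullSchema;
    }

    /**
     * Called for every schema mismatch reported for {@code endpoint}.
     * Returns the newly scheduled future, or null if an earlier pull has not finished yet.
     */
    ScheduledFuture<?> maybeSchedulePull(String endpoint) {
        ScheduledFuture<?>[] created = new ScheduledFuture<?>[1];
        // compute() runs atomically per key, so concurrent gossip callbacks
        // for the same endpoint cannot both schedule a task.
        pulls.compute(endpoint, (ep, previous) -> {
            // Skip only while the earlier task is still waiting or running.
            // A done future (completed, failed or cancelled) must not block a
            // new pull, or a pull that failed because the peer was still
            // booting would never be retried.
            if (previous != null && !previous.isDone()) {
                return previous;
            }
            created[0] = executor.schedule(() -> pullSchema.accept(ep),
                                           delayMillis, TimeUnit.MILLISECONDS);
            return created[0];
        });
        return created[0];
    }
}
```

With this design, a failed pull is retried only when gossip reports another mismatch for that endpoint, and a mismatch reported while a pull is actually running is dropped; whether that is sufficient in Cassandra depends on how often the mismatch is re-reported, which this sketch does not model.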

Sincerely,

Michael Fong


> Schedule schema pulls just once per endpoint
> --------------------------------------------
>
>                 Key: CASSANDRA-13569
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13569
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Distributed Metadata
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>             Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Schema mismatches detected through gossip will get resolved by calling 
> {{MigrationManager.maybeScheduleSchemaPull}}. This method may decide to 
> schedule execution of {{MigrationTask}}, but only after using a 
> {{MIGRATION_DELAY_IN_MS = 60000}} delay (for reasons unclear to me). 
> Meanwhile, as long as the migration task hasn't been executed, we'll continue 
> to have schema mismatches reported by gossip and will have corresponding 
> {{maybeScheduleSchemaPull}} calls, which will schedule further tasks with the 
> mentioned delay. Some local testing shows that dozens of tasks for the same
> endpoint will eventually be executed, causing the same stormy behavior
> for this very endpoint.
> My proposal would be to simply not schedule new tasks for the same endpoint
> while we still have pending tasks waiting for execution after
> {{MIGRATION_DELAY_IN_MS}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

