[ 
https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865504#comment-15865504
 ] 

Stefan Podkowinski commented on CASSANDRA-12653:
------------------------------------------------

bq. Stefan Podkowinski - is there some deeper purpose of moving the 
FD.instance.isAlive() check higher in MigrationTask#runMayThrow() method beyond 
"check to see if it's dead before we bother checking to see if it's worth 
sending a migration task"? Is there a reason we don't let 
MM#shouldPullSchemaFrom return false if FD says the instance is dead?

We could move FD.isAlive into MM.shouldPullSchemaFrom, yes. I'm not totally 
against it, but the log message in MigrationTask for a false return value would 
have to be changed, and the isAlive status should really only be relevant at 
task execution, as there's a 60 second delay after submitting the task. So in 
theory you could submit a task for a node that is dead at submission time but 
alive again at execution time.
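
The timing concern above can be sketched as follows. This is a minimal, 
hypothetical illustration (the class and method names are illustrative, not 
Cassandra's actual API): the liveness predicate is evaluated when the delayed 
task runs, not when it is submitted, so a node that was down at submission can 
still be pulled from if it is back up at execution.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not Cassandra's real MigrationTask: the liveness
// check belongs at execution time because of the delay between submitting
// a migration task and running it.
public class DeferredLivenessCheck {
    // Runs the schema pull only if the endpoint is alive *now*.
    static boolean runIfAlive(AtomicBoolean alive, Runnable pull) {
        if (!alive.get()) {
            return false; // endpoint considered dead; skip the pull
        }
        pull.run();
        return true;
    }

    public static void main(String[] args) throws Exception {
        AtomicBoolean alive = new AtomicBoolean(false); // dead at submission
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        Callable<Boolean> task = () -> runIfAlive(alive, () -> {});
        ScheduledFuture<Boolean> f = exec.schedule(task, 50, TimeUnit.MILLISECONDS);
        alive.set(true); // node comes back up during the delay
        System.out.println("pulled=" + f.get());
        exec.shutdown();
    }
}
```

Checking isAlive inside shouldPullSchemaFrom (i.e. at submission time) would 
instead freeze the stale answer from before the delay.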

bq. Given that the shadow round is meant to just get ring state without 
changing anything, should we add an explicit check to 
MigrationManager#scheduleSchemaPull() to ensure that 
Gossiper.instance.isInShadowRound() is false before scheduling?

The MigrationManager should never issue a schema pull during the shadow round. 
If we add such a check, I'd prefer to throw an exception rather than fail 
silently and let the process continue in an undefined state. On the other hand, 
in terms of separation of concerns, it's not really the MigrationManager's 
business to monitor the gossiper life-cycle.
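
The fail-loudly option could look roughly like this. A minimal sketch with 
hypothetical names (ShadowRoundGuard and its parameters are illustrative, not 
the actual MigrationManager code): if the gossiper reports it is still in its 
shadow round, the scheduling request throws instead of proceeding in an 
undefined state.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: guard a schema-pull request against running during
// the gossip shadow round, throwing instead of failing silently.
public class ShadowRoundGuard {
    static void scheduleSchemaPull(BooleanSupplier inShadowRound, Runnable schedule) {
        if (inShadowRound.getAsBoolean()) {
            // Fail loudly: a schema pull here means an earlier bug upstream.
            throw new IllegalStateException(
                "Schema pull requested during gossip shadow round");
        }
        schedule.run(); // normal path: actually schedule the pull
    }
}
```

The trade-off discussed above still applies: the guard makes the bad state 
visible, but it puts gossiper life-cycle knowledge into the migration path.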

> In-flight shadow round requests
> -------------------------------
>
>                 Key: CASSANDRA-12653
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12653
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Distributed Metadata
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>            Priority: Minor
>             Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x
>
>         Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch
>
>
> Bootstrapping or replacing a node in the cluster requires gathering and 
> checking some host IDs or tokens by doing a gossip "shadow round" once before 
> joining the cluster. This is done by sending a gossip SYN to all seeds until 
> we receive a response with the cluster state, from which we can continue the 
> bootstrap process. Receiving a response marks the shadow round as done and 
> calls {{Gossiper.resetEndpointStateMap}} to clean up the received state 
> again.
> The issue here is that at this point there might be other in-flight requests, 
> and it's very likely that shadow round responses from other seeds will be 
> received afterwards, while the current state of the bootstrap process doesn't 
> expect this to happen (e.g. the gossiper may or may not be enabled). 
> One side effect is that a MigrationTask is spawned for each shadow round 
> reply except the first. Tasks may or may not execute, depending on whether 
> {{Gossiper.resetEndpointStateMap}} had been called by execution time, which 
> affects the outcome of {{FailureDetector.instance.isAlive(endpoint)}} at the 
> start of the task. You'll see error log messages such as the following when 
> this happens:
> {noformat}
> INFO  [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - 
> InetAddress /xx.xx.xx.xx is now UP
> ERROR [MigrationStage:1]    2016-09-08 08:36:39,255 FailureDetector.java:223 
> - unknown endpoint /xx.xx.xx.xx
> {noformat}
> Although it isn't pretty, I currently don't see any serious harm from this, 
> but it would be good to get a second opinion (feel free to close as "won't 
> fix").
> /cc [~Stefania] [~thobbs]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
