[jira] [Commented] (CASSANDRA-8336) Quarantine nodes after receiving the gossip shutdown message

Brandon Williams (JIRA) Fri, 20 Feb 2015 10:17:26 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14329286#comment-14329286 ]


Brandon Williams commented on CASSANDRA-8336:
---------------------------------------------

bq. The shutting down node might as well set the version of the shutdown state 
to Integer.MAX_VALUE since receiving nodes will blindly use that.

Well, as I explained in an earlier comment, this isn't really much of an 
optimization, and if the nodes receive the RPC first, we have to modify it on 
the receiver anyway, so it seems cleaner to reuse markAsShutdown for both.
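
To illustrate the point above, here is a minimal sketch of a single receiver-side path, assuming a heavily simplified endpoint state. The class ShutdownStateSketch, its fields, and the onShutdownMessage/onGossipedStatus helpers are hypothetical and are not the actual Gossiper code; only the name markAsShutdown and the use of Integer.MAX_VALUE come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model; not Cassandra's real Gossiper. */
final class ShutdownStateSketch {
    static final class EndpointState {
        String status = "NORMAL";
        int statusVersion = 0;
        boolean alive = true;
    }

    private final Map<String, EndpointState> endpoints = new HashMap<>();

    EndpointState stateFor(String endpoint) {
        return endpoints.computeIfAbsent(endpoint, e -> new EndpointState());
    }

    /** One receiver-side path, used however the shutdown news arrived. */
    void markAsShutdown(String endpoint) {
        EndpointState s = stateFor(endpoint);
        s.status = "shutdown";
        // Pin the version on the receiver so no later version can supersede it.
        s.statusVersion = Integer.MAX_VALUE;
        s.alive = false;
    }

    /** Direct shutdown message: there is no gossiped version to trust yet. */
    void onShutdownMessage(String from) {
        markAsShutdown(from);
    }

    /** Status learned through ordinary gossip. */
    void onGossipedStatus(String endpoint, String status, int version) {
        if ("shutdown".equals(status)) {
            markAsShutdown(endpoint); // same path as the direct message
            return;
        }
        EndpointState s = stateFor(endpoint);
        if (version > s.statusVersion) {
            s.status = status;
            s.statusVersion = version;
        }
    }
}
```

Since the receiver has to rewrite the version when the direct message arrives first, having the sender also set it would only cover one of the two paths.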

bq. Why does it increment the generation number? We call 
Gossiper.instance.start with a new generation number set to the current time so 
it would make sense to use that.

Because start calls maybeInitializeLocalState, which won't actually add the 
current-time heartbeat: as the method name says, it only adds the new state 
if the gossiper has never been started before (meaning we don't know our own 
state).
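
A rough sketch of that behaviour, assuming the local state lives in a map keyed by endpoint. LocalStateSketch, forceNewerGeneration, and localGeneration are hypothetical names for illustration; only start and maybeInitializeLocalState are named in the discussion, and their real signatures may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of "initialize only if absent"; not the real Gossiper. */
final class LocalStateSketch {
    private final Map<String, Integer> generations = new HashMap<>();
    private final String localEndpoint;

    LocalStateSketch(String localEndpoint) {
        this.localEndpoint = localEndpoint;
    }

    /** Starting again passes a fresh generation, but it may be ignored. */
    void start(int generation) {
        maybeInitializeLocalState(generation);
    }

    /** Only stores the generation if no local state exists yet. */
    void maybeInitializeLocalState(int generation) {
        generations.putIfAbsent(localEndpoint, generation);
    }

    /** Assumed workaround: bump the existing generation explicitly. */
    void forceNewerGeneration() {
        generations.merge(localEndpoint, 1, Integer::sum);
    }

    /** Assumes start has been called at least once. */
    int localGeneration() {
        return generations.get(localEndpoint);
    }
}
```

In this model a second call to start leaves the original generation in place, which is why an explicit increment is needed for peers to see a newer generation.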

bq. If we hit 'Unable to gossip with any seeds' on replace, it shuts down the 
gossiper. This throws an AssertionError in addLocalApplicationState since the 
local epState is null.

Hmm, probably the best thing to do there is change it from stop to 
stopForLeaving (though that method needs a better name now) since there's no 
point in sending shutdown notifications for a node that isn't a member.

> Quarantine nodes after receiving the gossip shutdown message
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-8336
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8336
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brandon Williams
>            Assignee: Brandon Williams
>             Fix For: 2.0.13
>
>         Attachments: 8336-v2.txt, 8336-v3.txt, 8336.txt
>
>
> In CASSANDRA-3936 we added a gossip shutdown announcement.  The problem here 
> is that this isn't sufficient; you can still get TOEs and have to wait on the 
> FD to figure things out.  This happens due to gossip propagation time and 
> variance; if node X shuts down and sends the message to Y, but Z has a 
> greater gossip version than Y for X and has not yet received the message, it 
> can initiate gossip with Y and thus mark X alive again.  I propose 
> quarantining to solve this; however, I feel it should be a -D parameter you 
> have to specify, so as not to destroy current dev and test practices, since 
> this will mean a node that shuts down will not be able to restart until the 
> quarantine expires.
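
The race described in the issue, and the effect of a quarantine, can be sketched from node Y's point of view as follows. This is an illustrative model only: QuarantineSketch, its methods, the explicit clock argument, and the opt-in boolean (standing in for the proposed -D parameter) are all assumptions, not the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical view of node Y's knowledge about its peers. */
final class QuarantineSketch {
    private final boolean quarantineEnabled; // stands in for the -D switch
    private final long quarantineMillis;
    private final Map<String, Integer> versions = new HashMap<>();
    private final Map<String, Boolean> alive = new HashMap<>();
    private final Map<String, Long> quarantinedUntil = new HashMap<>();

    QuarantineSketch(boolean quarantineEnabled, long quarantineMillis) {
        this.quarantineEnabled = quarantineEnabled;
        this.quarantineMillis = quarantineMillis;
    }

    /** X told us directly that it is shutting down. */
    void onShutdownMessage(String endpoint, long nowMillis) {
        alive.put(endpoint, false);
        if (quarantineEnabled) {
            quarantinedUntil.put(endpoint, nowMillis + quarantineMillis);
        }
    }

    /** Another node (Z) gossips what it knows about the endpoint. */
    void onGossip(String endpoint, int version, long nowMillis) {
        Long until = quarantinedUntil.get(endpoint);
        if (until != null && nowMillis < until) {
            return; // quarantined: ignore stale news from slower peers
        }
        int known = versions.getOrDefault(endpoint, -1);
        if (version > known) {
            versions.put(endpoint, version);
            alive.put(endpoint, true); // a newer version marks the node alive
        }
    }

    boolean isAlive(String endpoint) {
        return alive.getOrDefault(endpoint, false);
    }
}
```

Without the quarantine, a higher version arriving from Z after the shutdown message flips X back to alive. With it, that update is dropped until the window expires, which is also why a quarantined node cannot rejoin until then.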



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
