Nodes can get stuck in UP state forever, despite being DOWN
-----------------------------------------------------------

                 Key: CASSANDRA-3626
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3626
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Peter Schuller
            Assignee: Peter Schuller


This is a proposed phrasing for an upstream ticket named "Newly discovered 
nodes that are down get stuck in UP state forever" (will edit w/ feedback until 
done):

We have observed a problem with gossip: when you bootstrap a new node (or 
replace one using the replace_token support), any node in the cluster which is 
Down at the time the new node is started will be assumed to be Up, and then 
*never ever* flapped back to Down until you restart the new node.

This has at least two implications for replacing or bootstrapping new nodes 
when there are nodes down in the ring:

* If the new node happens to select a node that is listed as UP but is in 
reality DOWN as a stream source, streaming will sit there hanging forever.
* If that doesn't happen (because another host is picked), it will instead 
finish bootstrapping correctly and begin servicing requests, all the while 
thinking the DOWN nodes are UP, and thus routing requests to them and 
generating timeouts.

The way to get out of this is to restart the node(s) that you bootstrapped.

I have tested and confirmed the symptom (that the bootstrapped node thinks 
other nodes are Up) using a fairly recent 1.0. The main debugging effort 
happened on 0.8, however, so all details below refer to 0.8, but they are 
probably similar in 1.0.

Steps to reproduce:

* Bring up a cluster of >= 3 nodes. *Ensure RF is < N*, so that the cluster is 
operative with one node removed.
* Pick two random nodes A, and B. Shut them *both* off.
* Wait for everyone to realize they are both off (for good measure).
* Now, take node A, nuke its data directories, and restart it so that it comes 
up with a normal bootstrap (or use replace_token; I didn't test that, but it 
should not affect the outcome).
* Watch how node A starts up, all the while believing node B is Up, even 
though all other nodes in the cluster agree that B is Down and B is in fact 
still turned off.

The mechanism by which the down node initially goes into the Up state is that 
the bootstrapping node receives a gossip response from some other node in the 
cluster, and GossipDigestAck2VerbHandler.doVerb() calls 
Gossiper.applyStateLocally().

Gossiper.applyStateLocally() doesn't have any local endpoint state for the 
down node, so the else branch at the end ("it's a new node") gets triggered 
and handleMajorStateChange() is called. handleMajorStateChange() always calls 
markAlive(), unless the state is a dead state (where "dead" does not mean 
"not up", but refers to joining/hibernate etc.).

So at this point the node is up in the mind of the node you just bootstrapped.
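
For reference, here is a much-simplified toy model of that path, paraphrased 
from the description above. It is hypothetical illustration code, not the 
actual 0.8 Gossiper source; the class and map names are only stand-ins:

import java.util.HashMap;
import java.util.Map;

// Toy model: an endpoint we have never seen before arrives via
// applyStateLocally(), and handleMajorStateChange() marks it alive
// unconditionally unless it is in a "dead" gossip state (joining/hibernate/etc.).
class GossiperModel
{
    static class EndpointState
    {
        final String status; // e.g. "NORMAL", "hibernate", ...
        EndpointState(String status) { this.status = status; }
    }

    final Map<String, EndpointState> endpointStateMap = new HashMap<>();
    final Map<String, Boolean> alive = new HashMap<>();

    void applyStateLocally(Map<String, EndpointState> remoteStates)
    {
        for (Map.Entry<String, EndpointState> entry : remoteStates.entrySet())
        {
            if (!endpointStateMap.containsKey(entry.getKey()))
            {
                // "it's a new node": no local state at all for this endpoint
                handleMajorStateChange(entry.getKey(), entry.getValue());
            }
            // else: generations/versions are compared and the failure detector
            // may be report()ed for the endpoint (omitted in this model)
        }
    }

    void handleMajorStateChange(String endpoint, EndpointState state)
    {
        endpointStateMap.put(endpoint, state);
        if (!isDeadState(state))
            alive.put(endpoint, true); // markAlive(): Up, with no evidence of reachability
    }

    boolean isDeadState(EndpointState state)
    {
        return state.status.equals("hibernate") || state.status.equals("LEFT");
    }
}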

Now, in each gossip round doStatusCheck() is called, which iterates over all 
nodes (including the one falsely marked Up) and, among other things, calls 
FailureDetector.interpret() on each node.

FailureDetector.interpret() is meant to update its sense of Phi for the node, 
and potentially convict it. However there is a short-circuit at the top, 
whereby if we do not yet have any arrival window for the node, we simply return 
immediately.

Arrival intervals are only added as a result of a FailureDetector.report() 
call, which never happens in this case because the initial endpoint state we 
added, which came from a remote node that was up, had the latest version of the 
gossip state (so Gossiper.reportFailureDetector() will never call report()).

The result is that the node can never ever be convicted.
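
To make the dead-end concrete, here is a similarly simplified, hypothetical 
model of the report()/interpret() interaction described above. The 
CONVICT_AFTER_MS constant is just a stand-in for the real Phi calculation; 
this is not how the actual FailureDetector is implemented:

import java.util.HashMap;
import java.util.Map;

// Toy model of the conviction dead-end: report() is the only thing that creates
// an arrival window, but it is never called for the stale endpoint, so
// interpret() returns early on every gossip round and convict() can never fire.
class FailureDetectorModel
{
    private final Map<String, Long> arrivalWindows = new HashMap<>();
    private static final long CONVICT_AFTER_MS = 10_000; // stand-in for the Phi threshold

    // Only called when gossip sees a *newer* version of the endpoint's state.
    void report(String endpoint)
    {
        arrivalWindows.put(endpoint, System.currentTimeMillis());
    }

    // Called from doStatusCheck() on every gossip round.
    void interpret(String endpoint)
    {
        Long last = arrivalWindows.get(endpoint);
        if (last == null)
            return; // the short-circuit: no arrival window, so never convict
        if (System.currentTimeMillis() - last > CONVICT_AFTER_MS)
            convict(endpoint);
    }

    void convict(String endpoint)
    {
        System.out.println(endpoint + " is now considered DOWN");
    }
}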

Now, let's ignore for a moment the problem that a node which is actually Down 
will be thought to be Up for a little while. That is sub-optimal, but let's 
aim in this ticket for a fix to the more serious problem - that it stays Up 
forever.

Considered solutions:

* When interpret() gets called and there is no arrival window, we could add a 
faked arrival window far back in time to cause the node to have history and be 
marked down. This "works" in the particular test case. The problem is that, 
since we are not ourselves actively trying to gossip to these nodes with any 
particular speed, it might take a significant time before we get confirmation 
from someone else that the node is actually Up in cases where it *is* Up, so 
it's not clear that this is a good idea.

* When interpret() gets called and there is no arrival window, we can simply 
convict it immediately. This has roughly similar behavior as the previous 
suggestion.

* When interpret() gets called and there is no arrival window, we can add a 
faked arrival window at the current time, which will allow the node to be 
treated as Up until the usual time has passed for exceeding the Phi conviction 
threshold (see the sketch after this list).

* When interpret() gets called and there is no arrival window, we can 
immediately convict it, *and* schedule it for immediate gossip on the next 
round in order to try to ensure it goes Up quickly if it is indeed up. This 
has an effect of O(n) gossip traffic, as a special case during node start-up. 
While theoretically a problem, I personally think we can ignore it for now 
since it won't be a significant problem any time soon. However, this is more 
complicated, since messages are queued asynchronously with respect to 
background connection attempts. We'd have to make sure the initial gossip 
message actually gets sent on an open TCP connection (I haven't confirmed 
whether this will be the case or not).
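
To illustrate the third option against the toy FailureDetectorModel above 
(again purely a sketch under the same stand-in names, not a patch against the 
real FailureDetector), the change would amount to fabricating an arrival 
window "now" when none exists:

    // Sketch of option 3: with no arrival window, pretend we heard from the
    // endpoint just now; it stays Up for one normal conviction interval and is
    // then convicted by the usual logic if we still never hear from it.
    void interpret(String endpoint)
    {
        Long last = arrivalWindows.get(endpoint);
        if (last == null)
        {
            last = System.currentTimeMillis(); // fake arrival window at the current time
            arrivalWindows.put(endpoint, last);
        }
        if (System.currentTimeMillis() - last > CONVICT_AFTER_MS)
            convict(endpoint);
    }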

The first three are simple to implement, and possibly the fourth as well. But 
in all cases, I am worried about potential negative consequences that I am not 
seeing.

Thoughts?

