[jira] [Commented] (CASSANDRA-10969) long-running cluster sees bad gossip generation when a node restarts

Joel Knighton (JIRA) Fri, 15 Jan 2016 10:51:06 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102281#comment-15102281
 ]


Joel Knighton commented on CASSANDRA-10969:
-------------------------------------------

Yes, this is the situation described in this ticket.

In 2.1.2, functionality to prevent gossip corruption is incorrectly 
implemented. This can harm the ability of long-running clusters to function 
properly after a restart.

The generation for local nodes is stored in memory. When a node starts, it will 
receive the generations of other nodes through gossip.

After a restart, this problem can occur when a node (say, Node C) first gossips 
with a different remote node (say, Node A) that has the old generation for a 
remote node (say, Node B). Then, Node C can no longer get gossip updates for 
Node B. If Node C had first gossiped with Node B after the restart, then Node C 
will be fine and can continue to receive gossip updates from Node B.
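
As a rough illustration of the ordering problem above, here is a toy model in 
Python. It is not Cassandra's Gossiper code: the Node class, the receive 
method, and restart_and_gossip are hypothetical names, and the check assumes 
the one-year threshold is 365 days expressed in seconds. The two generation 
values are the ones from the log line in the ticket description.

```python
# Toy model only; names and the exact form of the check are assumptions.
ONE_YEAR = 86400 * 365  # assumed threshold, in seconds

OLD_GEN = 1414613355  # stale generation for Node B (from the ticket's log line)
NEW_GEN = 1450978722  # Node B's current generation (from the ticket's log line)


class Node:
    def __init__(self, name):
        self.name = name
        self.generations = {}  # peer name -> generation stored in memory

    def receive(self, peer, generation):
        """Apply a gossiped generation for `peer`; return True if accepted."""
        local = self.generations.get(peer)
        if local is not None and generation > local + ONE_YEAR:
            # Treated as an invalid generation: the stored value is kept.
            return False
        if local is None or generation > local:
            self.generations[peer] = generation
        return True


def restart_and_gossip(first_generation_seen_for_b):
    """Restart Node C with empty state, then gossip twice about Node B."""
    c = Node("C")
    # First gossip round after the restart: whatever C hears first is stored.
    c.receive("B", first_generation_seen_for_b)
    # Later, Node B's real generation arrives.
    return c.receive("B", NEW_GEN)


stuck = restart_and_gossip(OLD_GEN)  # C heard about B from Node A first
fine = restart_and_gossip(NEW_GEN)   # C heard from Node B directly first
```

In this model, stuck is False and fine is True: the outcome depends only on 
which generation for Node B reaches Node C first after the restart.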

Eventually, rolling restarts of the cluster will solve the issue. It may take 
several rolling restarts since a node may first gossip with a node with old 
stored generations, but this will eventually resolve the problem (with 
increasing probability of success, as fewer nodes in the cluster will have old 
generations stored).

If you ran into this problem with a development or local cluster, you could 
accelerate this process by restarting the whole cluster at once, but this is 
clearly unacceptable for a production cluster.

> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a 
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>             Fix For: 3.3, 2.1.x, 2.2.x, 3.0.x
>
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my 
> control) restarted.  The remaining nodes are logging errors like this:
>     "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local 
> generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the 
> one-year threshold added for CASSANDRA-8113.  The system clocks are 
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and 
> 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem, 
> whence the minor priority.
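
For reference, the arithmetic behind the log line quoted in the description 
can be checked with a few lines of Python. This is only a sketch: it assumes 
the one-year threshold is 365 days expressed in seconds, and the variable 
names are illustrative.

```python
# Values taken from the log line in the description.
local_generation = 1414613355
received_generation = 1450978722

# Assumption: the one-year threshold is 365 days, in seconds.
one_year_seconds = 86400 * 365

gap_seconds = received_generation - local_generation
gap_days = gap_seconds / 86400

exceeds_threshold = gap_seconds > one_year_seconds
print(gap_seconds, round(gap_days, 1), exceeds_threshold)
```

The gap is 36,365,367 seconds, roughly 421 days, which is larger than the 
31,536,000 seconds in a 365-day year.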



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
