[ https://issues.apache.org/jira/browse/CASSANDRA-10969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102281#comment-15102281 ]
Joel Knighton commented on CASSANDRA-10969:
-------------------------------------------

Yes, this is the situation described in this ticket. In 2.1.2, functionality to prevent gossip corruption is incorrectly implemented. This can harm the ability of long-running clusters to function properly after a restart.

The generation for local nodes is stored in memory. When a node starts, it receives the generations of other nodes through gossip. After a restart, the problem can occur when a node (say, Node C) first gossips with a remote node (say, Node A) that holds the old generation for another remote node (say, Node B). From then on, Node C can no longer get gossip updates for Node B. If Node C had instead gossiped with Node B first after the restart, Node C would be fine and would continue to receive gossip updates from Node B.

Eventually, rolling restarts of the cluster will solve the issue. It may take several rolling restarts, since a node may first gossip with a node that still has old generations stored, but this will eventually resolve the problem (with increasing probability of success, as fewer nodes in the cluster will have old generations stored). If you ran into this problem with a development or local cluster, you could accelerate the process by restarting the whole cluster at once, but this is clearly unacceptable for a production cluster.

> long-running cluster sees bad gossip generation when a node restarts
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-10969
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10969
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: 4-node Cassandra 2.1.1 cluster, each node running on a
> Linux 2.6.32-431.20.3.dl6.x86_64 VM
>            Reporter: T. David Hudson
>            Assignee: Joel Knighton
>            Priority: Minor
>             Fix For: 3.3, 2.1.x, 2.2.x, 3.0.x
>
>
> One of the nodes in a long-running Cassandra 2.1.1 cluster (not under my
> control) restarted.
> The remaining nodes are logging errors like this:
> "received an invalid gossip generation for peer xxx.xxx.xxx.xxx; local
> generation = 1414613355, received generation = 1450978722"
> The gap between the local and received generation numbers exceeds the
> one-year threshold added for CASSANDRA-8113. The system clocks are
> up-to-date for all nodes.
> If this is a bug, the latest released Gossiper.java code in 2.1.x, 2.2.x, and
> 3.0.x seems not to have changed the behavior that I'm seeing.
> I presume that restarting the remaining nodes will clear up the problem,
> whence the minor priority.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
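
For readers who want to check the numbers in the quoted log line, here is a rough sketch of the behavior described in this comment. It is a simplified model, not the code in Gossiper.java: the function names, the `None`-means-unknown convention, and the exact threshold constant (one year taken as 365 days in seconds) are assumptions made for illustration. It shows that the gap between the two generations in the log exceeds one year, and that the outcome for Node C depends on which peer it hears about Node B from first.

```python
# Assumption: "one-year threshold" is modeled as 365 days in seconds.
MAX_GENERATION_DIFFERENCE = 365 * 24 * 60 * 60  # 31,536,000

# Values taken from the log line quoted in the issue description.
OLD_GENERATION = 1414613355   # "local generation"
NEW_GENERATION = 1450978722   # "received generation"


def exceeds_threshold(stored: int, received: int) -> bool:
    """True when the received generation is more than one year ahead of the stored one."""
    return received > stored + MAX_GENERATION_DIFFERENCE


def apply_gossip(stored, received):
    """Hypothetical update rule for the generation one node holds for a peer.

    stored is None when the node knows nothing about the peer yet
    (for example, right after a restart).
    """
    if stored is None:
        return received            # first information about the peer is accepted
    if exceeds_threshold(stored, received):
        return stored              # rejected as an "invalid gossip generation"
    return max(stored, received)   # otherwise keep the newer generation


gap = NEW_GENERATION - OLD_GENERATION

# Node C restarts and first hears about Node B from Node A (old generation),
# then from Node B itself: the real generation is rejected.
c_view_a_first = apply_gossip(apply_gossip(None, OLD_GENERATION), NEW_GENERATION)

# Node C restarts and first hears from Node B directly, then from Node A:
# the current generation is kept.
c_view_b_first = apply_gossip(apply_gossip(None, NEW_GENERATION), OLD_GENERATION)

print(f"gap = {gap} s, threshold = {MAX_GENERATION_DIFFERENCE} s")
print(f"C's view of B, A first: {c_view_a_first}")
print(f"C's view of B, B first: {c_view_b_first}")
```

In this model the first ordering leaves Node C holding the old generation for Node B, which matches the stuck state described above, while the second ordering leaves it with the current one.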