[
https://issues.apache.org/jira/browse/CASSANDRA-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273471#comment-16273471
]
Jeff Jirsa commented on CASSANDRA-13968:
----------------------------------------
Marking Jason as reviewer since he was silly enough to suggest he may be
willing to do it.
> Cannot replace a live node on large clusters
> --------------------------------------------
>
> Key: CASSANDRA-13968
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13968
> Project: Cassandra
> Issue Type: Bug
> Components: Coordination
> Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
> Reporter: Joseph Lynch
> Assignee: Joseph Lynch
> Labels: gossip
> Attachments:
> 0001-During-node-replacement-check-for-updates-in-the-tim.patch,
> 0002-Only-fail-replacement-if-we-_know_-the-node-is-up.patch
>
>
> During forced node replacements we very frequently (~every time for large
> clusters) see:
> {noformat}
> ERROR [main] 2017-10-17 06:54:35,680 CassandraDaemon.java:583 - Exception
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
> The old node is dead, the new node that is replacing it thinks it is dead (DN
> state), and all other nodes think it is dead (all have the DN state).
> However, I believe there are two bugs in the "is live" check that can cause
> this error, namely that:
> 1. We sleep for
> [BROADCAST_INTERVAL|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L905]
> (hardcoded 60s on 2.1, on later version configurable but still 60s by
> default), but
> [check|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L919]
> for an update in the last RING_DELAY seconds (typically set to 30s). When a
> fresh node is joining, in my experience, [the
> schema|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L859]
> check almost immediately returns true after gossiping with seeds, so in
> reality we do not even sleep for RING_DELAY. If operators increase ring delay
> past broadcast_interval (as you might do if you think you are victim to the
> second bug below), then you guarantee that you will always get the exception
> because the gossip update is basically guaranteed to happen in the last
> RING_DELAY seconds since you didn't sleep for that duration (you slept for
> broadcast). For example if an operator sets ring delay to 300s, then the
> check says "oh yea, the last update was 59 seconds ago, which is sooner than
> 300s, so fail".
> 2. We don't actually check that the node is alive, we just check that a
> gossip update has happened in the last X seconds. Sometimes with large
> clusters nodes are still converging on the proper generation/version of a
> dead node, and the "is live" check prevents an operator from replacing the
> node until gossip has settled on the cluster regarding the dead node, which
> for large clusters can take a really long time. This can be really hurtful to
> availability in cloud environments and every time I've seen this error it's
> the case that the new node believes that the old node is down (since
> [markAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L954]
> [marks
> dead|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L962]
> first and then triggers a callback to
> [realMarkAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L975]
> which never triggers because the old node is actually down).
> I think that #1 is definitely a bug, #2 might be considered an extra safety"
> feature (that you don't allow replacement during gossip convergence), but
> given that the operator took the effort to supply the replace_address flag, I
> think it's prudent to only fail if we really know something is wrong.
> I've attached two patches against 2.1, one that fixes bug #1 and one that
> fixes (imo) bug #2. I was thinking for #1 that we may want to prevent the
> schema check from exiting the RING_DELAY sleep early but maybe it's just
> better to backport configurable broadcast_interval and pick the maximum or
> something. If we don't like the way I've worked around #2, maybe I could make
> it an option that operators could turn on if they wanted? If folks are happy
> with the approach I can attach patches for 2.2, 3.0, and 3.11.
> A relevant example of a log showing the first bug (in this case the node that
> was being replaced was drained moving it to shutdown before replacement, and
> ring delay was forced to 300s because the cluster is very large):
> {noformat}
> INFO [ScheduledTasks:1] 2017-10-17 06:41:21,325 StorageService.java:189 -
> Overriding RING_DELAY to 300000ms
> INFO [main] 2017-10-17 06:41:24,240 StorageService.java:508 - Gathering
> node replacement information for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:25,198 Gossiper.java:1032 - Node
> /OLD_ADDRESS is now part of the cluster
> INFO [GossipStage:1] 2017-10-17 06:41:25,200 Gossiper.java:1011 -
> InetAddress /OLD_ADDRESS is now DOWN
> INFO [main] 2017-10-17 06:41:25,617 StorageService.java:1164 - JOINING:
> waiting for ring information
> INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING:
> schema complete, ready to bootstrap
> INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING:
> waiting for pending range calculation
> INFO [main] 2017-10-17 06:41:29,718 StorageService.java:1164 - JOINING:
> calculation complete, ready to bootstrap
> INFO [GossipStage:1] 2017-10-17 06:41:30,606 Gossiper.java:1032 - Node
> /OLD_ADDRESS is now part of the cluster
> INFO [GossipStage:1] 2017-10-17 06:41:30,606 TokenMetadata.java:464 -
> Updating topology for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:30,607 TokenMetadata.java:464 -
> Updating topology for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:30,614 Gossiper.java:1011 -
> InetAddress /OLD_ADDRESS is now DOWN
> ERROR [main] 2017-10-17 06:42:29,722 CassandraDaemon.java:583 - Exception
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]