[ https://issues.apache.org/jira/browse/CASSANDRA-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Jirsa updated CASSANDRA-13968:
-----------------------------------
    Component/s: Coordination

> Cannot replace a live node on large clusters
> --------------------------------------------
>
>                 Key: CASSANDRA-13968
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13968
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>              Labels: gossip
>         Attachments: 0001-During-node-replacement-check-for-updates-in-the-tim.patch, 0002-Only-fail-replacement-if-we-_know_-the-node-is-up.patch
>
>
> During forced node replacements we very frequently (~every time for large clusters) see:
> {noformat}
> ERROR [main] 2017-10-17 06:54:35,680 CassandraDaemon.java:583 - Exception encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
> The old node is dead, the new node that is replacing it thinks it is dead (DN state), and all other nodes think it is dead (all have the DN state). However, I believe there are two bugs in the "is live" check that can cause this error, namely that:
> 1. We sleep for [BROADCAST_INTERVAL|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L905] (hardcoded to 60s on 2.1; configurable in later versions but still 60s by default), but [check|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L919] for an update in the last RING_DELAY seconds (typically set to 30s).
> When a fresh node is joining, in my experience, [the schema|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L859] check returns true almost immediately after gossiping with seeds, so in reality we do not even sleep for RING_DELAY. If operators increase ring delay past broadcast_interval (as you might do if you think you are a victim of the second bug below), then you are guaranteed to always get the exception, because the gossip update is basically guaranteed to have happened in the last RING_DELAY seconds: you didn't sleep for that duration (you slept for broadcast_interval). For example, if an operator sets ring delay to 300s, then the check says "oh yeah, the last update was 59 seconds ago, which is more recent than 300s, so fail".
> 2. We don't actually check that the node is alive; we just check that a gossip update has happened in the last X seconds. Sometimes with large clusters, nodes are still converging on the proper generation/version of a dead node, and the "is live" check prevents an operator from replacing the node until gossip has settled on the cluster regarding the dead node, which for large clusters can take a really long time.
> This can be really hurtful to availability in cloud environments, and every time I've seen this error the new node believes that the old node is down (since [markAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L954] [marks dead|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L962] first and then triggers a callback to [realMarkAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L975], which never triggers because the old node is actually down).
> I think that #1 is definitely a bug. #2 might be considered an extra "safety" feature (that you don't allow replacement during gossip convergence), but given that the operator took the effort to supply the replace_address flag, I think it's prudent to only fail if we really know something is wrong.
> I've attached two patches against 2.1, one that fixes bug #1 and one that fixes (imo) bug #2. I was thinking for #1 that we may want to prevent the schema check from exiting the RING_DELAY sleep early, but maybe it's just better to backport configurable broadcast_interval and pick the maximum or something. If we don't like the way I've worked around #2, maybe I could make it an option that operators could turn on if they wanted? If folks are happy with the approach, I can attach patches for 2.2, 3.0, and 3.11.
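
The two points in the description can be sketched as a pair of predicates. This is a simplified, hypothetical model written for illustration, not code from Cassandra or from the attached patches: the class and method names are assumptions, and the real check lives in StorageService at the lines linked in the description. The first predicate models the recency check from point 1 (fail if a gossip update was seen within the last RING_DELAY); the second models the behaviour argued for in point 2 (fail only if the old node is actually marked alive).

```java
import java.util.concurrent.TimeUnit;

/** Simplified, hypothetical model of the "is live" check; not Cassandra source. */
public class ReplaceLiveCheckSketch {

    /**
     * Point 1: the check as the issue describes it. The replacement fails when
     * the last gossip update for the old node is more recent than RING_DELAY,
     * regardless of how long the joining node actually waited.
     */
    public static boolean failsRecencyCheck(long millisSinceLastGossipUpdate, long ringDelayMillis) {
        return millisSinceLastGossipUpdate < ringDelayMillis;
    }

    /**
     * Point 2: the alternative argued for in the issue. The replacement fails
     * only when the local gossip view marks the old node as alive.
     */
    public static boolean failsOnlyIfKnownAlive(boolean oldNodeMarkedAlive) {
        return oldNodeMarkedAlive;
    }

    public static void main(String[] args) {
        // Numbers taken from the description: the last update was 59 seconds ago.
        long sinceUpdate = TimeUnit.SECONDS.toMillis(59);
        long typicalRingDelay = TimeUnit.SECONDS.toMillis(30);
        long raisedRingDelay = TimeUnit.SECONDS.toMillis(300);

        System.out.println(failsRecencyCheck(sinceUpdate, typicalRingDelay)); // false
        System.out.println(failsRecencyCheck(sinceUpdate, raisedRingDelay));  // true
        System.out.println(failsOnlyIfKnownAlive(false));                     // false
    }
}
```

With a 30s ring delay, an update seen 59 seconds ago is old enough to pass, but with ring delay raised to 300s the same update always falls inside the window, because the node only waited about 60 seconds before checking. The second predicate passes in the situation the issue describes, where the old node is in the DN state on every node.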
> A relevant example of a log showing the first bug (in this case the node that was being replaced was drained moving it to shutdown before replacement, and ring delay was forced to 300s because the cluster is very large):
> {noformat}
> INFO [ScheduledTasks:1] 2017-10-17 06:41:21,325 StorageService.java:189 - Overriding RING_DELAY to 300000ms
> INFO [main] 2017-10-17 06:41:24,240 StorageService.java:508 - Gathering node replacement information for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:25,198 Gossiper.java:1032 - Node /OLD_ADDRESS is now part of the cluster
> INFO [GossipStage:1] 2017-10-17 06:41:25,200 Gossiper.java:1011 - InetAddress /OLD_ADDRESS is now DOWN
> INFO [main] 2017-10-17 06:41:25,617 StorageService.java:1164 - JOINING: waiting for ring information
> INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING: schema complete, ready to bootstrap
> INFO [main] 2017-10-17 06:41:29,618 StorageService.java:1164 - JOINING: waiting for pending range calculation
> INFO [main] 2017-10-17 06:41:29,718 StorageService.java:1164 - JOINING: calculation complete, ready to bootstrap
> INFO [GossipStage:1] 2017-10-17 06:41:30,606 Gossiper.java:1032 - Node /OLD_ADDRESS is now part of the cluster
> INFO [GossipStage:1] 2017-10-17 06:41:30,606 TokenMetadata.java:464 - Updating topology for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:30,607 TokenMetadata.java:464 - Updating topology for /OLD_ADDRESS
> INFO [GossipStage:1] 2017-10-17 06:41:30,614 Gossiper.java:1011 - InetAddress /OLD_ADDRESS is now DOWN
> ERROR [main] 2017-10-17 06:42:29,722 CassandraDaemon.java:583 - Exception encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}