[ 
https://issues.apache.org/jira/browse/CASSANDRA-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Jirsa updated CASSANDRA-13968:
-----------------------------------
    Component/s: Coordination

> Cannot replace a live node on large clusters
> --------------------------------------------
>
>                 Key: CASSANDRA-13968
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13968
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>         Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>              Labels: gossip
>         Attachments: 
> 0001-During-node-replacement-check-for-updates-in-the-tim.patch, 
> 0002-Only-fail-replacement-if-we-_know_-the-node-is-up.patch
>
>
> During forced node replacements we very frequently (~every time for large 
> clusters) see:
> {noformat}
> ERROR [main] 2017-10-17 06:54:35,680  CassandraDaemon.java:583 - Exception 
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
> The old node is dead, the new node that is replacing it thinks it is dead (DN 
> state), and all other nodes think it is dead (all have the DN state). 
> However, I believe there are two bugs in the "is live" check that can cause 
> this error, namely that:
> 1. We sleep for 
> [BROADCAST_INTERVAL|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L905]
>  (hardcoded 60s on 2.1, configurable on later versions but still 60s by 
> default), but 
> [check|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L919]
>  for an update in the last RING_DELAY seconds (typically set to 30s). When a 
> fresh node is joining, in my experience, [the 
> schema|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L859]
>  check almost immediately returns true after gossiping with seeds, so in 
> reality we do not even sleep for RING_DELAY. If operators increase ring delay 
> past broadcast_interval (as you might do if you think you are a victim of the 
> second bug below), then you are guaranteed to hit the exception, because the 
> gossip update is essentially guaranteed to have happened within the last 
> RING_DELAY seconds: you only slept for the broadcast interval, not for 
> RING_DELAY. For example, if an operator sets ring delay to 300s, the check 
> says "oh yeah, the last update was 59 seconds ago, which is more recent than 
> 300s, so fail" (see the sketch after the second point below).
> 2. We don't actually check that the node is alive; we just check that a 
> gossip update has happened in the last X seconds. Sometimes with large 
> clusters nodes are still converging on the proper generation/version of a 
> dead node, and the "is live" check prevents an operator from replacing the 
> node until gossip has settled on the cluster regarding the dead node, which 
> for large clusters can take a really long time. This can really hurt 
> availability in cloud environments, and every time I've seen this error, the 
> new node believes that the old node is down (since 
> [markAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L954]
>  [marks 
> dead|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L962]
>  first and then triggers a callback to 
> [realMarkAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L975]
>  which never fires because the old node is actually down).
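> To make this concrete, here is a minimal, self-contained sketch of the check 
> as I understand it. The class, constant, and variable names are illustrative 
> (not the real StorageService/Gossiper fields); only the shape of the logic 
> matters: we sleep for one interval (the broadcast interval) but reject based 
> on another (RING_DELAY), and a recent gossip update about a *dead* node is 
> enough to trip the rejection.
> {noformat}
> // Illustrative sketch only -- not the actual Cassandra code.
> import java.util.concurrent.TimeUnit;
>
> public class ReplaceLivenessSketch
> {
>     // What we actually sleep for (BROADCAST_INTERVAL, 60s on 2.1)...
>     static final long SLEEP_MILLIS = 60_000;
>     // ...versus the window the "is live" check uses (RING_DELAY, raised to 300s here).
>     static final long RING_DELAY_MILLIS = 300_000;
>
>     public static void main(String[] args) throws InterruptedException
>     {
>         // A seed gossips the *dead* node's state to us; that alone refreshes the timestamp.
>         long lastGossipUpdateNanos = System.nanoTime();
>
>         // Shortened so the sketch finishes quickly; the real code sleeps the full interval.
>         Thread.sleep(Math.min(SLEEP_MILLIS, 1000));
>
>         // Bug 1: with RING_DELAY > broadcast interval, the update received during the
>         // sleep is always "within the last RING_DELAY seconds", so we always reject.
>         // Bug 2: this only means "gossip touched this endpoint recently", not "it is up".
>         long sinceUpdateNanos = System.nanoTime() - lastGossipUpdateNanos;
>         if (sinceUpdateNanos < TimeUnit.MILLISECONDS.toNanos(RING_DELAY_MILLIS))
>             throw new UnsupportedOperationException("Cannot replace a live node...");
>     }
> }
> {noformat}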
> I think that #1 is definitely a bug, while #2 might be considered an extra 
> "safety" feature (you don't allow replacement during gossip convergence), but 
> given that the operator took the effort to supply the replace_address flag, I 
> think it's prudent to only fail if we really know something is wrong.
> I've attached two patches against 2.1, one that fixes bug #1 and one that 
> fixes (imo) bug #2. For #1, I was thinking we may want to prevent the schema 
> check from exiting the RING_DELAY sleep early, but maybe it's just better to 
> backport the configurable broadcast_interval and pick the maximum of the two 
> intervals. If we don't like the way I've worked around #2, maybe I could make 
> it an option that operators could turn on if they wanted? If folks are happy 
> with the approach, I can attach patches for 2.2, 3.0, and 3.11.
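> To show what I mean by only failing when we _know_ the node is up, here is a 
> self-contained sketch of the idea behind the second patch. The LivenessSource 
> interface is purely illustrative (the real patch works against the gossiper / 
> failure detector and may differ in detail):
> {noformat}
> // Illustrative sketch of the idea in patch 0002 -- not the actual patch.
> import java.net.InetAddress;
>
> final class ReplaceCheckSketch
> {
>     // Stand-in for "do we positively believe this endpoint is UP?"; illustrative only.
>     interface LivenessSource
>     {
>         boolean isAlive(InetAddress endpoint);
>     }
>
>     // Reject the replacement only when the old node is positively believed to be up,
>     // instead of whenever any gossip update about it was seen in the last X seconds.
>     static void checkReplaceable(LivenessSource detector, InetAddress oldNode)
>     {
>         if (detector.isAlive(oldNode))
>             throw new UnsupportedOperationException("Cannot replace a live node...");
>     }
>
>     public static void main(String[] args) throws Exception
>     {
>         InetAddress oldNode = InetAddress.getByName("127.0.0.2"); // placeholder for OLD_ADDRESS
>         LivenessSource everyoneSeesItDown = endpoint -> false;    // DN everywhere, as in the report
>         checkReplaceable(everyoneSeesItDown, oldNode);            // passes, so replacement proceeds
>         System.out.println("replacement allowed for " + oldNode);
>     }
> }
> {noformat}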
> A relevant example of a log showing the first bug (in this case the node 
> being replaced was drained, moving it to the shutdown state, before 
> replacement, and ring delay was forced to 300s because the cluster is very 
> large):
> {noformat}
> INFO  [ScheduledTasks:1] 2017-10-17 06:41:21,325  StorageService.java:189 - 
> Overriding RING_DELAY to 300000ms
> INFO  [main] 2017-10-17 06:41:24,240  StorageService.java:508 - Gathering 
> node replacement information for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:25,198  Gossiper.java:1032 - Node 
> /OLD_ADDRESS is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:25,200  Gossiper.java:1011 - 
> InetAddress /OLD_ADDRESS is now DOWN
> INFO  [main] 2017-10-17 06:41:25,617  StorageService.java:1164 - JOINING: 
> waiting for ring information
> INFO  [main] 2017-10-17 06:41:29,618  StorageService.java:1164 - JOINING: 
> schema complete, ready to bootstrap
> INFO  [main] 2017-10-17 06:41:29,618  StorageService.java:1164 - JOINING: 
> waiting for pending range calculation
> INFO  [main] 2017-10-17 06:41:29,718  StorageService.java:1164 - JOINING: 
> calculation complete, ready to bootstrap
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606  Gossiper.java:1032 - Node 
> /OLD_ADDRESS is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606  TokenMetadata.java:464 - 
> Updating topology for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:30,607  TokenMetadata.java:464 - 
> Updating topology for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:30,614  Gossiper.java:1011 - 
> InetAddress /OLD_ADDRESS is now DOWN
> ERROR [main] 2017-10-17 06:42:29,722  CassandraDaemon.java:583 - Exception 
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}



