[ 
https://issues.apache.org/jira/browse/CASSANDRA-13968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214123#comment-16214123
 ] 

Joseph Lynch commented on CASSANDRA-13968:
------------------------------------------

That would be awesome, and if you don't like the way I've fixed it let me know 
and I'll submit new patches :-)

For now we've got a reasonable workaround. For the first bug we just set ring 
delay < ~50s and monitor that streaming happens properly (the major reason we 
set it so high on some of the large clusters was to avoid streaming issues 
similar to those described in CASSANDRA-11848; I haven't encountered those 
streaming issues but when I do I'll try to figure out what's going on there as 
well). For the second issue we just wait for gossip convergence on the dead 
node before beginning the replace, which is somewhat of a bummer but not a big 
deal.
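
For reference, the timing mismatch behind the first bug can be sketched as below. The class, constants, and method are illustrative stand-ins, not Cassandra's actual code; the point is only that sleeping for BROADCAST_INTERVAL while checking a RING_DELAY-sized window guarantees failure once RING_DELAY exceeds the sleep:

```java
// Hypothetical sketch of the "is live" timing check; names mirror
// Cassandra's constants but this is not the real implementation.
public class ReplaceCheckSketch {
    // On 2.1 the pre-check sleep is hardcoded to 60s.
    static final long BROADCAST_INTERVAL_MS = 60_000;

    /** Returns true ("node looks live", so the replace fails) when the last
     *  gossip update for the dead node falls inside the RING_DELAY window. */
    static boolean replacementFails(long ringDelayMs, long updateAgeAtCheckMs) {
        return updateAgeAtCheckMs < ringDelayMs;
    }

    public static void main(String[] args) {
        // The joining node gossips with seeds almost immediately, so the
        // "last update" is roughly BROADCAST_INTERVAL old when the check runs.
        long updateAge = BROADCAST_INTERVAL_MS;

        // Default RING_DELAY (30s): the ~60s-old update is outside the
        // window, so the replace proceeds.
        System.out.println(replacementFails(30_000, updateAge));   // false

        // RING_DELAY raised to 300s: the ~60s-old update always counts as
        // "recent", so the replace is guaranteed to throw
        // "Cannot replace a live node".
        System.out.println(replacementFails(300_000, updateAge));  // true
    }
}
```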

> Cannot replace a live node on large clusters
> --------------------------------------------
>
>                 Key: CASSANDRA-13968
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13968
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 2.1.17, Ubuntu Trusty/Xenial (Linux 3.13, 4.4)
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>              Labels: gossip
>         Attachments: 
> 0001-During-node-replacement-check-for-updates-in-the-tim.patch, 
> 0002-Only-fail-replacement-if-we-_know_-the-node-is-up.patch
>
>
> During forced node replacements we very frequently (~every time for large 
> clusters) see:
> {noformat}
> ERROR [main] 2017-10-17 06:54:35,680  CassandraDaemon.java:583 - Exception 
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
> The old node is dead, the new node that is replacing it thinks it is dead (DN 
> state), and all other nodes think it is dead (all have the DN state). 
> However, I believe there are two bugs in the "is live" check that can cause 
> this error, namely that:
> 1. We sleep for 
> [BROADCAST_INTERVAL|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L905]
>  (hardcoded 60s on 2.1, on later version configurable but still 60s by 
> default), but 
> [check|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L919]
>  for an update in the last RING_DELAY seconds (typically set to 30s). When a 
> fresh node is joining, in my experience, [the 
> schema|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/service/StorageService.java#L859]
>  check almost immediately returns true after gossiping with seeds, so in 
> reality we do not even sleep for RING_DELAY. If an operator increases ring 
> delay past broadcast_interval (as you might do if you think you are a victim 
> of the second bug below), the exception is guaranteed: because you slept 
> only for the broadcast interval rather than RING_DELAY, the gossip update is 
> essentially guaranteed to have happened within the last RING_DELAY seconds. 
> For example, if an operator sets ring delay to 300s, the check says "oh 
> yeah, the last update was 59 seconds ago, which is sooner than 300s, so 
> fail".
> 2. We don't actually check that the node is alive, we just check that a 
> gossip update has happened in the last X seconds. Sometimes with large 
> clusters nodes are still converging on the proper generation/version of a 
> dead node, and the "is live" check prevents an operator from replacing the 
> node until gossip has settled on the cluster regarding the dead node, which 
> for large clusters can take a really long time. This can really hurt 
> availability in cloud environments, and every time I've seen this error the 
> new node believes that the old node is down (since 
> [markAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L954]
>  [marks 
> dead|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L962]
>  first and then fires a callback to 
> [realMarkAlive|https://github.com/apache/cassandra/blob/943db2488c8b62e1fbe03b132102f0e579c9ae17/src/java/org/apache/cassandra/gms/Gossiper.java#L975]
> , which never fires because the old node is actually down).
> I think that #1 is definitely a bug; #2 might be considered an extra 
> "safety" feature (that you don't allow replacement during gossip 
> convergence), but given that the operator took the effort to supply the 
> replace_address flag, I think it's prudent to only fail if we really know 
> something is wrong.
> I've attached two patches against 2.1, one that fixes bug #1 and one that 
> fixes (imo) bug #2. I was thinking for #1 that we may want to prevent the 
> schema check from exiting the RING_DELAY sleep early but maybe it's just 
> better to backport configurable broadcast_interval and pick the maximum or 
> something. If we don't like the way I've worked around #2, maybe I could make 
> it an option that operators could turn on if they wanted? If folks are happy 
> with the approach I can attach patches for 2.2, 3.0, and 3.11.
> A relevant example of a log showing the first bug (in this case the node 
> being replaced was drained, moving it to the shutdown state before 
> replacement, and ring delay was forced to 300s because the cluster is very 
> large):
> {noformat}
> INFO  [ScheduledTasks:1] 2017-10-17 06:41:21,325  StorageService.java:189 - 
> Overriding RING_DELAY to 300000ms
> INFO  [main] 2017-10-17 06:41:24,240  StorageService.java:508 - Gathering 
> node replacement information for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:25,198  Gossiper.java:1032 - Node 
> /OLD_ADDRESS is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:25,200  Gossiper.java:1011 - 
> InetAddress /OLD_ADDRESS is now DOWN
> INFO  [main] 2017-10-17 06:41:25,617  StorageService.java:1164 - JOINING: 
> waiting for ring information
> INFO  [main] 2017-10-17 06:41:29,618  StorageService.java:1164 - JOINING: 
> schema complete, ready to bootstrap
> INFO  [main] 2017-10-17 06:41:29,618  StorageService.java:1164 - JOINING: 
> waiting for pending range calculation
> INFO  [main] 2017-10-17 06:41:29,718  StorageService.java:1164 - JOINING: 
> calculation complete, ready to bootstrap
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606  Gossiper.java:1032 - Node 
> /OLD_ADDRESS is now part of the cluster
> INFO  [GossipStage:1] 2017-10-17 06:41:30,606  TokenMetadata.java:464 - 
> Updating topology for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:30,607  TokenMetadata.java:464 - 
> Updating topology for /OLD_ADDRESS
> INFO  [GossipStage:1] 2017-10-17 06:41:30,614  Gossiper.java:1011 - 
> InetAddress /OLD_ADDRESS is now DOWN
> ERROR [main] 2017-10-17 06:42:29,722  CassandraDaemon.java:583 - Exception 
> encountered during startup
> java.lang.UnsupportedOperationException: Cannot replace a live node...
> {noformat}
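
As a hypothetical illustration of what the second patch is going for (the names are made up, not the real Gossiper or FailureDetector API): the old check treats any recent gossip update about the dead node as evidence of liveness, while the patched behavior only blocks the replace when we affirmatively know the node is up:

```java
// Illustrative-only sketch contrasting the two liveness checks; this does
// not use Cassandra's actual classes.
public class ReplaceLivenessSketch {
    enum NodeState { ALIVE, DEAD }

    /** Before: any recent gossip churn about the dead node blocks the
     *  replace, even while the cluster is still converging on its state. */
    static boolean blocksReplaceOld(long updateAgeMs, long ringDelayMs) {
        return updateAgeMs < ringDelayMs;
    }

    /** After: only an affirmative ALIVE verdict blocks the replace. */
    static boolean blocksReplaceNew(NodeState state) {
        return state == NodeState.ALIVE;
    }

    public static void main(String[] args) {
        // Large cluster still converging: gossip updates about the dead node
        // keep arriving, but we know it is DOWN.
        System.out.println(blocksReplaceOld(5_000, 30_000));   // true  (spurious failure)
        System.out.println(blocksReplaceNew(NodeState.DEAD));  // false (replace proceeds)
    }
}
```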



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
