[ https://issues.apache.org/jira/browse/CASSANDRA-7307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010428#comment-14010428 ]
Brandon Williams commented on CASSANDRA-7307:
---------------------------------------------

I'll note this is especially pernicious for replacement, since the time to mark the node down will always be longer than RING_DELAY, and overriding RING_DELAY to be long enough is annoying. Here's how long it takes with the initial value set to 3000ms:

{noformat}
 INFO 22:22:06,270 InetAddress /10.208.8.63 is now UP
 INFO 22:23:01,978 InetAddress /10.208.8.63 is now DOWN
{noformat}

Which is better, but still exceeds RING_DELAY, though overriding that to one minute or so is much more reasonable.

> New nodes mark dead nodes as up for 10 minutes
> ----------------------------------------------
>
>                 Key: CASSANDRA-7307
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7307
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Richard Low
>            Assignee: Brandon Williams
>             Fix For: 1.2.17
>
>
> When doing a node replacement when other nodes are down we see the down nodes
> marked as up for about 10 minutes. This means requests are routed to the dead
> nodes, causing timeouts. It also means replacing a node when multiple nodes
> from a replica set are down is extremely difficult - the node usually tries to
> stream from a dead node and the replacement fails.
>
> This isn't limited to host replacement. I did a simple test:
>
> 1. Create a 2 node cluster
> 2. Kill node 2
> 3. Start a 3rd node with a unique token (I used auto_bootstrap=false but I
> don't think this is significant)
>
> The 3rd node lists node 2 (127.0.0.2) as up for almost 10 minutes:
>
> {code}
> INFO [main] 2014-05-27 14:28:24,753 CassandraDaemon.java (line 119) Logging initialized
> INFO [GossipStage:1] 2014-05-27 14:28:31,492 Gossiper.java (line 843) Node /127.0.0.2 is now part of the cluster
> INFO [GossipStage:1] 2014-05-27 14:28:31,495 Gossiper.java (line 809) InetAddress /127.0.0.2 is now UP
> INFO [GossipTasks:1] 2014-05-27 14:37:44,526 Gossiper.java (line 823) InetAddress /127.0.0.2 is now DOWN
> {code}
>
> I reproduced on 1.2.15 and 1.2.16.
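
For reference, the UP-to-DOWN intervals in the two log excerpts can be worked out with a small script. This is only a rough sketch in Python: the helper name `seconds_between` is made up for illustration, and the 30-second figure for RING_DELAY is an assumption (the usual default; neither the comment nor the issue states the value in use).

```python
from datetime import datetime

# Assumption: RING_DELAY is at its usual default of 30 seconds.
# Neither the comment nor the issue text states the value in use.
ASSUMED_RING_DELAY_SECONDS = 30.0

# Time-of-day format used in the log lines, e.g. "22:22:06,270".
LOG_TIME_FORMAT = "%H:%M:%S,%f"


def seconds_between(up: str, down: str) -> float:
    """Elapsed seconds between two same-day log timestamps (hypothetical helper)."""
    start = datetime.strptime(up, LOG_TIME_FORMAT)
    end = datetime.strptime(down, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# (label, time marked UP, time marked DOWN), copied from the logs above.
observations = [
    ("initial value 3000ms", "22:22:06,270", "22:23:01,978"),
    ("original report (1.2.x)", "14:28:31,495", "14:37:44,526"),
]

for label, up, down in observations:
    elapsed = seconds_between(up, down)
    exceeds = elapsed > ASSUMED_RING_DELAY_SECONDS
    print(f"{label}: {elapsed:.3f}s before DOWN, exceeds assumed RING_DELAY: {exceeds}")
```

Under that assumption, the first pair comes to roughly 55.7 seconds and the second to roughly 553 seconds (a little over 9 minutes), which matches "almost 10 minutes" in the report and explains why overriding RING_DELAY to about one minute would cover the improved case.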