[ https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17839886#comment-17839886 ]
Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------

The node trying to replace. So in my reproduction steps:
# replace a node using '-Dcassandra.replace_address=44.239.237.152'
# while it's replacing, kill off cassandra
# wipe the cassandra folders
# start cassandra again, still using the replace address flag

After step 2, if I check 'nodetool gossipinfo', the node being replaced (44.239.237.152 in this example) has a status of hibernate. During step 4 the other nodes will say 'Not marking /44.239.237.152 alive due to dead state'.

I did a whole bunch of testing of this yesterday and this is the key issue as far as I can tell. Because the replacing node is in hibernate, the other nodes won't send it a SYN (see maybeGossipToUnreachableMember, which filters out endpoints in a dead state). And without the SYN message the replacing node never gets the gossip state of the cluster: its own SYN messages carry only itself as a digest, so the ACK replies to those don't include other nodes.

> Unable to contact any seeds with node in hibernate status
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-19580
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I
> have been able to reproduce this issue if I kill Cassandra as it's joining,
> which will put the node into hibernate status. Once a node is in hibernate it
> will no longer receive any SYN messages from other nodes during startup, and
> as it sends only itself as a digest in outbound SYN messages it never receives
> any states in any of the ACK replies. So once it gets to the `seenAnySeed`
> check, it fails because the endpointStateMap is empty.
>
> A workaround is copying the system.peers table from another node, but this is
> less than ideal.
>
> I tested modifying maybeGossipToSeed as follows:
> {code:java}
>     /* Possibly gossip to a seed for facilitating partition healing */
>     private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
>     {
>         int size = seeds.size();
>         if (size > 0)
>         {
>             if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>             {
>                 return;
>             }
>             if (liveEndpoints.size() == 0)
>             {
>                 List<GossipDigest> gDigests = prod.payload.gDigests;
>                 if (gDigests.size() == 1 && gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 {
>                     /* Our SYN lists only ourselves: send an empty digest list instead */
>                     gDigests = new ArrayList<GossipDigest>();
>                     GossipDigestSyn digestSynMessage = new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                                                            DatabaseDescriptor.getPartitionerName(),
>                                                                            gDigests);
>                     MessageOut<GossipDigestSyn> message = new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                                                           digestSynMessage,
>                                                                                           GossipDigestSyn.serializer);
>                     sendGossip(message, seeds);
>                 }
>                 else
>                 {
>                     sendGossip(prod, seeds);
>                 }
>             }
>             else
>             {
>                 /* Gossip with the seed with some probability. */
>                 double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>                 double randDbl = random.nextDouble();
>                 if (randDbl <= probability)
>                     sendGossip(prod, seeds);
>             }
>         }
>     }
> {code}
> The only problem is that this is the same as the SYN from the shadow round.
> It does resolve the issue, however, as the node then receives an ACK with all
> the states.
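
To illustrate why a SYN that lists only the sender gets nothing useful back, while an empty digest list gets everything, here is a toy Java model. It is not Cassandra code: the class name GossipAckSketch, the method statesForAck and the 10.0.0.x addresses are made up for this sketch. It ignores generations and the request half of the exchange, and it assumes the receiver only considers endpoints named in the digest list and treats an empty list as a request for every state, as described above for the shadow round.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of how a SYN receiver picks the states to put in its ACK. Not Cassandra code. */
class GossipAckSketch
{
    /**
     * @param synDigests    endpoint -> highest version the SYN sender claims to know
     * @param localVersions endpoint -> highest version the receiver holds
     * @return endpoint -> version for every state the receiver would send back
     */
    public static Map<String, Integer> statesForAck(Map<String, Integer> synDigests,
                                                    Map<String, Integer> localVersions)
    {
        Map<String, Integer> digests = new LinkedHashMap<>(synDigests);
        if (digests.isEmpty())
        {
            // Empty digest list (assumed shadow-round behaviour):
            // treat it as asking about every known endpoint at version 0.
            for (String endpoint : localVersions.keySet())
                digests.put(endpoint, 0);
        }

        Map<String, Integer> ack = new TreeMap<>();
        for (Map.Entry<String, Integer> digest : digests.entrySet())
        {
            Integer local = localVersions.get(digest.getKey());
            // Only endpoints named in the digest list are considered at all.
            if (local != null && local > digest.getValue())
                ack.put(digest.getKey(), local);
        }
        return ack;
    }

    public static void main(String[] args)
    {
        Map<String, Integer> seedView = Map.of("10.0.0.1", 12, "10.0.0.2", 7, "10.0.0.3", 5);

        // The starting node lists only itself; the seed has nothing newer for that endpoint.
        System.out.println(statesForAck(Map.of("10.0.0.9", 3), seedView));

        // Empty digest list: the seed returns every state it holds.
        System.out.println(statesForAck(Map.of(), seedView));
    }
}
```

In this model the first call returns an empty map, which corresponds to the node never learning about the rest of the cluster, and the second returns all three seed-side entries, which corresponds to the ACK seen after the maybeGossipToSeed change.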