[ 
https://issues.apache.org/jira/browse/CASSANDRA-16213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231736#comment-17231736
 ] 

David Capwell commented on CASSANDRA-16213:
-------------------------------------------

Found the issue: it was caused by CASSANDRA-15158, which creates a config value
in milliseconds and calls a delay that takes milliseconds, but converts the
millis as if they were seconds, causing a much longer delay than expected.
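To illustrate the kind of unit mismatch described above, here is a minimal sketch. It is not the actual CASSANDRA-15158 code: the class, method names, and the 30,000 ms value are hypothetical, and only java.util.concurrent.TimeUnit is a real API.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of a milliseconds value being converted as seconds.
final class DelayUnitMismatch {
    // Assumed example value: a delay that is already expressed in milliseconds.
    static final long CONFIGURED_DELAY_MS = 30_000L;

    // Buggy: the input is in milliseconds, but it is converted as if it were
    // seconds, so the resulting delay is 1000x longer than intended.
    static long buggyDelayMillis(long configuredMillis) {
        return TimeUnit.SECONDS.toMillis(configuredMillis);
    }

    // Fixed: name the unit the value actually carries.
    static long fixedDelayMillis(long configuredMillis) {
        return TimeUnit.MILLISECONDS.toMillis(configuredMillis);
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + buggyDelayMillis(CONFIGURED_DELAY_MS) + " ms");
        System.out.println("fixed: " + fixedDelayMillis(CONFIGURED_DELAY_MS) + " ms");
    }
}
```

With the assumed 30 second setting, the buggy conversion yields a delay of roughly 8.3 hours instead of 30 seconds.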

Once I fixed that, I hit the next issue: we now block waiting on schema, which
will fail since the cluster has a downed node.

{code}
case SCHEMA:
    SystemKeyspace.updatePeerInfo(endpoint, "schema_version", UUID.fromString(value.value));
    MigrationCoordinator.instance.reportEndpointVersion(endpoint, UUID.fromString(value.value));
    break;
{code}

When we get the gossip info from the peers, it will include node2 (the node
that crashed abruptly), and the replacing node will wait until it gets the
schema node2 reported, but this won't happen since node2 is down and we are
replacing it.

This looks unrelated to this patch, but it is also a bad condition, as any
schema change with a downed node will cause nodes to fail to start up.
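The schema wait described above can be pictured with a small sketch. This is not Cassandra's MigrationCoordinator: the class, method, and endpoint names are hypothetical, and the model is simplified to "a version can only be fetched from a live endpoint that reports it". Under that assumption, a version reported only by a downed node can never be received, so a wait for all reported versions cannot complete.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical model of waiting on schema versions learned through gossip.
final class SchemaWaitSketch {
    // Returns the reported schema versions that no live endpoint can serve.
    // If this set is non-empty, waiting for every reported version never ends.
    static Set<UUID> unreachableVersions(Map<String, UUID> reportedByEndpoint,
                                         Set<String> liveEndpoints,
                                         UUID localVersion) {
        Set<UUID> reachable = new HashSet<>();
        reachable.add(localVersion); // we already have our own schema
        for (Map.Entry<String, UUID> e : reportedByEndpoint.entrySet()) {
            if (liveEndpoints.contains(e.getKey())) {
                reachable.add(e.getValue());
            }
        }
        Set<UUID> unreachable = new HashSet<>(reportedByEndpoint.values());
        unreachable.removeAll(reachable);
        return unreachable;
    }

    public static void main(String[] args) {
        UUID current = UUID.randomUUID();
        UUID stale = UUID.randomUUID(); // version node2 had when it crashed
        Map<String, UUID> reported = Map.of("node1", current, "node2", stale, "node3", current);
        Set<String> live = Set.of("node1", "node3"); // node2 is down
        System.out.println("unreachable versions: " + unreachableVersions(reported, live, current));
    }
}
```

In this model the set is empty only when every reported version is held locally or by at least one live endpoint, which is why a schema change made while node2 was down leaves the startup wait stuck.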

> Cannot replace_address /X because it doesn't exist in gossip
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-16213
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16213
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip, Cluster/Membership
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> We see this exception when nodes crash and we try to do a host
> replacement; the error appears to be correlated with multiple node
> failures.
> A simplified case to trigger this is the following:
> *) Have an N node cluster
> *) Shut down all N nodes
> *) Bring up N-1 nodes (at least 1 seed, else replace seed)
> *) Host replace the N-1th node -> this will fail with the above
> The reason this happens is that the N-1th node isn’t gossiping anymore, and 
> the existing nodes do not have its details in gossip (but have the details in 
> the peers table), so the host replacement fails as the node isn’t known in 
> gossip.
> This affects all versions (tested 3.0 and trunk, assume 2.2 as well)


