[ https://issues.apache.org/jira/browse/CASSANDRA-16159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271662#comment-17271662 ]

Jon Meredith commented on CASSANDRA-16159:
------------------------------------------

Hey [~e.dimitrova], I'd love to see this fixed. I probably won't have time to 
work on it in the next week or two, so if you have cycles then you're welcome 
to take it.

My plan for the fix was for nodes started with {{replace_address_first_boot}} 
to delay responding to gossip messages until they have completed a real gossip 
round that contains the host they are replacing, so that when a second 
replacement node contacts the first replacement node it doesn't receive a 
gossip state it cannot process.
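
Roughly, the gate I have in mind could look something like the sketch below. 
The class, field, and method names are made up for illustration; this is only 
the shape of the idea, not an actual patch.

{code:java}
import org.apache.cassandra.gms.Gossiper;
import org.apache.cassandra.locator.InetAddressAndPort;

// Sketch only: a hypothetical gate for a node started with
// replace_address_first_boot. Names here are illustrative, not part of a patch.
final class ReplacementGossipGate
{
    // The host being replaced, taken from replace_address_first_boot at startup.
    private final InetAddressAndPort pendingReplaceAddress;
    private volatile boolean replacedHostSeenInLiveGossip = false;

    ReplacementGossipGate(InetAddressAndPort pendingReplaceAddress)
    {
        this.pendingReplaceAddress = pendingReplaceAddress;
    }

    /**
     * True once a real (non-shadow) gossip round has delivered state for the
     * host being replaced. Until then, gossip replies would be deferred, so a
     * second replacement node contacting this one never receives a state it
     * cannot process.
     */
    boolean canRespondToGossip()
    {
        if (!replacedHostSeenInLiveGossip
            && Gossiper.instance.getEndpointStateForEndpoint(pendingReplaceAddress) != null)
            replacedHostSeenInLiveGossip = true;
        return replacedHostSeenInLiveGossip;
    }
}
{code}

The gossip SYN/ACK verb handlers would then consult {{canRespondToGossip()}} 
and defer (or drop) their replies until it returns true.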

A better long-term fix would be to reuse more of the shadow gossip round 
state, but it's close to release and altering gossip mechanics feels higher 
risk than I'm comfortable with.

> Reduce the Severity of Errors Reported in FailureDetector#isAlive()
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-16159
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16159
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Caleb Rackliffe
>            Assignee: Jon Meredith
>            Priority: Normal
>             Fix For: 4.0-beta
>
>
> Noticed the following error in the failure detector during a host replacement:
> {noformat}
> java.lang.IllegalArgumentException: Unknown endpoint: 10.38.178.98:7000
>       at org.apache.cassandra.gms.FailureDetector.isAlive(FailureDetector.java:281)
>       at org.apache.cassandra.service.StorageService.handleStateBootreplacing(StorageService.java:2502)
>       at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2182)
>       at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3145)
>       at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1242)
>       at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1368)
>       at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
>       at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:77)
>       at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:93)
>       at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:44)
>       at org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:884)
>       at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>       at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> {noformat}
> This particular error looks benign, given that even if it occurs, the node 
> continues to handle the {{BOOT_REPLACE}} state. There are two things we might 
> be able to do to improve {{FailureDetector#isAlive()}}, though:
> 1.) We don’t short-circuit in the case that the endpoint in question is in 
> quarantine after being removed. It may be useful to check for this so we can 
> avoid logging an ERROR when the endpoint is clearly doomed/dead. (Quarantine 
> works great when the gossip message is _from_ a quarantined endpoint, but in 
> this case that would be the new/replacing node, not the old/replaced one.)
> 2.) We can reduce the severity of the logging from ERROR to WARN and provide 
> better context around how to determine whether or not there’s actually a 
> problem. (ex. “If this occurs while trying to determine liveness for a node 
> that is currently being replaced, it can be safely ignored.”) See the sketch 
> below.
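
For illustration only, (1) and (2) might look roughly like the fragment below. 
The {{isQuarantined()}} lookup is an assumed helper rather than an existing 
{{Gossiper}} method, and the method's existing preamble checks are omitted; 
this is a sketch of the proposed changes, not the current implementation.

{code:java}
// Fragment: how FailureDetector#isAlive() might change under (1) and (2).
public boolean isAlive(InetAddressAndPort ep)
{
    // (1) Short-circuit for endpoints that were removed and are still in
    // quarantine: they are known to be dead, so nothing needs to be logged.
    if (Gossiper.instance.isQuarantined(ep)) // assumed helper, not an existing API
        return false;

    EndpointState epState = Gossiper.instance.getEndpointStateForEndpoint(ep);
    if (epState == null)
    {
        // (2) WARN instead of ERROR, with context on when it is harmless.
        logger.warn("Unknown endpoint {}. If this occurs while trying to determine " +
                    "liveness for a node that is currently being replaced, it can be " +
                    "safely ignored.", ep);
        return false;
    }
    return epState.isAlive();
}
{code}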


