[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17767358#comment-17767358
 ] 

Cameron Zemek commented on CASSANDRA-18845:
-------------------------------------------

 
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO  
org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG 
org.apache.cassandra.gms.Gossiper Sending a EchoMessage to 
/35.83.14.80{noformat}
I am struggling to reproduce this ^ I seen it twice, and after enabling more 
logging haven't been able to reproduce again.

 

What I do sometimes see though it taking over 30 seconds to get the first ECHO 
response. Since there are dtests that rely on having CQL up while nodes are 
down, I have attached a patch [^18845-seperate.patch] (against 5.0 branch) that 
is opt-in. Having settle just check for currentLive == liveSize is still 
allowing NTR to start while nodes are marked down. Yes you can increase 
cassandra.gossip_settle_poll_success_required (and/or the other properties) to 
mitigate it but these increase the minimum startup time. Whereas 
[^18845-seperate.patch] doesn't add to this when the cluster is healthy.

 

A more elaborate solution would be to specify the required consistency level. 
And for all token ranges owned by the node you check if you have the needed 
live endpoints to satisfy the consistency level.

> Waiting for gossip to settle on live endpoints
> ----------------------------------------------
>
>                 Key: CASSANDRA-18845
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: 18845-seperate.patch, delay.log, example.log, 
> image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>
> This is a follow up to CASSANDRA-18543
> Although that ticket added ability to set cassandra.gossip_settle_min_wait_ms 
> this is tedious and error prone. On a node just observed a 79 second gap 
> between waiting for gossip and the first echo response to indicate a node is 
> UP.
> The problem being that do not want to start Native Transport until gossip 
> settles otherwise queries can fail consistency such as LOCAL_QUORUM as it 
> thinks the replicas are still in DOWN state.
> Instead of having to set gossip_settle_min_wait_ms I am proposing that 
> (outside single node cluster) wait for UP message from another node before 
> considering gossip as settled. Eg.
> {code:java}
>             if (currentSize == epSize && currentLive == liveSize && liveSize 
> > 1)
>             {
>                 logger.debug("Gossip looks settled.");
>                 numOkay++;
>             } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to