[ 
https://issues.apache.org/jira/browse/CASSANDRA-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731479#comment-17731479
 ] 

Stefan Miklosovic commented on CASSANDRA-18543:
-----------------------------------------------

builds:

3.11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2423/workflows/26a6636a-3b8a-4ccd-9d18-192f8945d42a
4.0 j11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2328/workflows/c193288a-7c33-4507-bd92-c7d859c380a2
4.0 j8 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2328/workflows/bd9a4ef4-0156-428e-8b77-701f5affcccf
4.1 j11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2416/workflows/b308293d-082e-4700-89cc-e0f1971c82cb
4.1 j8 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2416/workflows/ae61881c-eb8d-4572-92e5-3070ac560f4e
trunk j11 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2417/workflows/56e01f19-4b96-46e2-bdc7-b3d731bab2ab
trunk j8 
https://app.circleci.com/pipelines/github/instaclustr/cassandra/2417/workflows/c66ecba4-cfc7-4ccc-87e2-85079db8700d

branches:
3.11 https://github.com/instaclustr/cassandra/commits/CASSANDRA-18543-3.11
4.0 https://github.com/instaclustr/cassandra/commits/CASSANDRA-18543-4.0
4.1 https://github.com/instaclustr/cassandra/commits/CASSANDRA-18543-4.1
trunk https://github.com/instaclustr/cassandra/commits/CASSANDRA-18543-trunk


> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-18543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 3.11.x, 4.0.x, 4.1.x, 5.x
>
>         Attachments: gossip.patch, gossip4.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When a node starts, it gets endpoint states (via the shadow round) but has
> all nodes marked as down. The problem is that the wait to settle only checks
> that the size of endpoint states is stable before starting native transport.
> Once native transport starts, the node receives queries and fails consistency
> levels such as LOCAL_QUORUM, since it still thinks nodes are down.
> This is a problem for a number of our customers' large clusters. The cluster
> has quorum, but because of this issue a node restart causes a bunch of query
> errors.
> My initial solution was to check the live endpoints size in addition to the
> size of endpoint states. This worked, but while testing the fix I noticed
> that there is also a lot of duplicated checking of the same node (via Echo
> messages) for liveness. So the patch also removes this duplicated check that
> a node is UP in markAlive.
> The final problem I found while testing is that sometimes I could still not
> see a change in live endpoints due to the 1-second polling, so the patch
> allows overriding the settle parameters. I could not reliably reproduce this,
> but I think it's worth providing a way to override these hardcoded values.
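
The settle check described above can be sketched roughly as follows. This is
only an illustration and not the code from the attached patches: the class
name, method name, and parameters (pollIntervalMillis, requiredStablePolls,
maxPolls) are hypothetical, and the two counts are passed in as suppliers
instead of being read from the Gossiper. It shows the idea of requiring both
the endpoint state count and the live endpoint count to stay unchanged for
several consecutive polls, with the polling values passed in by the caller
instead of being hardcoded.

```java
import java.util.function.IntSupplier;

public final class GossipSettleSketch {
    /**
     * Polls both counts until neither has changed for requiredStablePolls
     * consecutive polls. Returns the number of polls performed, or -1 if
     * maxPolls was reached (or the thread was interrupted) first.
     * All names here are hypothetical, not Cassandra APIs.
     */
    static int waitToSettle(IntSupplier endpointStateCount,
                            IntSupplier liveEndpointCount,
                            long pollIntervalMillis,
                            int requiredStablePolls,
                            int maxPolls) {
        int lastStates = -1;
        int lastLive = -1;
        int stablePolls = 0;
        for (int poll = 1; poll <= maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
            int states = endpointStateCount.getAsInt();
            int live = liveEndpointCount.getAsInt();
            if (states == lastStates && live == lastLive) {
                // Nothing changed since the previous poll.
                stablePolls++;
                if (stablePolls >= requiredStablePolls) {
                    return poll;
                }
            } else {
                // Either count moved: remember the new values and start over.
                stablePolls = 0;
                lastStates = states;
                lastLive = live;
            }
        }
        return -1;
    }
}
```

If the live endpoint count is left out of the comparison (for example by
passing a constant supplier), the loop settles as soon as the endpoint state
count stops changing, even while nodes are still being marked up, which is
the behaviour the issue describes.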



