[ https://issues.apache.org/jira/browse/CASSANDRA-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stefan Miklosovic updated CASSANDRA-18543:
------------------------------------------
     Bug Category: Parent values: Availability(12983)
       Complexity: Normal
      Component/s: Cluster/Gossip
    Discovered By: User Report
    Fix Version/s: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.x
         Severity: Normal
         Assignee: Stefan Miklosovic
           Status: Open  (was: Triage Needed)

> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-18543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Cameron Zemek
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.x
>
>         Attachments: gossip.patch, gossip4.patch
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> When a node starts, it gets endpoint states (via the shadow round) but has
> all nodes marked as down. The problem is that the wait for gossip to settle
> only checks that the size of endpoint states is stable before starting the
> native transport. Once the native transport starts, the node receives queries
> and fails consistency levels such as LOCAL_QUORUM because it still thinks
> the other nodes are down.
> This is a problem for a number of our customers' large clusters. The cluster
> has quorum, but due to this issue a node restart causes a burst of query
> errors.
> My initial solution was to check the size of live endpoints in addition to
> the size of endpoint states. This worked, but while testing the fix I noticed
> that there is also a lot of duplicated checking of the same node (via Echo
> messages) for liveness. So the patch also removes this duplicated checking
> that a node is UP in markAlive.
> The final problem I found while testing is that sometimes I still could not
> see a change in live endpoints because polling happens only once per second,
> so the patch allows overriding the settle parameters.
> I could not reliably reproduce this, but I think it's worth providing a way
> to override these hardcoded values.
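
The settle check described in the issue can be sketched roughly as follows. This is a minimal illustration, not the content of gossip.patch or gossip4.patch and not Cassandra's actual implementation: the class name, method name, and all parameters are hypothetical, and the two counts are passed in as plain suppliers rather than read from the gossiper. The sketch treats gossip as settled only once both the number of endpoint states and the number of live endpoints have stayed unchanged for a configurable number of consecutive polls, with the poll interval and limits supplied by the caller instead of being hardcoded.

```java
import java.util.function.IntSupplier;

/** Illustrative sketch only; all names here are hypothetical. */
final class GossipSettleSketch {

    /**
     * Polls both counts until neither has changed for requiredStablePolls
     * consecutive polls, or until maxPolls polls have been made.
     *
     * @return true if both counts settled, false if the poll limit was hit
     *         or the waiting thread was interrupted
     */
    static boolean waitToSettle(IntSupplier endpointStateCount,
                                IntSupplier liveEndpointCount,
                                long pollIntervalMillis,
                                int requiredStablePolls,
                                int maxPolls) {
        int lastStates = endpointStateCount.getAsInt();
        int lastLive = liveEndpointCount.getAsInt();
        int stablePolls = 0;

        for (int poll = 0; poll < maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }

            int states = endpointStateCount.getAsInt();
            int live = liveEndpointCount.getAsInt();

            if (states == lastStates && live == lastLive) {
                // Both counts unchanged since the previous poll.
                if (++stablePolls >= requiredStablePolls) {
                    return true;
                }
            } else {
                // Either count moved: restart the stability window.
                stablePolls = 0;
                lastStates = states;
                lastLive = live;
            }
        }
        return false;
    }
}
```

Note that a check of this shape only detects stability. If the live endpoint count stays at zero for the whole window, the sketch would still report the node as settled, so the poll interval and the number of required stable polls matter for clusters where liveness is established slowly.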