[ https://issues.apache.org/jira/browse/CASSANDRA-18543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Cameron Zemek updated CASSANDRA-18543:
--------------------------------------
    Attachment:     (was: gossip4.patch)

> Waiting for gossip to settle does not wait for live endpoints
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-18543
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18543
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Cameron Zemek
>            Priority: Normal
>         Attachments: gossip.patch, gossip4.patch
>
>
> When a node starts, it gets endpoint states (via the shadow round) but has
> all nodes marked as down. The problem is that the wait to settle only checks
> that the size of endpoint states is stable before starting native transport.
> Once native transport starts, the node receives queries and fails consistency
> levels such as LOCAL_QUORUM because it still thinks nodes are down.
>
> This is a problem for a number of large clusters for our customers. The
> cluster has quorum, but due to this issue a node restart causes a bunch of
> query errors.
>
> My initial solution was to also check the size of live endpoints, in
> addition to the size of endpoint states. This worked, but while testing the
> fix I noticed that there was also a lot of duplication in checking the same
> node (via Echo messages) for liveness. So the patch also removes this
> duplicated check that a node is UP in markAlive.
>
> The final problem I found while testing is that the wait sometimes still did
> not see a change in live endpoints because it only polls every 1 second, so
> the patch allows overriding the settle parameters. I could not reliably
> reproduce this, but I think it's worth providing a way to override these
> hardcoded values.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
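
For readers who want the idea in the description in concrete form, here is a minimal sketch of a settle loop that waits for both the endpoint-state count and the live-endpoint count to stop changing, with the poll interval and thresholds passed in rather than hardcoded. It is not the attached patch and not Cassandra's actual code: the class name `GossipSettleSketch`, the method `waitToSettle`, and all of its parameters are hypothetical, and the two counts are supplied by the caller instead of being read from the gossiper.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration only; names and parameters are assumptions,
// not Cassandra APIs.
public final class GossipSettleSketch {

    private GossipSettleSketch() {}

    /**
     * Polls both counts until they are unchanged for requiredStablePolls
     * consecutive polls. Returns true if that happens within maxPolls polls,
     * false if the poll budget runs out or the thread is interrupted.
     */
    public static boolean waitToSettle(IntSupplier endpointStateCount,
                                       IntSupplier liveEndpointCount,
                                       long pollIntervalMillis,
                                       int requiredStablePolls,
                                       int maxPolls) {
        int lastStates = endpointStateCount.getAsInt();
        int lastLive = liveEndpointCount.getAsInt();
        int stablePolls = 0;

        for (int poll = 0; poll < maxPolls; poll++) {
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and report "not settled".
                Thread.currentThread().interrupt();
                return false;
            }

            int states = endpointStateCount.getAsInt();
            int live = liveEndpointCount.getAsInt();

            if (states == lastStates && live == lastLive) {
                // Neither count moved since the previous poll.
                stablePolls++;
                if (stablePolls >= requiredStablePolls) {
                    return true;
                }
            } else {
                // Either count changed: restart the stability window.
                stablePolls = 0;
                lastStates = states;
                lastLive = live;
            }
        }
        return false;
    }
}
```

With a check on the endpoint-state count alone, a node whose state count is already stable would be treated as settled while peers are still being marked UP. Requiring the live count to be stable as well postpones that decision until liveness has stopped changing, and a configurable interval lets operators poll more or less often than a fixed 1 second.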