Goodness Ayinmode created CASSANDRA-19941:
---------------------------------------------
             Summary: Move network operations outside the lock in Gossiper$GossipTask
                 Key: CASSANDRA-19941
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19941
             Project: Cassandra
          Issue Type: Improvement
          Components: Cluster/Gossip
            Reporter: Goodness Ayinmode

To execute the gossip protocol and exchange state information with other nodes, _[GossipTask.run()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]_ invokes _[doGossipToLiveMember|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955]_, _[maybeGossipToUnreachableMember|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964]_, and _[maybeGossipToSeed|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982]_. All three methods call _[sendGossip()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]_ to send a gossip message to a randomly selected endpoint.

The interaction between GossipTask.run() and sendGossip() creates a potential synchronization bottleneck, because the lock (_[taskLock|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328]_) is held during network-bound operations.

GossipTask.run() also directly calls _[waitUntilListening()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326]_, which waits for the MessagingService to start listening; this can introduce delays if the messaging service is slow to start or has issues. In addition, if sendGossip() encounters network-related delays (e.g. network latency, timeouts, or slow or unresponsive nodes) in a cluster with a large number of nodes, taskLock could be held for longer periods. This may increase the risk of a backlog of waiting threads (if delays are frequent) and affect the scheduling of subsequent tasks.
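One possible shape for the proposed change is sketched below. It is not taken from Gossiper.java: the class GossipLockSketch, the Transport interface, the PendingSend holder, and the message format are all hypothetical names introduced for illustration. The idea is to decide what to send while holding the lock, release the lock, and only then perform the potentially slow network call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: every name here is hypothetical, not Cassandra's API.
class GossipLockSketch {
    /** Stand-in for the network layer; send() may block on a slow peer. */
    interface Transport {
        void send(String endpoint, String message);
    }

    /** Immutable description of one send, captured while the lock is held. */
    static final class PendingSend {
        final String endpoint;
        final String message;

        PendingSend(String endpoint, String message) {
            this.endpoint = endpoint;
            this.message = message;
        }
    }

    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<String> liveEndpoints;
    private final Transport transport;
    private final Random random;
    private int generation = 0; // shared state guarded by taskLock

    GossipLockSketch(List<String> liveEndpoints, Transport transport, Random random) {
        this.liveEndpoints = new ArrayList<>(liveEndpoints);
        this.transport = transport;
        this.random = random;
    }

    /** One gossip round: plan under the lock, send after releasing it. */
    void runOnce() {
        List<PendingSend> pending = new ArrayList<>();

        taskLock.lock();
        try {
            // Only in-memory work happens here: update state, pick a target,
            // and build the message from a consistent view of that state.
            generation++;
            if (!liveEndpoints.isEmpty()) {
                String target = liveEndpoints.get(random.nextInt(liveEndpoints.size()));
                pending.add(new PendingSend(target, "digest-gen-" + generation));
            }
        } finally {
            taskLock.unlock();
        }

        // The lock is released, so a slow or unresponsive peer delays only
        // this call and not other threads waiting on taskLock.
        for (PendingSend p : pending) {
            transport.send(p.endpoint, p.message);
        }
    }

    /** Exposed so callers can check that sends happen outside the lock. */
    boolean isTaskLockHeldByCurrentThread() {
        return taskLock.isHeldByCurrentThread();
    }
}
```

The trade-off in this sketch is that the message reflects the state as of the moment the lock was released, so a concurrent update may land between planning and sending. Whether that is acceptable for the real Gossiper would need to be checked against its actual invariants.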