Goodness Ayinmode created CASSANDRA-19941:
---------------------------------------------

             Summary: Move network operations outside the lock in Gossiper$GossipTask
                 Key: CASSANDRA-19941
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19941
             Project: Cassandra
          Issue Type: Improvement
          Components: Cluster/Gossip
            Reporter: Goodness Ayinmode


To execute the gossip protocol and exchange state information with other nodes,
_[GossipTask.run()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L321]_
invokes
_[doGossipToLiveMember|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L955]_,
_[maybeGossipToUnreachableMember|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L964]_,
and
_[maybeGossipToSeed|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L982]_.
All three methods invoke
_[sendGossip()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L933]_
to send a gossip message to a randomly selected endpoint.

The interaction between GossipTask.run() and sendGossip() creates a potential
synchronization bottleneck because the lock
([_taskLock_|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L328])
is held during network-bound operations. GossipTask.run() directly calls
_[waitUntilListening()|https://github.com/apache/cassandra/blob/7b1eb1f0b717beb33e611157766701cd71e4ad8c/src/java/org/apache/cassandra/gms/Gossiper.java#L326]_,
which waits for the MessagingService to start listening, so the task can be
delayed if the messaging service is slow to start or has issues. Also, if
sendGossip() encounters network-related delays (e.g. network latency, timeouts,
slow or unresponsive nodes) in a cluster with a large number of nodes, the
taskLock could be held for longer periods. That could increase the risk of a
backlog of waiting threads (if delays are frequent) and affect the scheduling
of subsequent tasks.
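
The general shape of the proposed change can be sketched as follows. This is a
minimal, hypothetical illustration and not the Cassandra code: the class
GossipRoundSketch, its fields, the "digest-N" payload, and the BiConsumer used
as a stand-in for the network send are all assumptions made for the example.
Only the idea (prepare under taskLock, send after releasing it) comes from the
description above. Whether this reordering is safe in Gossiper depends on
invariants that would need to be checked against the real code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BiConsumer;

// Hypothetical sketch only; none of these names exist in Cassandra.
final class GossipRoundSketch {
    /** A prepared message paired with the endpoint it should be sent to. */
    static final class Outbound {
        final String endpoint;
        final String payload;

        Outbound(String endpoint, String payload) {
            this.endpoint = endpoint;
            this.payload = payload;
        }
    }

    private final ReentrantLock taskLock = new ReentrantLock();
    private final List<String> liveEndpoints;
    // Stand-in for the network send: (endpoint, payload) -> side effect.
    private final BiConsumer<String, String> transport;
    private int generation = 0; // shared state guarded by taskLock

    GossipRoundSketch(List<String> liveEndpoints, BiConsumer<String, String> transport) {
        this.liveEndpoints = new ArrayList<>(liveEndpoints);
        this.transport = transport;
    }

    void runRound() {
        List<Outbound> pending = new ArrayList<>();

        // Phase 1: touch shared state and build messages while holding the lock.
        taskLock.lock();
        try {
            generation++;
            String digest = "digest-" + generation;
            for (String endpoint : liveEndpoints) {
                pending.add(new Outbound(endpoint, digest));
            }
        } finally {
            taskLock.unlock();
        }

        // Phase 2: the lock is released, so a slow send no longer blocks
        // other threads that are waiting for taskLock.
        for (Outbound message : pending) {
            transport.accept(message.endpoint, message.payload);
        }
    }

    /** Exposed so a caller can check that sends happen outside the lock. */
    boolean isLockHeldByCurrentThread() {
        return taskLock.isHeldByCurrentThread();
    }
}
```

The trade-off in this sketch is that the messages are a snapshot: state may
change between releasing the lock and sending, so a receiver can see slightly
stale data. Gossip normally tolerates that, but it is an assumption here rather
than something the description establishes.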



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
