[ 
https://issues.apache.org/jira/browse/CASSANDRA-12966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709058#comment-15709058
 ] 

Stefan Podkowinski commented on CASSANDRA-12966:
------------------------------------------------

Seems like the single-threaded gossip execution is a bit problematic, as this 
also caused some pain in CASSANDRA-12281. Looks like CASSANDRA-8398 will be a 
good thing to have here.

Some comments regarding your patch:

My thoughts on concurrency aspects:
StorageService.handleStateNormal will update tokens for both TokenMetadata and 
SystemKeyspace. The previous blocking behavior would ensure both would be in 
sync. Offloading the system table update to the mutation stage would allow the 
table to lag behind, but I would not expect any races between mutations, as the 
execution order hasn't changed, just the executor.
Decoupling the mutations this way without waiting for the write result 
shouldn't be a problem, as the system table is only used during initialization 
and there are no guarantees that the gossip state for a node is always recent 
anyway.
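
To illustrate the ordering argument, here is a minimal sketch, not Cassandra 
code. It assumes a single-threaded executor standing in for the stage that 
receives the offloaded writes; the class and method names are hypothetical. The 
caller never waits on an individual write, so the "table" can lag, but writes 
are still applied in submission order. A multi-threaded pool would not give 
that ordering by itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for offloading system table updates; not Cassandra code.
class OffloadedUpdateSketch {

    // Submits every update to a single-threaded executor without waiting for
    // each one, then returns what the "table" contains once the queue drains.
    static List<String> applyOffloaded(List<String> updates) {
        List<String> table = Collections.synchronizedList(new ArrayList<>());
        // Assumption: one worker thread, so tasks run in submission order.
        ExecutorService writer = Executors.newSingleThreadExecutor();
        for (String update : updates) {
            // Fire and forget: the submitting thread does not block here.
            writer.execute(() -> table.add(update));
        }
        writer.shutdown();
        try {
            // Waiting here is only for the demo, so the result can be inspected.
            if (!writer.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writer did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(table);
    }

    public static void main(String[] args) {
        System.out.println(applyOffloaded(Arrays.asList("tokens-v1", "tokens-v2", "tokens-v3")));
    }
}
```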

The synchronized keywords for removeEndpoints look like a leftover from when 
the code would read and write back the modified token set, and it should be 
safe to remove them.

As for API modifications, there are now two updateToken versions, one blocking 
and one asynchronous. Maybe async methods should be named differently, as the 
Future return value will not be checked in the code and you wouldn't be able to 
tell which version is called by reading code on the caller side.
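
A rough sketch of what I mean by naming, with hypothetical class and method 
names rather than the ones in the patch: the asynchronous variant carries an 
Async suffix, so a caller that ignores the returned future is at least visibly 
calling the non-blocking version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical token store; names are illustrative, not taken from the patch.
class TokenStoreSketch {
    private final List<String> stored = new CopyOnWriteArrayList<>();

    // Asynchronous variant: the write may not have completed when this returns.
    CompletableFuture<Void> updateTokensAsync(Collection<String> tokens) {
        return CompletableFuture.runAsync(() -> stored.addAll(tokens));
    }

    // Blocking variant: returns only after the write has been applied.
    void updateTokensBlocking(Collection<String> tokens) {
        updateTokensAsync(tokens).join();
    }

    // Snapshot of what has been written so far.
    List<String> stored() {
        return new ArrayList<>(stored);
    }

    public static void main(String[] args) {
        TokenStoreSketch store = new TokenStoreSketch();
        store.updateTokensBlocking(Arrays.asList("100", "200"));
        // The call site makes it obvious that the result is not awaited.
        store.updateTokensAsync(Arrays.asList("300"));
        System.out.println(store.stored());
    }
}
```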


> Gossip thread slows down when using batch commit log
> ----------------------------------------------------
>
>                 Key: CASSANDRA-12966
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12966
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jason Brown
>            Assignee: Jason Brown
>            Priority: Minor
>
> When using batch commit log mode, the Gossip thread slows down on peers 
> after a node bounces. This is because we perform a bunch of updates to the 
> peers table via {{SystemKeyspace.updatePeerInfo}}, which is a synchronized 
> method. How long each one of those individual updates takes depends on how 
> busy the system is at the time wrt write traffic. If the system is largely 
> quiescent, each update will be relatively quick (just waiting for the fsync). 
> If the system is getting a lot of writes, and depending on the 
> commitlog_sync_batch_window_in_ms, each of the Gossip thread's updates can 
> get stuck in the backlog, which causes the Gossip thread to stop processing. 
> We have observed in large clusters that a rolling restart triggers and 
> exacerbates this behavior. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
