[ https://issues.apache.org/jira/browse/CASSANDRA-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802986#comment-16802986 ]
Ariel Weisberg edited comment on CASSANDRA-15059 at 3/27/19 4:22 PM: --------------------------------------------------------------------- Finished my review. [Should this always wrap in AssertionError?|https://github.com/apache/cassandra/compare/cassandra-3.0...bdeggleston:15059-3.0?expand=1#diff-8be666c70553b1f0017a01458c490f47R340]. I get that InterruptedException is probably an Error, but ExecutionException is just a run of the mill exception. It's not super important as it won't have a functional difference. I think this fixes the problem it sets out to fix. I checked the call hierarchy for {{Gossiper.runInGossipStageBlocking}} and I think that every caller is not holding any locks or resources that might stop the Gossip thread from making progress. All of them don't appear to be holding anything obvious. Documentation wise definitely document {{runInGossipStageBlocking}}, {{assassinate}}, and {{convict}} warning about the risk of deadlock if the Gossip thread ends up needing to acquire a resource via a listener (or FD's listener if there is a path for that) the calling thread holds. Future wise this doesn't do anything to address the underlying fragility in how Gossiper doesn't document what is safe to call from outside the Gossip thread and what isn't. It also doesn't validate the correct thread is running a given method. I think we want to add two (or three?) interfaces that Gossiper can implement. One is for methods that are non-mutating and safe to call from any thread. The other is for methods for that should only be called from the Gossip stage thread. And maybe a third which contains methods that mutate, but block on the Gossip stage. So Gossiper.instance would go away and you would have a reference to the Gossiper via one of the 2-3 interfaces. Additionally any method that should only run in Gossip stage interface has a {{Preconditions.checkState}} checking that it's actually running in the Gossip stage. WDYT? was (Author: aweisberg): Finished my review. I can't tell what happened with CircleCI as it's quite complicated now, but it doesn't look like everything ran and/or passed. [Should this always wrap in AssertionError?|https://github.com/apache/cassandra/compare/cassandra-3.0...bdeggleston:15059-3.0?expand=1#diff-8be666c70553b1f0017a01458c490f47R340]. I get that InterruptedException is probably an Error, but ExecutionException is just a run of the mill exception. It's not super important as it won't have a functional difference. I think this fixes the problem it sets out to fix. I checked the call hierarchy for {{Gossiper.runInGossipStageBlocking}} and I think that every caller is not holding any locks or resources that might stop the Gossip thread from making progress. All of them don't appear to be holding anything obvious. Documentation wise definitely document {{runInGossipStageBlocking}}, {{assassinate}}, and {{convict}} warning about the risk of deadlock if the Gossip thread ends up needing to acquire a resource via a listener (or FD's listener if there is a path for that) the calling thread holds. Future wise this doesn't do anything to address the underlying fragility in how Gossiper doesn't document what is safe to call from outside the Gossip thread and what isn't. It also doesn't validate the correct thread is running a given method. I think we want to add two (or three?) interfaces that Gossiper can implement. One is for methods that are non-mutating and safe to call from any thread. The other is for methods for that should only be called from the Gossip stage thread. And maybe a third which contains methods that mutate, but block on the Gossip stage. So Gossiper.instance would go away and you would have a reference to the Gossiper via one of the 2-3 interfaces. Additionally any method that should only run in Gossip stage interface has a {{Preconditions.checkState}} checking that it's actually running in the Gossip stage. WDYT? > Gossiper#markAlive can race with Gossiper#markDead > -------------------------------------------------- > > Key: CASSANDRA-15059 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15059 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip > Reporter: Blake Eggleston > Assignee: Blake Eggleston > Priority: Normal > > The Gossiper class is not threadsafe and assumes all state changes happen in > a single thread (the gossip stage). Gossiper#convict, however, can be called > from the GossipTasks thread. This creates a race where calls to > Gossiper#markAlive and Gossiper#markDead can interleave, corrupting gossip > state. Gossiper#assassinateEndpoint has a similar problem, being called from > the mbean server thread. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org