[ 
https://issues.apache.org/jira/browse/CASSANDRA-15059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16802986#comment-16802986
 ] 

Ariel Weisberg edited comment on CASSANDRA-15059 at 3/27/19 4:23 PM:
---------------------------------------------------------------------

Finished my review.

[Should this always wrap in 
AssertionError?|https://github.com/apache/cassandra/compare/cassandra-3.0...bdeggleston:15059-3.0?expand=1#diff-8be666c70553b1f0017a01458c490f47R340].
 I get that InterruptedException is probably an Error, but ExecutionException 
is just a run of the mill exception. It's not super important as it won't have 
a functional difference.

I think this fixes the problem it sets out to fix. I checked the call hierarchy 
for {{Gossiper.runInGossipStageBlocking}} and I think that every caller is not 
holding any locks or resources that might stop the Gossip thread from making 
progress. All of them don't appear to be holding anything obvious.

Documentation wise definitely document {{runInGossipStageBlocking}}, 
{{assassinate}}, and {{convict}} warning about the risk of deadlock if the 
Gossip thread ends up needing to acquire a resource via a listener (or FD's 
listener if there is a path for that) the calling thread holds.

Future wise this doesn't do anything to address the underlying fragility in how 
Gossiper doesn't document what is safe to call from outside the Gossip thread 
and what isn't. It also doesn't validate the correct thread is running a given 
method.

I think we want to add two (or three?) interfaces that Gossiper can implement. 
One is for methods that are non-mutating and safe to call from any thread. The 
other is for methods for that should only be called from the Gossip stage 
thread. And maybe a third which contains methods that mutate, but block on the 
Gossip stage. So {{Gossiper.instance}} would go away and you would have a 
reference to the Gossiper via one of the 2-3 interfaces.

Additionally any method that should only run in Gossip stage interface should 
have a {{Preconditions.checkState}} checking that it's actually running in the 
Gossip stage.

WDYT?


was (Author: aweisberg):
Finished my review.

[Should this always wrap in 
AssertionError?|https://github.com/apache/cassandra/compare/cassandra-3.0...bdeggleston:15059-3.0?expand=1#diff-8be666c70553b1f0017a01458c490f47R340].
 I get that InterruptedException is probably an Error, but ExecutionException 
is just a run of the mill exception. It's not super important as it won't have 
a functional difference.

I think this fixes the problem it sets out to fix. I checked the call hierarchy 
for {{Gossiper.runInGossipStageBlocking}} and I think that every caller is not 
holding any locks or resources that might stop the Gossip thread from making 
progress. All of them don't appear to be holding anything obvious.

Documentation wise definitely document {{runInGossipStageBlocking}}, 
{{assassinate}}, and {{convict}} warning about the risk of deadlock if the 
Gossip thread ends up needing to acquire a resource via a listener (or FD's 
listener if there is a path for that) the calling thread holds.

Future wise this doesn't do anything to address the underlying fragility in how 
Gossiper doesn't document what is safe to call from outside the Gossip thread 
and what isn't. It also doesn't validate the correct thread is running a given 
method.

I think we want to add two (or three?) interfaces that Gossiper can implement. 
One is for methods that are non-mutating and safe to call from any thread. The 
other is for methods for that should only be called from the Gossip stage 
thread. And maybe a third which contains methods that mutate, but block on the 
Gossip stage. So Gossiper.instance would go away and you would have a reference 
to the Gossiper via one of the 2-3 interfaces.

Additionally any method that should only run in Gossip stage interface has a 
{{Preconditions.checkState}} checking that it's actually running in the Gossip 
stage.

WDYT?

> Gossiper#markAlive can race with Gossiper#markDead
> --------------------------------------------------
>
>                 Key: CASSANDRA-15059
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15059
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Blake Eggleston
>            Assignee: Blake Eggleston
>            Priority: Normal
>
> The Gossiper class is not threadsafe and assumes all state changes happen in 
> a single thread (the gossip stage). Gossiper#convict, however, can be called 
> from the GossipTasks thread. This creates a race where calls to 
> Gossiper#markAlive and Gossiper#markDead can interleave, corrupting gossip 
> state. Gossiper#assassinateEndpoint has a similar problem, being called from 
> the mbean server thread.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to