[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774515#comment-16774515
 ] 

Blake Eggleston commented on CASSANDRA-15027:
---------------------------------------------

Thanks [~spo...@gmail.com]. I’ve extended your code so that in addition to 
waiting for other anti-compactions to complete, the coordinator also 
proactively cancels ongoing anti-compactions on the other participants. This 
avoids wasting time waiting for anti-compactions on other machines. The code 
does 3 things:
 * Adds a session state check to the {{isStopRequested}} method in the 
anti-compaction iterator.
 * The coordinator now sends failure messages to all participants when it 
receives a failure message from one of them in the prepare phase. It does not 
mark these participants as having failed internally though, since that would 
cause the nodetool session to immediately complete. Instead, it waits until 
it’s received messages from all the other nodes.
 * The participants will now respond with a failed prepare message if the 
anti-compaction completes but the session was failed in the meantime. This 
prevents a deadlock on the coordinator in the case where the participant 
receives a failure message between the time the anti-compaction completes and 
the callback fires.
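
To make the three points above easier to review, here is a rough, self-contained sketch of the intended behaviour. It is not the code on the branch: {{PrepareCoordinatorSketch}}, {{Participant}}, {{cancelSentTo}} and the method names on {{Participant}} are hypothetical stand-ins, messaging is replaced by a list, and only {{handlePrepareResponse}}, {{prepareFuture}} and {{isStopRequested}} borrow names from the ticket.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch only; it does not mirror Cassandra's CoordinatorSession.
class PrepareCoordinatorSketch {
    enum State { PENDING, PREPARED, FAILED }

    // TreeMap keeps iteration order deterministic for the example.
    private final Map<String, State> responses = new TreeMap<>();
    private boolean failureBroadcast = false;

    // Stand-in for "send a failure message": records who was told to cancel.
    final List<String> cancelSentTo = new ArrayList<>();

    // Completes only once every participant has answered:
    // true if all prepared, false if at least one failed.
    final CompletableFuture<Boolean> prepareFuture = new CompletableFuture<>();

    PrepareCoordinatorSketch(Set<String> participants) {
        for (String p : participants) responses.put(p, State.PENDING);
    }

    synchronized void handlePrepareResponse(String participant, boolean success) {
        // Ignore unknown participants and duplicate responses.
        if (responses.get(participant) != State.PENDING) return;
        responses.put(participant, success ? State.PREPARED : State.FAILED);

        // On the first failure, tell the other participants to cancel, but do
        // NOT mark them as failed here; we still wait for their own responses.
        if (!success && !failureBroadcast) {
            failureBroadcast = true;
            for (String other : responses.keySet())
                if (!other.equals(participant)) cancelSentTo.add(other);
        }

        // Complete only after all participants have responded.
        if (!responses.containsValue(State.PENDING))
            prepareFuture.complete(!responses.containsValue(State.FAILED));
    }

    // Participant side, reduced to the session-state checks.
    static class Participant {
        private volatile boolean sessionFailed = false;

        void handleFailureMessage() { sessionFailed = true; }

        // Polled by the anti-compaction iterator so it can stop early.
        boolean isStopRequested() { return sessionFailed; }

        // Called when anti-compaction finishes: report success only if the
        // session was not failed in the meantime.
        boolean prepareResponseAfterAntiCompaction() { return !sessionFailed; }
    }
}
```

In this sketch, a failure from one of three participants records a cancel for the other two, and {{prepareFuture}} stays incomplete until both of them have answered, whether with success or failure.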

Let me know what you think. If everything looks ok to you, I’m +1 on committing.

[trunk|https://github.com/bdeggleston/cassandra/tree/15027-trunk]
 
[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/15027-trunk]

> Handle IR prepare phase failures less race prone by waiting for all results
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15027
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Repair, Local/Compaction
>            Reporter: Stefan Podkowinski
>            Assignee: Stefan Podkowinski
>            Priority: Major
>             Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that when the node is both coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on the same 
> node). As a result, running repair commands one right after another 
> may cause overlapping execution of anti-compactions, which will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
