[ https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774515#comment-16774515 ]
Blake Eggleston commented on CASSANDRA-15027: --------------------------------------------- Thanks [~spo...@gmail.com]. I’ve extended your code so that in addition to waiting for other anti-compactions to complete, the coordinator also pro-actively cancels ongoing anti-compactions on the other participants. This avoids wasting time waiting for anti-compactions on other machines. The code does 3 things: * Adds a session state check to the {{isStopRequested}} method in the anti-compaction iterator. * The coordinator now sends failure messages to all participants when it receives a failure message from one of them in the prepare phase. It does not mark these participants as having failed internally though, since that would cause the nodetool session to immediately complete. Instead, it waits until it’s received messages from all the other nodes. * The participants will now respond with a failed prepare message if the anti-compaction completes, but the session was failed in the mean time. This prevents a dead lock on the coordinator in the case where the participant received a failure message between the time the anti-compaction completes and the callback fires. Let me know what you think. If everything looks ok to you, I’m +1 on committing. [trunk|https://github.com/bdeggleston/cassandra/tree/15027-trunk] [circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/15027-trunk] > Handle IR prepare phase failures less race prone by waiting for all results > --------------------------------------------------------------------------- > > Key: CASSANDRA-15027 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15027 > Project: Cassandra > Issue Type: Bug > Components: Consistency/Repair, Local/Compaction > Reporter: Stefan Podkowinski > Assignee: Stefan Podkowinski > Priority: Major > Fix For: 4.x > > > Handling incremental repairs as a coordinator begins by sending a > {{PrepareConsistentRequest}} message to all participants, which may also > include the coordinator itself. Participants will run anti-compactions upon > receiving such a message and report the result of the operation back to the > coordinator. > Once we receive a failure response from any of the participants, we fail-fast > in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn > completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then > the repair command will terminate with an error status, as expected. > The issue is that in case the node will both be coordinator and participant, > we may end up with a local session and submitted anti-compactions, which will > be executed without any coordination with the coordinator session (on same > node). This may result in situations where running repair commands right > after another, may cause overlapping execution of anti-compactions that will > cause the following (misleading) message to show up in the logs and will > cause the repair to fail again: > "Prepare phase for incremental repair session %s has failed because it > encountered intersecting sstables belonging to another incremental repair > session (%s). This is by starting an incremental repair session before a > previous one has completed. Check nodetool repair_admin for hung sessions and > fix them." -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org