[ https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Paulo Motta reassigned CASSANDRA-11190:
---------------------------------------

    Assignee:     (was: Paulo Motta)

> Fail fast repairs
> -----------------
>
>                 Key: CASSANDRA-11190
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Priority: Minor
>
> Currently, if one node fails any phase of the repair (validation, streaming),
> the repair session is aborted, but the other nodes are not notified and keep
> doing either validation or syncing with other nodes.
> With CASSANDRA-10070 automatically scheduling repairs and potentially
> scheduling retries, it would be nice to make sure all nodes abort failed
> repairs in order to be able to start other repairs safely on the same nodes.
> From CASSANDRA-10070:
> bq. As far as I understood, if there are nodes A, B, C running repair, A is
> the coordinator. If validation or streaming fails on node B, the coordinator
> (A) is notified and fails the repair session, but node C will remain doing
> validation and/or streaming, which could cause problems (or increased load) if
> we start another repair session on the same range.
> bq. We will probably need to extend the repair protocol to perform this
> cleanup/abort step on failure. We already have a legacy cleanup message that
> doesn't seem to be used in the current protocol that we could maybe reuse to
> clean up repair state after a failure. This repair abort will probably
> overlap with CASSANDRA-3486. In any case, this is a separate (but
> related) issue and we should address it in an independent ticket, and make
> this ticket dependent on that.
> On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected
> conditions/hangs:
> bq. I wonder if maybe we should have more of a fail-fast policy when there
> are errors. For instance, if one node fails its validation phase, maybe it
> might be worth failing right away and let the user re-trigger a repair once he
> has fixed whatever was the source of the error, rather than still
> differencing/syncing the other nodes.
> bq. Going a bit further, I think we should add 2 messages to interrupt the
> validation and sync phase. If only because that could be useful to users if
> they need to stop a repair for some reason, but also, if we get an error
> during validation from one node, we could use that to interrupt the other
> nodes and thus fail fast while minimizing the amount of work done uselessly.
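
The A/B/C scenario quoted above can be sketched roughly as follows. This is only an illustration of the proposed fail-fast behaviour, not Cassandra code: the class names (FailFastRepairSketch, Participant, Coordinator) and the abort() call standing in for an "interrupt" message are all hypothetical, and real messaging, validation and streaming are omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these types exist in Cassandra.
public class FailFastRepairSketch {

    /** A node taking part in a repair session (B or C in the example). */
    static final class Participant {
        final String name;
        private final AtomicBoolean aborted = new AtomicBoolean(false);

        Participant(String name) {
            this.name = name;
        }

        /** Stands in for receiving an abort message; true only on the first call. */
        boolean abort() {
            return aborted.compareAndSet(false, true);
        }

        boolean isAborted() {
            return aborted.get();
        }
    }

    /** The repair coordinator (A in the example) for a single session. */
    static final class Coordinator {
        private final Map<String, Participant> participants = new LinkedHashMap<>();
        private boolean failed = false;

        Coordinator(List<Participant> nodes) {
            for (Participant p : nodes) {
                participants.put(p.name, p);
            }
        }

        /**
         * Called when one participant reports a validation or streaming failure.
         * Fails the session and tells every other participant to abort.
         * Returns the names of the nodes that were notified.
         */
        List<String> onFailure(String failedNode) {
            List<String> notified = new ArrayList<>();
            if (failed) {
                return notified; // session already failed; nothing more to do
            }
            failed = true;
            for (Participant p : participants.values()) {
                if (!p.name.equals(failedNode) && p.abort()) {
                    notified.add(p.name);
                }
            }
            return notified;
        }

        boolean isFailed() {
            return failed;
        }
    }

    public static void main(String[] args) {
        Participant b = new Participant("B");
        Participant c = new Participant("C");
        Coordinator a = new Coordinator(Arrays.asList(b, c));

        // B fails; A fails the session and tells C to stop.
        List<String> notified = a.onFailure("B");
        System.out.println("notified=" + notified + " cAborted=" + c.isAborted());
    }
}
```

In this sketch the abort is idempotent, so a repeated failure report does not notify nodes twice. Whether the failed node itself also needs a cleanup step is left open here.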