[jira] [Assigned] (CASSANDRA-11190) Fail fast repairs

Paulo Motta (JIRA) Thu, 30 Mar 2017 14:47:50 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Paulo Motta reassigned CASSANDRA-11190:
---------------------------------------

    Assignee:     (was: Paulo Motta)

> Fail fast repairs
> -----------------
>
>                 Key: CASSANDRA-11190
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11190
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Paulo Motta
>            Priority: Minor
>
> Currently, if one node fails any phase of the repair (validation, streaming), 
> the repair session is aborted, but the other nodes are not notified and keep 
> doing either validation or syncing with other nodes.
> With CASSANDRA-10070 automatically scheduling repairs and potentially 
> scheduling retries it would be nice to make sure all nodes abort failed 
> repairs in order to be able to start other repairs safely on the same nodes.
> From CASSANDRA-10070:
> bq. As far as I understood, if there are nodes A, B, C running repair, A is 
> the coordinator. If validation or streaming fails on node B, the coordinator 
> (A) is notified and fails the repair session, but node C will remain doing 
> validation and/or streaming, which could cause problems (or increased load) if 
> we start another repair session on the same range.
> bq. We will probably need to extend the repair protocol to perform this 
> cleanup/abort step on failure. We already have a legacy cleanup message that 
> doesn't seem to be used in the current protocol that we could maybe reuse to 
> clean up repair state after a failure. This repair abort will probably 
> overlap with CASSANDRA-3486. In any case, this is a separate (but 
> related) issue and we should address it in an independent ticket, and make 
> this ticket dependent on that.
> On CASSANDRA-5426 [~slebresne] suggested doing this to avoid unexpected 
> conditions/hangs:
> bq. I wonder if maybe we should have more of a fail-fast policy when there are 
> errors. For instance, if one node fails its validation phase, maybe it might 
> be worth failing right away and let the user re-trigger a repair once he has 
> fixed whatever was the source of the error, rather than still 
> differencing/syncing the other nodes.
> bq. Going a bit further, I think we should add 2 messages to interrupt the 
> validation and sync phase. If only because that could be useful to users if 
> they need to stop a repair for some reason, but also, if we get an error 
> during validation from one node, we could use that to interrupt the other 
> nodes and thus fail fast while minimizing the amount of work done uselessly.
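
For illustration only, the fail-fast behaviour proposed above could look roughly
like the sketch below. It is not taken from the Cassandra code base: the class
FailFastRepairSketch, the RepairSession type, the State enum and the
onFailure() hook are all hypothetical names, and sending an abort message is
simulated by recording the target node in a list. It assumes the coordinator
tracks every participant of the session (A, B and C in the example above).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of fail-fast repair coordination; not Cassandra's real classes. */
public class FailFastRepairSketch {
    enum State { RUNNING, FAILED, ABORTED }

    /** Coordinator-side view of a single repair session (illustrative names only). */
    static class RepairSession {
        private final Map<String, State> participants = new LinkedHashMap<>();
        private final List<String> abortsSent = new ArrayList<>();
        private boolean failed = false;

        RepairSession(List<String> nodes) {
            for (String node : nodes) {
                participants.put(node, State.RUNNING);
            }
        }

        /** Called when a participant reports a validation or streaming failure. */
        void onFailure(String failedNode) {
            if (!participants.containsKey(failedNode)) {
                throw new IllegalArgumentException("unknown participant: " + failedNode);
            }
            if (failed) {
                return; // the session already failed; aborts are broadcast only once
            }
            failed = true;
            participants.put(failedNode, State.FAILED);
            for (Map.Entry<String, State> entry : participants.entrySet()) {
                if (entry.getValue() == State.RUNNING) {
                    // Stand-in for sending a (hypothetical) abort message to the node.
                    abortsSent.add(entry.getKey());
                    entry.setValue(State.ABORTED);
                }
            }
        }

        State stateOf(String node) { return participants.get(node); }
        List<String> abortsSent() { return abortsSent; }
        boolean isFailed() { return failed; }
    }

    public static void main(String[] args) {
        RepairSession session = new RepairSession(Arrays.asList("A", "B", "C"));
        session.onFailure("B"); // B fails validation or streaming
        System.out.println("aborts sent to: " + session.abortsSent());
        System.out.println("state of C: " + session.stateOf("C"));
    }
}
```

In this sketch a failure on B marks the session as failed and immediately
notifies A and C, so no node keeps validating or streaming for a session that
can no longer succeed. A real implementation would additionally need the
participants to interrupt their in-progress validation and sync work when the
abort arrives, which is the part the ticket asks for.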



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
