[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15147751#comment-15147751 ]

Paulo Motta edited comment on CASSANDRA-10070 at 2/15/16 8:07 PM:
------------------------------------------------------------------

Starting with a single repair per DC and adding support for parallel repair 
sessions later sounds like a good idea.

bq. I agree and we could probably store the parent repair session id in an 
extra column of the lock table and have a thread wake up periodically to see if 
there are repair sessions without locks. 

Do we intend to reuse the lock table for other maintenance tasks as well? If 
so, we should add a generic "holder" column to the lock table so we can reuse 
it to identify resources other than the parent repair session in the future. 
We could also add an "attributes" map to the lock table to store additional 
attributes such as status, or keep the lock table simple and maintain status 
in a separate table.
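To make the idea concrete, here is a minimal in-memory sketch of such a generalized lock entry. This is only an illustration of the design, not a proposed schema: the names {{LockEntry}}, {{holder}}, {{attributes}}, and the resource string format are all assumptions, and a real implementation would use the lock table with CAS/LWT semantics rather than a local map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a generic lock-table row: "holder" identifies whoever
// owns the lock (a parent repair session id today, possibly other maintenance
// tasks later), and "attributes" carries extra state such as status.
final class LockEntry {
    final String resource;                 // e.g. "RepairResource-DC1" (illustrative)
    final String holder;                   // e.g. a parent repair session id
    final Map<String, String> attributes = new HashMap<>();

    LockEntry(String resource, String holder) {
        this.resource = resource;
        this.holder = holder;
    }
}

final class LockTable {
    private final Map<String, LockEntry> locks = new ConcurrentHashMap<>();

    // Acquire succeeds only if nobody else holds the resource; a distributed
    // implementation would do the equivalent with a lightweight transaction.
    boolean tryLock(String resource, String holder) {
        return locks.putIfAbsent(resource, new LockEntry(resource, holder)) == null;
    }

    Optional<LockEntry> get(String resource) {
        return Optional.ofNullable(locks.get(resource));
    }

    void release(String resource) {
        locks.remove(resource);
    }
}
```

Keeping status in the "attributes" map (rather than a dedicated column) is what keeps the entry generic enough to reuse for non-repair tasks.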

bq. But then we must somehow be able to differentiate user-defined and 
automatically scheduled repair sessions. It could be done by having all repairs 
go through this scheduling interface, which also would reduce user mistakes 
with multiple repairs in parallel. Another alternative is to have a custom flag 
in the parent repair that makes the garbage collector ignore it if it's 
user-defined. I think that the garbage collector/cancel repairs when unable to 
lock feature is something that should be included in the first pass.

Ideally all repairs would go through this interface, but that would probably 
add complexity at this stage. So we should probably just add a "lockResource" 
attribute to each repair session object, and each node would go through all 
currently running repairs, checking that it still holds the lock whenever the 
"lockResource" field is set.
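A rough sketch of that per-node check, assuming a nullable "lockResource" field distinguishes scheduled repairs from user-defined ones (all class and method names here are hypothetical, not existing Cassandra code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical view of a running repair: only sessions started by the
// scheduler carry a non-null lockResource; user-triggered repairs do not.
final class RunningRepair {
    final String parentSessionId;
    final String lockResource;   // null for user-defined repairs
    boolean cancelled = false;

    RunningRepair(String parentSessionId, String lockResource) {
        this.parentSessionId = parentSessionId;
        this.lockResource = lockResource;
    }
}

final class RepairLockChecker {
    // Periodic sweep: cancel any scheduled repair whose lock this node no
    // longer holds. "stillHeld" abstracts the lock-table lookup.
    static List<RunningRepair> cancelOrphans(List<RunningRepair> running,
                                             Predicate<RunningRepair> stillHeld) {
        List<RunningRepair> cancelled = new ArrayList<>();
        for (RunningRepair r : running) {
            if (r.lockResource == null) continue;    // user-defined: leave alone
            if (!stillHeld.test(r)) {
                r.cancelled = true;
                cancelled.add(r);
            }
        }
        return cancelled;
    }
}
```

The null check is what lets user-defined repairs bypass the garbage collector without any extra flag.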

bq. The most basic failure scenarios should be covered by retrying a repair if 
it fails and log a warning/error based on how many times it failed. Could the 
retry behaviour cause some unexpected consequences?

It would probably be safer to abort ongoing validation and streaming 
background tasks and clean up repair state on all involved nodes before 
starting a new repair session on the same ranges. This doesn't seem to be done 
currently. As far as I understand, if nodes A, B, and C are running a repair 
with A as the coordinator, and validation or streaming fails on node B, the 
coordinator (A) is notified and fails the repair session, but node C will keep 
doing validation and/or streaming, which could cause problems (or increased 
load) if we start another repair session on the same range. 

We will probably need to extend the repair protocol to perform this 
cleanup/abort step on failure. We already have a legacy cleanup message that 
doesn't seem to be used in the current protocol, which we could maybe reuse to 
clean up repair state after a failure. This repair abort will probably overlap 
with CASSANDRA-3486. In any case, this is a separate (but related) issue, so 
we should address it in an independent ticket and make this ticket depend on 
it.

Another, unrelated option that we should probably include in the future is a 
timeout, aborting repair sessions that run longer than it.
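The retry behaviour discussed above could be sketched roughly as follows: clean up leftover validation/streaming state on all involved nodes before each retry, warn on intermediate failures, and error once the attempt budget is exhausted. Everything here ({{RepairRetrier}}, the parameters, the log strings) is an illustrative assumption, not existing code:

```java
import java.util.List;
import java.util.concurrent.Callable;

final class RepairRetrier {
    // Hypothetical retry loop: before each retry, run the cleanup/abort step
    // (stand-in for aborting validation and streaming tasks on all involved
    // nodes), logging a warning per failed attempt and an error at the end.
    static boolean runWithRetries(Callable<Boolean> repair, Runnable cleanup,
                                  int maxAttempts, List<String> log) throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (repair.call()) {
                return true;                  // repair session succeeded
            }
            if (attempt < maxAttempts) {
                log.add("WARN: repair attempt " + attempt + " failed, cleaning up and retrying");
                cleanup.run();                // cleanup/abort before the retry
            } else {
                log.add("ERROR: repair failed after " + attempt + " attempts");
            }
        }
        return false;
    }
}
```

The point of the sketch is the ordering: cleanup runs between a failure and the next attempt, never concurrently with a live session on the same ranges.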



> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a 
> required task, but it can be hard for new users and also requires some 
> manual configuration. There are good tools out there that can be used to 
> simplify things, but wouldn't this be a good feature to have inside of 
> Cassandra? To automatically schedule and run repairs, so that when you start 
> up your cluster it basically maintains itself in terms of normal 
> anti-entropy, with the possibility for manual configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)