[jira] [Commented] (CASSANDRA-10070) Automatic repair scheduling

Marcus Olsson (JIRA) Mon, 07 Dec 2015 01:29:29 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15044637#comment-15044637 ]


Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

[~zemeyer] I've added the ability to schedule a job remotely, so that one 
node can tell another node to run a certain job. Right now it is used when a 
node discovers that another node has been down for longer than the hint 
window, and then tells that node to repair its ranges ASAP. Remote 
scheduling uses the distributed locking mechanism to prevent multiple nodes 
from telling the same node to run the repair at the same time.
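
As a rough illustration of the locking idea above (not the code on the branch), the sketch below uses an in-memory table as a stand-in for the distributed lock. The names LockTable, request_remote_repair and the "remote-repair/<node>" resource key are made up for this example, and it assumes the lock stays held until someone explicitly releases it.

```python
class LockTable:
    """In-memory stand-in for a distributed lock (hypothetical, illustration only)."""

    def __init__(self):
        self._owners = {}  # resource name -> node currently holding the lock

    def try_acquire(self, resource, owner):
        # setdefault only stores `owner` if nobody holds the lock yet.
        return self._owners.setdefault(resource, owner) == owner

    def release(self, resource, owner):
        # Only the current holder may release the lock.
        if self._owners.get(resource) == owner:
            del self._owners[resource]


def request_remote_repair(locks, sender, target, send):
    """Ask `target` to repair its ranges unless another node already holds the lock."""
    resource = f"remote-repair/{target}"  # assumed naming scheme
    if not locks.try_acquire(resource, sender):
        return False  # another node is already handling this target
    send(target, "repair")
    return True
```

With this shape, if nodes B and C both notice that node A was down too long, only the first one to take the lock sends the job; the other gets False back and does nothing.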

So a simple flow could be:
Node A goes down at 12:00
Node B recognizes it and saves "Node A DOWN @ 12:00" locally
Node A comes back up at 16:00
Node B sees Node A as online again at 16:00 and notes that Node A has been 
  down since 12:00, i.e. for 4 hours.
Node B sends a repair job to Node A for each table that has a hint window of 
  4 hours or less.
Node A runs all of the scheduled repairs
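
The flow above could be sketched roughly as follows. This is only an illustration under assumptions: the DowntimeTracker class, its method names, the table names and the per-table hint window mapping are all hypothetical and are not taken from the branch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DowntimeTracker:
    """Hypothetical bookkeeping kept locally by the observing node (Node B)."""

    # table name -> hint window for that table (assumed to be known per table)
    hint_windows: dict
    # node name -> time at which this node first saw it as down
    down_since: dict = field(default_factory=dict)

    def on_node_down(self, node, now):
        # Keep the earliest timestamp if the node is reported down repeatedly.
        self.down_since.setdefault(node, now)

    def on_node_up(self, node, now):
        """Return the tables for which a repair job should be sent to `node`."""
        went_down = self.down_since.pop(node, None)
        if went_down is None:
            return []  # we never saw it go down, nothing to do
        downtime = now - went_down
        # A table needs repair when its hint window is no longer than the downtime.
        return sorted(t for t, w in self.hint_windows.items() if w <= downtime)


tracker = DowntimeTracker({
    "ks.events": timedelta(hours=3),
    "ks.users": timedelta(hours=6),
})
tracker.on_node_down("A", datetime(2015, 12, 7, 12, 0))
tables_to_repair = tracker.on_node_up("A", datetime(2015, 12, 7, 16, 0))
```

With a 4 hour outage, only the table with the 3 hour hint window is selected here; the table with the 6 hour window is assumed to be covered by hints.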

---

I'll continue to work on the ability to pause all repairs and on the 
prevention mechanism. I've done some work on the prevention mechanism for 
jobs: it checks the repair history and only reports that it *can* run a 
repair if some range hasn't been repaired within the hint window (it's still 
based on the interval though, so a repair shouldn't run more than once per 
interval in the normal case).
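
A minimal sketch of that history check, assuming the job history can be reduced to a mapping from token range to the time of its last successful repair (None if never repaired). The function name and the data shape are assumptions for illustration, and the interval-based scheduling is left out.

```python
from datetime import datetime, timedelta


def can_run_repair(last_repaired, hint_window, now):
    """Return True if at least one range has not been repaired within the hint window.

    last_repaired: dict mapping a range identifier to the datetime of its last
    repair, or None if the range has never been repaired (assumed shape).
    """
    return any(
        ts is None or now - ts > hint_window
        for ts in last_repaired.values()
    )
```

For example, with a 3 hour hint window, a history where every range was repaired an hour ago yields False, while a single range last repaired 5 hours ago is enough to yield True.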

I should probably also add a way for the prevention mechanism to avoid 
running multiple repairs on a single node at the same time. After that I'll 
add the ability to run repair tasks in parallel across the cluster.

---

The git branch is [here|https://github.com/emolsson/cassandra/commits/10070].

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a 
> required task, but this can both be hard for new users and it also requires a 
> bit of manual configuration. There are good tools out there that can be used 
> to simplify things, but wouldn't this be a good feature to have inside of 
> Cassandra? To automatically schedule and run repairs, so that when you start 
> up your cluster it basically maintains itself in terms of normal 
> anti-entropy, with the possibility for manual configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
