[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134231#comment-15134231 ]
Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

[~yukim] [~pauloricardomg] Thanks for the comments, great questions/suggestions!

Regarding your questions about the locking:
{quote}
* What would "lock resource" be like for repair scheduling? I think the value controls the number of repair jobs running at a given time in the whole cluster, and we don't want to run too many repair jobs at once.
* I second Yuki Morishita's first question above, in that we need to better specify how cluster-wide repair parallelism is handled: is it fixed or configurable? Can a node run repair for multiple ranges in parallel? Perhaps we should have a node_repair_parallelism (default 1) and dc_repair_parallelism (default 1) global config and reject starting repairs above those thresholds.
{quote}
The thought with the lock resource was that it could be something simple, like a table defined as:
{noformat}
CREATE TABLE lock (
    resource text PRIMARY KEY
)
{noformat}
The different nodes would then try to take the lock using LWT with a TTL:
{noformat}
INSERT INTO lock (resource) VALUES ('RepairResource') IF NOT EXISTS USING TTL 30;
{noformat}
After that, the node would have to keep updating the locked resource while running the repair, to prevent someone else from taking it over. The value "RepairResource" could just as easily be defined as "RepairResource-N", which would make it possible to run repairs in parallel.

A problem with this table is that in a setup with two data centers and three replicas in each, we have a total of six replicas, and QUORUM would require four replicas to succeed. Running repair would therefore require both data centers to be available. Since some of the keyspaces might not be replicated across both data centers, we would still have to be able to run repair even if one of the data centers is unavailable.
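To make the intended semantics concrete, here is a minimal in-memory sketch of the lock behavior described above: acquire via compare-and-set with a TTL, keep renewing while the repair runs, and use N "RepairResource-N" rows to cap cluster-wide parallelism. This only simulates the LWT semantics; a real implementation would issue the INSERT ... IF NOT EXISTS / UPDATE ... IF statements against Cassandra, and the names here (LockTable, acquire_any) are illustrative, not from the proposal.

```python
import time

class LockTable:
    """In-memory stand-in for the proposed 'lock' table: one row per
    resource, expiring after a TTL unless the holder keeps renewing it."""

    def __init__(self):
        self._rows = {}  # resource -> (holder, expiry timestamp)

    def try_acquire(self, resource, holder, ttl, now=None):
        """Compare-and-set insert: succeeds only if no unexpired row exists
        (mirrors INSERT ... IF NOT EXISTS USING TTL)."""
        now = time.time() if now is None else now
        row = self._rows.get(resource)
        if row is not None and row[1] > now:
            return row[0] == holder  # already held, possibly by us
        self._rows[resource] = (holder, now + ttl)
        return True

    def renew(self, resource, holder, ttl, now=None):
        """Refresh the TTL; only the current holder may renew."""
        now = time.time() if now is None else now
        row = self._rows.get(resource)
        if row is None or row[1] <= now or row[0] != holder:
            return False
        self._rows[resource] = (holder, now + ttl)
        return True

def acquire_any(table, node, parallelism, ttl, now=None):
    """Try 'RepairResource-0' .. 'RepairResource-(N-1)' in order, so that
    N lock rows allow at most N repairs to run cluster-wide."""
    for i in range(parallelism):
        resource = "RepairResource-%d" % i
        if table.try_acquire(resource, node, ttl, now=now):
            return resource
    return None
```

With parallelism 2, two nodes can hold locks at once; a third is rejected until one holder stops renewing and its TTL lapses.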
This also applies if we should "force" local-DC repairs when a data center has been unavailable for too long. As I see it, there are two options for solving this:
* Get the lock with LOCAL_SERIAL during these scenarios.
* Have a separate lock table for each data center *and* a global one.

I guess the easiest solution would be to use LOCAL_SERIAL, but I'm not sure if it might cause some unexpected behavior. Going for the other option with separate tables would probably increase the overall complexity, but it would make it easier to restrict the number of parallel repairs in a single data center.

Just a question regarding your suggestion with node_repair_parallelism: should it specify the number of repairs a node can initiate, or how many repairs the node can be an active part of in parallel? I guess the second alternative would be harder to implement, but it is probably what one would expect.

---

{quote}
* It seems the scheduling only makes sense for repairing the primary range of the node ('nodetool -pr'), since we end up repairing all nodes eventually. Are you considering other options like subrange ('nodetool -st -et') repair?
* For subrange repair, we could maybe have something similar to reaper's segmentCount option, but since this would add more complexity we could leave it for a separate ticket.
{quote}
It should be possible to extend the repair scheduler with subrange repairs, either by having it as an option per table or by having a separate scheduler for it. The separate scheduler would just be another plugin that could replace the default repair scheduler.
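As a rough illustration of what a segmentCount-style subrange scheduler would have to do, the sketch below divides the full ring into N contiguous (start, end] slices, each of which could then be repaired with 'nodetool repair -st <start> -et <end>'. It assumes Murmur3Partitioner's token range; the function name is hypothetical.

```python
# Murmur3Partitioner's token space in Cassandra.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_token_range(segment_count):
    """Divide the full token ring into `segment_count` contiguous
    (start, end] slices of near-equal size."""
    total = MAX_TOKEN - MIN_TOKEN  # 2**64 - 1 tokens in the ring
    segments = []
    start = MIN_TOKEN
    for i in range(1, segment_count + 1):
        # Integer arithmetic keeps the slices contiguous with no gaps.
        end = MIN_TOKEN + total * i // segment_count
        segments.append((start, end))
        start = end
    return segments
```

For token_division='2048' this yields 2048 slices covering the whole ring exactly once; a per-node scheduler would intersect these with the node's owned ranges before repairing.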
If we go for a table configuration, the user could either specify pr or the number of segments to divide the token range into, something like:
{noformat}
repair_options = {..., token_division='pr'};   // Use primary range repair
or
repair_options = {..., token_division='2048'}; // Divide the token range into 2048 slices
{noformat}
If we would have a separate scheduler, it could just be a configuration option for it. Personally I would prefer to have it all in a single scheduler, and I agree that it should probably be a separate ticket to keep the complexity of the base scheduler to a minimum. But I think this is a feature that will be very much needed, both with non-vnode token assignment and also with the possibility to reduce the number of vnodes as of CASSANDRA-7032.

---

{quote}
* While pausing repair is a nice feature for user-based interruptions, we could probably embed system-known interruptions (such as when a bootstrap or upgrade is going on) in the default rejection logic.
{quote}
Agreed, are there any other scenarios that we might have to take into account?

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a required task, but this can both be hard for new users and it also requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? To automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)