[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142861#comment-15142861 ]
Paulo Motta commented on CASSANDRA-10070:
-----------------------------------------

Sorry for the delay, will try to be faster on the next iterations. Below are some comments on your previous reply:

bq. A problem with this table is that if we have a setup with two data centers and three replicas in each data center, then we have a total of six replicas and QUORUM would require four replicas to succeed. This would require that both data centers are available to be able to run repair.

All data centers involved in a repair must be available for a repair to start/succeed, so if we make the lock resource dc-aware and try to create the lock by contacting a node in each involved data center with LOCAL_SERIAL consistency, that should be sufficient to ensure correctness without the need for a global lock. This will also play along well with both a global {{dc_parallelism}} option and with the {{\-\-local}} or {{\-\-dcs}} table repair options. I thought of something along those lines:

{noformat}
dc_locks = {}
dcs = repair_dcs(keyspace, table) # depends on both keyspace settings and table repair settings (--local or --dcs)
for dc in dcs:
    for i in 0..dc_parallelism(dc):
        if ((lock = get_node(dc).execute("INSERT INTO lock (resource) VALUES ('RepairResource-{dc}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            dc_locks[dc] = lock
            break
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
else:
    start_repair(table)
{noformat}

bq. Just a question regarding your suggestion with the node_repair_parallelism. Should it be used to specify the number of repairs a node can initiate or how many repairs the node can be an active part of in parallel? I guess the second alternative would be harder to implement, but it is probably what one would expect.

The second alternative is probably the most desirable. Actually, {{dc_parallelism}} by itself might cause problems, since we can have a situation where all repairs run on a single node or range, overloading those nodes.
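To make the intended LWT semantics of the pseudocode above concrete, here is a minimal runnable sketch. All names ({{FakeSession}}, {{acquire_dc_locks}}, {{try_insert}}) are hypothetical; a real implementation would run the {{INSERT ... IF NOT EXISTS}} against one node per involved DC at LOCAL_SERIAL, whereas this toy session just simulates the "applied / not applied" outcome of the conditional insert:

```python
class FakeSession:
    """Toy stand-in for a session: simulates 'INSERT ... IF NOT EXISTS'
    on the lock table ([applied] = true only if the row did not exist)."""
    def __init__(self):
        self.locks = set()

    def try_insert(self, resource):
        if resource in self.locks:
            return None          # [applied] = false: slot already held
        self.locks.add(resource)
        return resource          # [applied] = true: lock acquired

def release_locks(session, locks):
    for lock in locks.values():
        session.locks.discard(lock)

def acquire_dc_locks(session, dcs, dc_parallelism):
    """Try to claim one 'RepairResource-{dc}-{i}' slot in every involved DC.

    Returns the acquired locks, or None (after releasing any partial set)
    if some DC has no free slot -- mirroring the pseudocode above.
    """
    dc_locks = {}
    for dc in dcs:
        for i in range(dc_parallelism(dc)):
            lock = session.try_insert("RepairResource-%s-%d" % (dc, i))
            if lock is not None:
                dc_locks[dc] = lock
                break
    if len(dc_locks) != len(dcs):
        release_locks(session, dc_locks)
        return None
    return dc_locks
```

With {{dc_parallelism}} of 1 per DC, a second scheduler instance calling {{acquire_dc_locks}} on the same lock table finds every slot taken and backs off, which is exactly the mutual-exclusion behavior the slot naming is meant to provide.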
If we are to support concurrent repairs in the first pass, I think we need both dc_parallelism and node_parallelism options together. I thought we could extend the previous lock-acquiring algorithm with:

{noformat}
dc_locks = previous algorithm
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
    return
node_locks = {}
nodes = repair_nodes(table, range)
for node in nodes:
    for i in 0..node_parallelism(node):
        if ((lock = node.execute("INSERT INTO lock (resource) VALUES ('RepairResource-{node}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            node_locks[node] = lock
            break
if len(node_locks) != len(nodes):
    release_locks(dc_locks)
    release_locks(node_locks)
else:
    start_repair(table)
{noformat}

This is becoming a bit complex, and there are probably some edge cases and/or starvation scenarios, so we should think carefully about it before jumping into implementation. What do you think about this approach? Should we stick to a simpler non-parallel version in the first pass, or think this through and already support parallelism in the first version?

bq. It should be possible to extend the repair scheduler with subrange repairs

I like the token_division approach for supporting subrange repairs in addition to {{-pr}}, but we can think about this later.

bq. Agreed, are there any other scenarios that we might have to take into account?

I can only think of upgrades and range movements (bootstrap, move, removenode, etc.) right now. We should also think more about possible failure scenarios and network partitions. What happens if the node cannot renew locks in a remote DC due to a temporary network partition but the repair is still running? We should probably cancel a repair if it is not able to renew the lock, and also have some kind of garbage collector to kill ongoing repair sessions without associated locks, to protect against disrespecting the configured {{dc_parallelism}} and {{node_parallelism}}.
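The renew-or-cancel idea above could be sketched as follows. This is only an illustration under assumed names ({{TtlLockTable}}, {{gc_orphaned_repairs}}, a {{cancel()}} hook on the repair session); the TTL-expiry behavior stands in for the {{USING TTL 30}} on the lock row, and the clock is injectable so expiry can be simulated:

```python
import time

class TtlLockTable:
    """Toy stand-in for the lock table: an entry stays held only if it was
    renewed within the last `ttl` seconds (like 'USING TTL 30')."""
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.expiry = {}    # resource -> absolute expiry time

    def renew(self, resource):
        self.expiry[resource] = self.clock() + self.ttl

    def is_held(self, resource):
        return self.expiry.get(resource, 0.0) > self.clock()

def gc_orphaned_repairs(lock_table, running_repairs):
    """Kill repair sessions whose lock expired (e.g. renewal failed during a
    network partition), so the configured dc_parallelism/node_parallelism
    limits cannot be exceeded by orphaned sessions. Returns the survivors."""
    survivors = {}
    for resource, repair in running_repairs.items():
        if lock_table.is_held(resource):
            survivors[resource] = repair
        else:
            repair.cancel()   # hypothetical cancel hook on the repair session
    return survivors
```

The symmetric half, cancelling the repair from the holder's side when renewal itself fails, would live in the renewal loop of the node that started the repair; the garbage collector above is only the safety net for sessions whose owner never got the chance to clean up.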
> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a
> required task, but this can both be hard for new users and it also requires a
> bit of manual configuration. There are good tools out there that can be used
> to simplify things, but wouldn't this be a good feature to have inside of
> Cassandra? To automatically schedule and run repairs, so that when you start
> up your cluster it basically maintains itself in terms of normal
> anti-entropy, with the possibility for manual configuration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)