[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142861#comment-15142861 ]
Paulo Motta commented on CASSANDRA-10070:
-----------------------------------------

Sorry for the delay, will try to be faster on the next iterations. Below are some comments on your previous reply:

bq. A problem with this table is that if we have a setup with two data centers and three replicas in each data center, then we have a total of six replicas and QUORUM would require four replicas to succeed. This would require that both data centers are available to be able to run repair.

All data centers involved in a repair must be available for a repair to start/succeed, so if we make the lock resource dc-aware and try to create the lock by contacting a node in each involved data center with LOCAL_SERIAL consistency, that should be sufficient to ensure correctness without the need for a global lock. This will also play along well with both a global {{dc_parallelism}} option and with the {{\-\-local}} or {{\-\-dcs}} table repair options. I thought of something along those lines:

{noformat}
dc_locks = {}
dcs = repair_dcs(keyspace, table) # depends on both keyspace settings and table repair settings (--local or --dcs)
for dc in dcs:
    for i in 0..dc_parallelism(dc):
        if ((lock = get_node(dc).execute("INSERT INTO lock (resource) VALUES ('RepairResource-{dc}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            dc_locks[dc] = lock
            break
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
else:
    start_repair(table)
{noformat}

bq. Just a question regarding your suggestion with the node_repair_parallelism. Should it be used to specify the number of repairs a node can initiate or how many repairs the node can be an active part of in parallel? I guess the second alternative would be harder to implement, but it is probably what one would expect.

The second alternative is probably the most desirable. Actually, {{dc_parallelism}} by itself might cause problems, since we can have a situation where all repairs run on a single node or range, overloading those nodes.
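To make the intended LWT semantics of the pseudocode above concrete, here is a minimal runnable sketch. All names ({{FakeSession}}, {{acquire_dc_locks}}, {{try_insert}}) are hypothetical; a real implementation would run the {{INSERT ... IF NOT EXISTS}} against one node per involved DC at LOCAL_SERIAL, whereas this toy session just simulates the "applied / not applied" outcome of the conditional insert:

```python
class FakeSession:
    """Toy stand-in for a session: simulates 'INSERT ... IF NOT EXISTS'
    on the lock table ([applied] = true only if the row did not exist)."""
    def __init__(self):
        self.locks = set()

    def try_insert(self, resource):
        if resource in self.locks:
            return None          # [applied] = false: slot already held
        self.locks.add(resource)
        return resource          # [applied] = true: lock acquired

def release_locks(session, locks):
    for lock in locks.values():
        session.locks.discard(lock)

def acquire_dc_locks(session, dcs, dc_parallelism):
    """Try to claim one 'RepairResource-{dc}-{i}' slot in every involved DC.

    Returns the acquired locks, or None (after releasing any partial set)
    if some DC has no free slot -- mirroring the pseudocode above.
    """
    dc_locks = {}
    for dc in dcs:
        for i in range(dc_parallelism(dc)):
            lock = session.try_insert("RepairResource-%s-%d" % (dc, i))
            if lock is not None:
                dc_locks[dc] = lock
                break
    if len(dc_locks) != len(dcs):
        release_locks(session, dc_locks)
        return None
    return dc_locks
```

With {{dc_parallelism}} of 1 per DC, a second scheduler instance calling {{acquire_dc_locks}} on the same lock table finds every slot taken and backs off, which is exactly the mutual-exclusion behavior the slot naming is meant to provide.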
If we are to support concurrent repairs in the first pass, I think we need both dc_parallelism and node_parallelism options together. I thought we could extend the previous lock-acquiring algorithm with:

{noformat}
dc_locks = previous algorithm
if len(dc_locks) != len(dcs):
    release_locks(dc_locks)
    return
node_locks = {}
nodes = repair_nodes(table, range)
for node in nodes:
    for i in 0..node_parallelism(node):
        if ((lock = node.execute("INSERT INTO lock (resource) VALUES ('RepairResource-{node}-{i}') IF NOT EXISTS USING TTL 30;", LOCAL_SERIAL)) != nil):
            node_locks[node] = lock
            break
if len(node_locks) != len(nodes):
    release_locks(dc_locks)
    release_locks(node_locks)
else:
    start_repair(table)
{noformat}

This is becoming a bit complex, and there are probably some edge cases and/or starvation scenarios, so we should think carefully about it before jumping into implementation. What do you think about this approach? Should we stick to a simpler non-parallel version in the first pass, or think this through and already support parallelism in the first version?

bq. It should be possible to extend the repair scheduler with subrange repairs

I like the token_division approach for supporting subrange repairs in addition to {{-pr}}, but we can think about this later.

bq. Agreed, are there any other scenarios that we might have to take into account?

I can only think of upgrades and range movements (bootstrap, move, removenode, etc.) right now. We should also think more about possible failure scenarios and network partitions. What happens if the node cannot renew locks in a remote DC due to a temporary network partition but the repair is still running? We should probably cancel a repair if it is not able to renew the lock, and also have some kind of garbage collector to kill ongoing repair sessions without associated locks, to protect against disrespecting the configured {{dc_parallelism}} and {{node_parallelism}}.
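The renew-or-cancel idea above could be sketched as follows. This is only an illustration under assumed names ({{TtlLockTable}}, {{gc_orphaned_repairs}}, a {{cancel()}} hook on the repair session); the TTL-expiry behavior stands in for the {{USING TTL 30}} on the lock row, and the clock is injectable so expiry can be simulated:

```python
import time

class TtlLockTable:
    """Toy stand-in for the lock table: an entry stays held only if it was
    renewed within the last `ttl` seconds (like 'USING TTL 30')."""
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.expiry = {}    # resource -> absolute expiry time

    def renew(self, resource):
        self.expiry[resource] = self.clock() + self.ttl

    def is_held(self, resource):
        return self.expiry.get(resource, 0.0) > self.clock()

def gc_orphaned_repairs(lock_table, running_repairs):
    """Kill repair sessions whose lock expired (e.g. renewal failed during a
    network partition), so the configured dc_parallelism/node_parallelism
    limits cannot be exceeded by orphaned sessions. Returns the survivors."""
    survivors = {}
    for resource, repair in running_repairs.items():
        if lock_table.is_held(resource):
            survivors[resource] = repair
        else:
            repair.cancel()   # hypothetical cancel hook on the repair session
    return survivors
```

The symmetric half, cancelling the repair from the holder's side when renewal itself fails, would live in the renewal loop of the node that started the repair; the garbage collector above is only the safety net for sessions whose owner never got the chance to clean up.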
> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a
> required task, but this can both be hard for new users and it also requires a
> bit of manual configuration. There are good tools out there that can be used
> to simplify things, but wouldn't this be a good feature to have inside of
> Cassandra? To automatically schedule and run repairs, so that when you start
> up your cluster it basically maintains itself in terms of normal
> anti-entropy, with the possibility for manual configuration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)