[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134231#comment-15134231 ]
Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

[~yukim] [~pauloricardomg] Thanks for the comments, great questions/suggestions!

Regarding your questions about the locking:
{quote}
* What would "lock resource" be like for repair scheduling? I think the value controls the number of repair jobs running at a given time in the whole cluster, and we don't want to run too many repair jobs at once.
* I second Yuki Morishita's first question above, in that we need to better specify how cluster-wide repair parallelism is handled: is it fixed or configurable? Can a node run repair for multiple ranges in parallel? Perhaps we should have a node_repair_parallelism (default 1) and dc_repair_parallelism (default 1) global config and reject starting repairs above those thresholds.
{quote}
The thought with the lock resource was that it could be something simple, like a table defined as:
{noformat}
CREATE TABLE lock (
    resource text PRIMARY KEY
)
{noformat}
The different nodes would then try to take the lock using LWT with a TTL:
{noformat}
INSERT INTO lock (resource) VALUES ('RepairResource') IF NOT EXISTS USING TTL 30;
{noformat}
After that, the node would have to keep updating the locked resource while running the repair, to prevent someone else from taking it over. The value "RepairResource" could just as easily be defined as "RepairResource-N", which would make it possible to run repairs in parallel.

A problem with this table is that in a setup with two data centers and three replicas in each, we have a total of six replicas, and QUORUM would require four replicas to succeed. Running repair would therefore require both data centers to be available. Since some of the keyspaces might not be replicated across both data centers, we would still have to be able to run repair even if one of the data centers is unavailable.
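To make the intended semantics concrete, here is a minimal in-memory sketch of the lock behavior described above: acquire via compare-and-set with a TTL, keep renewing while the repair runs, and use N "RepairResource-N" rows to cap cluster-wide parallelism. This only simulates the LWT semantics; a real implementation would issue the INSERT ... IF NOT EXISTS / UPDATE ... IF statements against Cassandra, and the names here (LockTable, acquire_any) are illustrative, not from the proposal.

```python
import time

class LockTable:
    """In-memory stand-in for the proposed 'lock' table: one row per
    resource, expiring after a TTL unless the holder keeps renewing it."""

    def __init__(self):
        self._rows = {}  # resource -> (holder, expiry timestamp)

    def try_acquire(self, resource, holder, ttl, now=None):
        """Compare-and-set insert: succeeds only if no unexpired row exists
        (mirrors INSERT ... IF NOT EXISTS USING TTL)."""
        now = time.time() if now is None else now
        row = self._rows.get(resource)
        if row is not None and row[1] > now:
            return row[0] == holder  # already held, possibly by us
        self._rows[resource] = (holder, now + ttl)
        return True

    def renew(self, resource, holder, ttl, now=None):
        """Refresh the TTL; only the current holder may renew."""
        now = time.time() if now is None else now
        row = self._rows.get(resource)
        if row is None or row[1] <= now or row[0] != holder:
            return False
        self._rows[resource] = (holder, now + ttl)
        return True

def acquire_any(table, node, parallelism, ttl, now=None):
    """Try 'RepairResource-0' .. 'RepairResource-(N-1)' in order, so that
    N lock rows allow at most N repairs to run cluster-wide."""
    for i in range(parallelism):
        resource = "RepairResource-%d" % i
        if table.try_acquire(resource, node, ttl, now=now):
            return resource
    return None
```

With parallelism 2, two nodes can hold locks at once; a third is rejected until one holder stops renewing and its TTL lapses.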
This also applies if we should "force" local-DC repairs when a data center has been unavailable for too long. As I see it, there are two options for solving this:
* Get the lock with LOCAL_SERIAL during these scenarios.
* Have a separate lock table for each data center *and* a global one.

I guess the easiest solution would be to use LOCAL_SERIAL, but I'm not sure if it might cause some unexpected behavior. Going for the other option with separate tables would probably increase the overall complexity, but it would make it easier to restrict the number of parallel repairs in a single data center.

Just a question regarding your suggestion with node_repair_parallelism: should it specify the number of repairs a node can initiate, or how many repairs the node can be an active part of in parallel? I guess the second alternative would be harder to implement, but it is probably what one would expect.

---

{quote}
* It seems the scheduling only makes sense for repairing the primary range of the node ('nodetool -pr'), since we end up repairing all nodes eventually. Are you considering other options like subrange ('nodetool -st -et') repair?
* For subrange repair, we could maybe have something similar to reaper's segmentCount option, but since this would add more complexity we could leave it for a separate ticket.
{quote}
It should be possible to extend the repair scheduler with subrange repairs, either by having it as an option per table or by having a separate scheduler for it. The separate scheduler would just be another plugin that could replace the default repair scheduler.
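As a rough illustration of what a segmentCount-style subrange scheduler would have to do, the sketch below divides the full ring into N contiguous (start, end] slices, each of which could then be repaired with 'nodetool repair -st <start> -et <end>'. It assumes Murmur3Partitioner's token range; the function name is hypothetical.

```python
# Murmur3Partitioner's token space in Cassandra.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1

def split_token_range(segment_count):
    """Divide the full token ring into `segment_count` contiguous
    (start, end] slices of near-equal size."""
    total = MAX_TOKEN - MIN_TOKEN  # 2**64 - 1 tokens in the ring
    segments = []
    start = MIN_TOKEN
    for i in range(1, segment_count + 1):
        # Integer arithmetic keeps the slices contiguous with no gaps.
        end = MIN_TOKEN + total * i // segment_count
        segments.append((start, end))
        start = end
    return segments
```

For token_division='2048' this yields 2048 slices covering the whole ring exactly once; a per-node scheduler would intersect these with the node's owned ranges before repairing.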
If we go for a table configuration, the user could either specify pr or the number of segments to divide the token range into, something like:
{noformat}
repair_options = {..., token_division='pr'};   // Use primary range repair
or
repair_options = {..., token_division='2048'}; // Divide the token range into 2048 slices
{noformat}
If we would have a separate scheduler, it could just be a configuration option for it. Personally I would prefer to have it all in a single scheduler, and I agree that it should probably be a separate ticket to keep the complexity of the base scheduler to a minimum. But I think this is a feature that will be very much needed, both with non-vnode token assignment and also with the possibility to reduce the number of vnodes as of CASSANDRA-7032.

---

{quote}
* While pausing repair is a nice feature for user-based interruptions, we could probably embed system-known interruptions (such as when a bootstrap or upgrade is going on) in the default rejection logic.
{quote}
Agreed, are there any other scenarios that we might have to take into account?

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a required task, but this can both be hard for new users and it also requires a bit of manual configuration. There are good tools out there that can be used to simplify things, but wouldn't this be a good feature to have inside of Cassandra? To automatically schedule and run repairs, so that when you start up your cluster it basically maintains itself in terms of normal anti-entropy, with the possibility for manual configuration.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)