[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

Stefan Podkowinski (JIRA) Thu, 05 Apr 2018 04:35:21 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426773#comment-16426773
 ]


Stefan Podkowinski commented on CASSANDRA-14346:
------------------------------------------------

If we keep the scope of this ticket to schedule repairs in Cassandra, we should 
really talk a bit more about the different requirements users have and how 
using the described solution in practice would look like.

There are several aspect to consider for coming up with a working repair 
schedule:
 * number of tables (from a single table per cluster to hundreds of tables)
 * priority in repairing tables (some tables should be repaired more often, 
others never at all)
 * data size per table (large table should not block repairs for smaller more 
important ones)
 * predictable cluster load (try to schedule repairs off hours)
 * sustainable repair intensity (repair sessions should not leak into peak 
hours)
 * different gc_grace periods (plan intervals for each table so we can tolerate 
missing a repair run)

Repair schedules, which will take these aspects into account, require a certain 
flexibility and some more careful configuration. Tools, such as reaper, allow 
you to put together such plans already. Looking at the configuration options 
described in the design document, I'd probably still want to use such an 
external tool. That would be mostly due to the use of delays instead of 
recurring repair times and the way you'd have to configure repairs on table 
level, which probably gets a bit "messy" fast when you have a lot of tables. 
The lack of any reporting doesn't help either to further tune these config 
options afterwards.

I think the intention is to keep the scope of this ticket to "integrated repair 
scheduling and execution", so I'll spare you any of my thoughts about how we 
should coordinate and execute repairs differently in a post CASSANDRA-9143 
world. But if we want to solve scheduling on top of our existing repair 
implementation, we have to make sure that we can compete with existing 3rd 
party solutions.

So far it was already suggested to move on incrementally. But then we also have 
to think about how improvements could be implemented on top of the proposed 
solution. I'd assume that optimizations would be easier to implement in 
external tools or sidecars that communicates via an IPC interface, compared to 
a baked in solution, which is using the yaml config, table properties, or has 
to deal with upgrade paths. From my impression, 3rd party projects are probably 
also a better place to quickly iterate on these kind of problems.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

Reply via email to