[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445431#comment-16445431
 ] 

Kurt Greaves commented on CASSANDRA-14346:
------------------------------------------

{quote}What I'm assuming Kurt meant is the fact that we rely on an unbroken jmx 
connection on the repair client side.
{quote}
This. I'll note that I'm not yet super familiar with all the repair 
improvements in 4.0, but I was still under the impression that we're really 
relying on stable jmx connections because we don't yet properly handle every 
failure case in a repair, or detect repairs that are already running. I 
haven't tested this on trunk though, so I can't say that with absolute 
confidence.

But in general, I'm saying that any improvements in that domain should be our 
first step for this task, and we should worry about the scheduling afterwards. 
This way we can ensure that those improvements get into 4.0 and that third 
party developers can also utilise them, rather than everything being tied to 
the scheduling patch.

bq. If so we can definitely do the splitting for incremental as well. We like 
splitting up the token ranges into similarly sized pieces because it makes the 
timeout logic much easier to reason about (long running repairs are super 
annoying to tell if they are stuck or not).
This was what I was getting at, but if you're doing subrange repair in trunk, 
incremental is the default, so this optimisation is already possible.
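
To make the splitting idea concrete, here is a minimal sketch of cutting a 
token range into similarly sized pieces. It is not the Priam or Cassandra 
implementation. It assumes integer tokens as used by Murmur3Partitioner 
(-2^63 to 2^63-1) and a non-wrapping range; the function name 
split_token_range is hypothetical.

```python
from typing import List, Tuple

# Token bounds of Murmur3Partitioner.
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_range(start: int, end: int, pieces: int) -> List[Tuple[int, int]]:
    """Split the range (start, end] into contiguous subranges of near-equal width.

    Hypothetical helper for illustration only. Wrapping ranges
    (start >= end) are deliberately not handled in this sketch.
    """
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if start >= end:
        raise ValueError("wrapping or empty ranges are not handled here")
    width = end - start
    # Never produce more pieces than there are tokens in the range.
    pieces = min(pieces, width)
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return list(zip(bounds, bounds[1:]))


if __name__ == "__main__":
    for sub_start, sub_end in split_token_range(MIN_TOKEN, MAX_TOKEN, 4):
        print(sub_start, sub_end)
```

Each resulting pair could then be repaired on its own (for example as the 
start and end token of a subrange repair), which is what lets a single timeout 
value apply to every piece.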

bq. but you really shouldn't do subrange incremental repairs unless you have a 
really good reason, since you'll do a lot of additional anti-compaction. 
Anyway, as long as you're running incremental repair regularly, you should be 
able to repair full token ranges in less than 30 min.
Yeah, this is a downside I'm mildly concerned about. However, as noted, if you 
repair regularly (and with scheduling, why wouldn't you?) this problem goes 
away. It's mostly going to be the initial repair that's problematic, but 
there are options there (it could be done as a full repair). It's also worth 
noting that CASSANDRA-10540 would mostly fix this problem.


> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
