[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16580715#comment-16580715
 ] 

Joseph Lynch edited comment on CASSANDRA-14346 at 8/15/18 5:07 AM:
-------------------------------------------------------------------

Ok, [~vinaykumarcse] and I have a Cassandra trunk patchset we believe is ready 
for review to start. We are still tying up a few loose ends, especially around 
integration tests (all of our e2e tests used the driver's CCMBridge system, not 
python dtests, so we're still working on porting those over), but I think in the 
interest of giving folks like [~bdeggleston] or others a chance to start 
reviewing we should get it started. Below is the patch-set broken into 13 
hopefully easily reviewable commits.
||trunk||
|[patch|https://github.com/apache/cassandra/compare/trunk...jolynch:trunk_CASSANDRA-14346_review]|
|[pull request for reviewing|https://github.com/jolynch/cassandra/pull/2]|
|[unit tests|https://circleci.com/gh/jolynch/workflows/cassandra/tree/trunk_CASSANDRA-14346_review] [!https://circleci.com/gh/jolynch/cassandra/tree/trunk_CASSANDRA-14346_review.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/workflows/cassandra/tree/trunk_CASSANDRA-14346_review]|

I believe that we implement all proposed scope for V1 of the repair scheduler 
as per the design document. A short summary of included functionality:
 # A basic HTTP sidecar integrated with the ant build (CASSANDRA-14395)
 # CircleCI compiles the sidecar and runs the sidecar unit tests
 # A fully decentralized orchestration engine for running repair and hooks in a 
resilient, reliable manner
 # Pluggable interface for abstracting different repair APIs with 4.x 
implementation by default. We believe that 3.0, 3.11 and 4.x can all be 
supported via {{CassandraInteraction}} implementations.
 # Supports typical repair types: vnodes, no vnodes, incremental repair, 
full+subrange repair. These are also abstracted (see #4) so that the 
maintenance burden should hopefully be low going forward. Range splits support 
splitting on partition count, size, or adaptive (auto).
 # Pluggable configuration with yaml implementation by default.
 # Pluggable post repair hooks (actions run after repair is done on all 
neighbors) with cleanup and compaction implementations by default (cleanup is 
enabled by default)
 # Additional repair scheduling metrics
 # Schema/infra support for schedules although only one schedule is supported 
at this time.
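
To illustrate what splitting a range on partition count could look like, here is a minimal sketch. It is not the code from the patch: the function name {{split_range}}, its parameters, and the even-width splitting strategy are all assumptions made for illustration, and it ignores wrap-around token ranges.

```python
def split_range(start, end, estimated_partitions, partitions_per_split):
    """Split the token range (start, end] into contiguous subranges.

    Hypothetical helper, not the scheduler's API. It assumes partitions are
    spread evenly over the range, so each subrange holds roughly
    `partitions_per_split` partitions.
    """
    if start >= end:
        raise ValueError("wrap-around or empty ranges are not handled in this sketch")
    if partitions_per_split <= 0:
        raise ValueError("partitions_per_split must be positive")

    # Integer ceiling division: how many subranges are needed.
    splits = (estimated_partitions + partitions_per_split - 1) // partitions_per_split
    # At least one subrange, and never more subranges than tokens in the range.
    splits = max(1, min(splits, end - start))

    width = end - start
    # Evenly spaced boundaries; the last boundary is always exactly `end`.
    bounds = [start + (width * i) // splits for i in range(splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, a range of 100 tokens estimated to hold 1000 partitions, with a target of 250 partitions per split, yields four subranges that could each be repaired separately. A size-based split would follow the same shape with an estimated byte count in place of the partition count.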

We are also missing some things. In particular, we have to finish porting our 
dtests over to python (from {{CCMBridge}}) and the rest of our unit tests to 
{{EmbeddedCassandra}} (from {{CassandraUnit}}), and there are still a few TODOs 
to track down. [~bdeggleston], [~kohlisankalp], if we can get help getting the 
review started that would be amazing, as I imagine there is going to be a lot of 
feedback and we'll need time to incorporate it all and battle test it for 4.0.



> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>              Labels: 4.0-feature-freeze-review-requested, 
> CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)