+0 to sidecar. To make that work well, we need to expose state that the node has so the sidecar can make good calls; if it runs in the node, then nothing has to be exposed. One thing to flesh out is where the “smarts” live. If the range has too many partitions, which system knows to subdivide the range and sequence the repairs (else you OOM)? “Should” repair itself be better, accept any input, and make sure it works correctly, so the caller just worries about scheduling? Or “should” the scheduler understand the limitations of repair and work around them?
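To make the "subdivide the range and sequence the repairs" question concrete, here is a minimal sketch in Python of what that logic could look like, wherever it ends up living. It is an illustration only: it assumes partitions are spread roughly uniformly over the token range, that some estimate of the partition count is available to the caller, and that the range does not wrap around. The names `split_range`, `repair_in_sequence`, and `repair_one` are hypothetical and are not Cassandra or sidecar APIs.

```python
from typing import Callable, List, Tuple

# A token range (start, end]; this sketch assumes start < end (no wrap-around).
TokenRange = Tuple[int, int]


def split_range(start: int, end: int,
                estimated_partitions: int,
                max_partitions_per_repair: int) -> List[TokenRange]:
    """Split (start, end] into contiguous subranges so that each one is
    expected to hold at most max_partitions_per_repair partitions,
    assuming partitions are spread uniformly over the token range."""
    if start >= end:
        raise ValueError("start must be less than end")
    if max_partitions_per_repair <= 0:
        raise ValueError("max_partitions_per_repair must be positive")

    # Integer ceiling division: how many pieces are needed.
    pieces = max(1, -(-estimated_partitions // max_partitions_per_repair))
    width = end - start
    # A range cannot be split more finely than one token per piece.
    pieces = min(pieces, width)

    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_in_sequence(subranges: List[TokenRange],
                       repair_one: Callable[[TokenRange], None]) -> int:
    """Run the (hypothetical) repair_one callback on one subrange at a
    time, so only one subrange is in flight. Returns how many ran."""
    done = 0
    for sub in subranges:
        repair_one(sub)  # blocks until this subrange has been repaired
        done += 1
    return done


if __name__ == "__main__":
    subranges = split_range(0, 100, estimated_partitions=1000,
                            max_partitions_per_repair=250)
    repair_in_sequence(subranges, lambda r: print("repairing", r))
```

The same two functions could sit inside repair itself (the caller passes one large range) or inside the scheduler (it works around repair's limits); the sketch does not answer which, it only shows that the partition estimate is the piece of node state either side would need.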

> On Jul 25, 2023, at 11:26 AM, Jeremiah Jordan <jeremiah.jor...@gmail.com> wrote:
>
> +1 for the side car being the right location.
>
> -Jeremiah
>
> On Jul 25, 2023 at 1:16:14 PM, Chris Lohfink <clohfin...@gmail.com> wrote:
>> I think a CEP is the next step. Considering the number of companies
>> involved, this might necessitate several drafts and rounds of discussions. I
>> appreciate your initiative in starting this process, and I'm eager to
>> contribute to the ensuing discussions. Maybe in a google docs or something
>> initially for more interactive feedback?
>>
>> In regards to https://issues.apache.org/jira/browse/CASSANDRA-14346 we at
>> Netflix are actually putting effort currently to move this into the sidecar
>> as the idea was to start moving non-read/write path things into different
>> process and jvms to not impact each other.
>>
>> I think the sidecar/in process discussion might be a bit contentious as I
>> know even things like compaction some feel should be moved out of process in
>> future. On a personal note, my primary interest lies in seeing the
>> implementation realized, so I am willing to support whatever consensus
>> emerges. Whichever direction these go we will help with the implementation.
>>
>> Chris
>>
>> On Tue, Jul 25, 2023 at 1:09 PM Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>> Sounds good, German. Feel free to let me know if you need my help in filing
>>> CEP, adding supporting content to the CEP, etc.
>>> As I mentioned previously, I have already been working (going through an
>>> internal review) on creating a one-pager doc, code, etc., that has been
>>> working for us for the last six years at an immense scale, and I will share
>>> it soon on a private fork.
>>>
>>> Thanks,
>>> Jaydeep
>>>
>>> On Tue, Jul 25, 2023 at 9:48 AM German Eichberger via dev
>>> <dev@cassandra.apache.org> wrote:
>>>> In [2] we suggested that the next step should be a CEP.
>>>>
>>>> I am happy to lend a hand to this effort as well.
>>>>
>>>> Thanks Jaydeep and David - really appreciated.
>>>>
>>>> German
>>>>
>>>> From: David Capwell <dcapw...@apple.com>
>>>> Sent: Tuesday, July 25, 2023 8:32 AM
>>>> To: dev <dev@cassandra.apache.org>
>>>> Cc: German Eichberger <german.eichber...@microsoft.com>
>>>> Subject: [EXTERNAL] Re: [Discuss] Repair inside C*
>>>>
>>>> As someone who has done a lot of work trying to make repair stable, I
>>>> approve of this message ^_^
>>>>
>>>> More than glad to help mentor this work
>>>>
>>>> On Jul 24, 2023, at 6:29 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com> wrote:
>>>>
>>>> To clarify the repair solution timing, the one we have listed in the
>>>> article is not the recently developed one. We were hitting some
>>>> high-priority production challenges back in early 2018, and to address
>>>> that, we developed and rolled out the solution in production in just a few
>>>> months. The timing-wise, the solution was developed and productized by Q3
>>>> 2018, of course, continued to evolve thereafter. Usually, we explore the
>>>> existing solutions we can leverage, but when we started our journey in
>>>> early 2018, most of the solutions were based on sidecar solutions. There
>>>> is nothing against the sidecar solution; it was just a pure business
>>>> decision, and in that, we wanted to avoid the sidecar to avoid a
>>>> dependency on the control plane. Every solution developed has its deep
>>>> context, merits, and pros and cons; they are all great solutions!
>>>>
>>>> An appeal to the community members is to think one more time about having
>>>> repairs in the Open Source Cassandra itself. As mentioned in my previous
>>>> email, any solution getting adopted is fine; the important aspect is to
>>>> have a repair solution in the OSS Cassandra itself!
>>>>
>>>> Yours Faithfully,
>>>> Jaydeep
>>>>
>>>> On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia
>>>> <chovatia.jayd...@gmail.com> wrote:
>>>> Hi German,
>>>>
>>>> The goal is always to backport our learnings back to the community. For
>>>> example, I have already successfully backported the following two
>>>> enhancements/bug fixes back to the Open Source Cassandra, which are
>>>> described in the article. I am already currently working on open-source a
>>>> few more enhancements mentioned in the article back to the open-source.
>>>> https://issues.apache.org/jira/browse/CASSANDRA-18555
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13740
>>>> There is definitely heavy interest in having the repair solution inside
>>>> the Open Source Cassandra itself, very much like Compaction. As I write
>>>> this email, we are internally working on a one-pager proposal doc to all
>>>> the community members on having a repair inside the OSS Apache Cassandra
>>>> along with our private fork - I will share it soon.
>>>>
>>>> Generally, we are ok with any solution getting adopted (either Joey's
>>>> solution or our repair solution or any other solution). The primary
>>>> motivation is to have the repair embedded inside the open-source Cassandra
>>>> itself, so we can retire all various privately developed solutions
>>>> eventually :)
>>>>
>>>> I am also happy to help (drive conversation, discussion, etc.) in any way
>>>> to have a repair solution adopted inside Cassandra itself, please let me
>>>> know. Happy to help!
>>>>
>>>> Yours Faithfully,
>>>> Jaydeep
>>>>
>>>> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev
>>>> <dev@cassandra.apache.org> wrote:
>>>> All,
>>>>
>>>> We had a brief discussion in [2] about the Uber article [1] where they
>>>> talk about having integrated repair into Cassandra and how great that is.
>>>> I expressed my disappointment that they didn't work with the community on
>>>> that (Uber, if you are listening time to make amends 🙂) and it turns out
>>>> Joey already had the idea and wrote the code [3] - so I wanted to start a
>>>> discussion to gauge interest and maybe how to revive that effort.
>>>>
>>>> Thanks,
>>>> German
>>>>
>>>> [1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>>>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346