The rest of the Netflix Cassandra team and I share Dinesh's concerns. I was excited to work on this project precisely because we were taking only the best designs, techniques, and functionality from community sidecars such as Priam and Reaper, and from any other community tools, and building into Cassandra the simplest possible tool that could deliver the maximum value to our users with the minimum amount of technical debt. For example, a distributed, shared-nothing architecture that communicates only through state transitions in Cassandra data itself seems to be the most robust and secure architecture (and indeed Reaper appears to be refactoring towards that). Fundamental architecture is, in my experience, very hard to refactor, and starting fresh with the lessons learned from the N previous iterations is often the faster way to build real value. For example, Reaper was built to be a repair tool, and that assumption is baked into its core abstractions. It sounds like the community needs something more like a fully pluggable distributed task execution engine (plug in whatever ops task you want) that runs scheduled, one-shot, and daemon tasks.
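To make the "pluggable task execution engine" idea a bit more concrete, here is a minimal single-process sketch in Python. Everything in it is hypothetical: the `TaskEngine`, `Task`, and `TaskKind` names and the tick-based logical clock are my own illustration, not taken from any existing sidecar or proposal. It also leaves out the distributed part entirely (coordination through state stored in Cassandra) and only shows how one-shot, scheduled, and daemon tasks could share one plugin interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class TaskKind(Enum):
    ONESHOT = "oneshot"      # run once, then retire
    SCHEDULED = "scheduled"  # run every `interval` ticks
    DAEMON = "daemon"        # run on every tick


@dataclass
class Task:
    """A hypothetical plugin: a name, a kind, and a callable to execute."""
    name: str
    kind: TaskKind
    run: Callable[[], None]
    interval: int = 1  # only meaningful for SCHEDULED tasks


class TaskEngine:
    """Hypothetical plugin registry plus a tick-driven executor."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}
        self._tick = 0

    def register(self, task: Task) -> None:
        if task.name in self._tasks:
            raise ValueError(f"task already registered: {task.name}")
        self._tasks[task.name] = task

    def tick(self) -> List[str]:
        """Advance logical time by one step and run whatever is due."""
        self._tick += 1
        ran: List[str] = []
        # Iterate over a copy so one-shot tasks can be removed safely.
        for task in list(self._tasks.values()):
            if task.kind is TaskKind.SCHEDULED and self._tick % task.interval != 0:
                continue
            task.run()
            ran.append(task.name)
            if task.kind is TaskKind.ONESHOT:
                del self._tasks[task.name]
        return ran


# Usage: three made-up ops tasks, one of each kind.
log: List[str] = []
engine = TaskEngine()
engine.register(Task("restart", TaskKind.ONESHOT, lambda: log.append("restart")))
engine.register(Task("repair", TaskKind.SCHEDULED, lambda: log.append("repair"), interval=2))
engine.register(Task("health", TaskKind.DAEMON, lambda: log.append("health")))
history = [engine.tick() for _ in range(3)]
```

The point of the sketch is only that repair becomes one plugin among many rather than the core abstraction. A real engine would need persistence, failure handling, and cross-node coordination, none of which is shown here.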
What if we started with a basic framework as proposed in CASSANDRA-14395, perhaps adding a pluggable execution engine in the first few commits, so that community members can then contribute plugins/modules that add functionality such as repair, backup, distributed restarts, upgrades, etc.? We would be striving very hard not to reinvent the wheel; rather, we would want to learn from previous iterations, keep what works well, and leave the rest.
Regarding Priam, we could offer to donate it, but I think the community shouldn't accept it because it is full of years of technical debt and decisions made by Netflix for Netflix. For example, Priam currently has four different backup solutions that we have implemented over the years (three used in production, the latest not yet in production), and only the latest one should be contributed to the official sidecar. The latest iteration is similar to the architecture of https://github.com/hashbrowncipher/cassandra-mirror, which is capable of per-minute, point-in-time backups; no previous iteration is capable of this. Yes, the earlier versions are "battle hardened", but we know those architectures have fundamental flaws, are overly expensive, or simply won't scale to the next 10x requirement. We have learned from those previous iterations and are creating the next iteration that will scale another order of magnitude. I also wouldn't want to burden reviewers with looking at the first three implementations or with building a mental model, all at once, of how Priam works end to end. Practically speaking, I think it's much more logistically difficult to accept one of the sidecar projects as is than to build a new one incrementally.
The existing sidecars have dependencies that have to be vetted, technical debt that must be trimmed, and tens of thousands of lines of code that have to be reviewed, and even if the community wants to make changes, those changes might be prohibitively difficult because the underlying architecture has solidified. Furthermore, all of these tools were designed without the notion that they would ship with Cassandra, which precludes next-generation features such as moving compaction entirely out of the live request-response path and into a separate process that can be limited with e.g. cgroups to ensure isolation. They have also supported many versions of Cassandra over the years and therefore carry layers of indirection and abstraction added simply to deal with different APIs and versions (I personally think the official sidecar should branch with Cassandra and support current plus previous versions of Cassandra, just like the server does). I hope that we decide as a community to put all the options on the table in the open, learn from all of them, and pursue a solution that takes the best from each and is unencumbered by historical decisions.
-Joey