I think Jordan and German had an interesting insight, or at least their comment made me think about this slightly differently, and I’m going to repeat it so it’s not lost in the discussion about zerocopy / sendfile.

The CEP treats this as “move a live instance from one machine to another”. I know why the author wants to do this.

If you think of it instead as “change backup/restore mechanism to be able to safely restore from a running instance”, you may end up with a cleaner abstraction that’s easier to think about (and may also be easier to generalize in clouds where you have other tools available ). 

I’m not familiar enough with the sidecar to know the state of orchestration for backup/restore, but “ensure the original source node isn’t running” , “migrate the config”, “choose and copy a snapshot” , maybe “forcibly exclude the original instance from the cluster” are all things the restore code is going to need to do anyway, and if restore doesn’t do that today, it seems like we can solve it once. 

Backup probably needs to be generalized to support many sources, too. Object storage is obvious (s3 download). Block storage is obvious (snapshot and reattach). Reading sstables from another sidecar seems reasonable, too.

It accomplishes the original goal, in largely the same fashion, it just makes the logic reusable for other purposes? 





On Apr 19, 2024, at 5:52 PM, Dinesh Joshi <djo...@apache.org> wrote:


On Thu, Apr 18, 2024 at 12:46 PM Ariel Weisberg <ar...@weisberg.ws> wrote:

If there is a faster/better way to replace a node why not  have Cassandra support that natively without the sidecar so people who aren’t running the sidecar can benefit?
 
I am not the author of the CEP so take whatever I say with a pinch of salt. Scott and Jordan have pointed out some benefits of doing this in the Sidecar vs Cassandra. 

Today Cassandra is able to do fast node replacements. However, this CEP is addressing an important corner case when Cassandra is unable to start up due to old / ailing hardware. Can we fix it in Cassandra so it doesn't die on old hardware? Sure. However, you would still need operator intervention to start it up in some special mode both on the old and new node so the new node can peer with the old node, copy over its data and join the ring. This would still require some orchestration outside the database. The Sidecar can do that orchestration for the operator. The point I'm making here is that the CEP addresses a real issue. The way it is currently built can improve over time with improvements in Cassandra.

Dinesh

Reply via email to