Hi all, I have not heard anything in the last 10+ days. I am taking that as a positive sign and proceeding to the voting stage for this CEP.
Thanks!
Hari

On Fri, Jun 7, 2024 at 10:26 PM Venkata Hari Krishna Nukala <
n.v.harikrishna.apa...@gmail.com> wrote:

> Summarizing the discussion that has happened so far.
>
> *Data copy using rsync vs Sidecar*
>
> Data copy via rsync is an incomplete solution and has to be executed
> outside of the Cassandra ecosystem. Data copy via Sidecar is valuable for
> Cassandra to have an ecosystem-native approach outside the streaming path,
> which excludes repairs, decommissions and bootstraps. The proposed solution
> poses fewer security concerns than rsync. An ecosystem-native approach is
> more instrumentable and measurable than rsync, and tooling can be built on
> top of it.
>
> *File digest/checksum*
>
> The initial proposal mentioned that the combination of file path and size
> is used to verify that destination and source have the same set of data.
> Scott, Jon and Dinesh expressed concerns about hitting corner cases where
> just verifying path & size is not good enough. I have updated CEP-40 to
> include binary-level file verification using a digest algorithm.
>
> *Managing C* lifecycle with Sidecar*
>
> The proposed migration process requires bringing the Cassandra instances
> up and down. This CEP called out that bringing the instances up/down is
> not in scope. Jon and Jordan expressed that adding this ability, to make
> the entire workflow self-managed, is the biggest win.
>
> Managing the C* lifecycle (safely start, stop & restart) is already
> considered in scope for CEP-1
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*)>.
> It can be leveraged when implemented as part of CEP-1.
>
> *Abstraction of how files get moved, backup and restore*
>
> Jordan & German mentioned that having an abstraction of how files get
> moved / put in place would allow others to plug in alternative means of
> data movement, like pulling down from backups/S3/any other source.
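To illustrate the binary-level verification idea, here is a minimal sketch (the function names and choice of SHA-256 are illustrative assumptions, not part of CEP-40): digest each file on both sides and compare, instead of relying on path & size alone.

```python
import hashlib

def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Stream the file through a digest so large sstables are not read into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_transfer(source_path: str, dest_path: str) -> bool:
    """Binary-level check: equal digests imply equal contents (with
    overwhelming probability), unlike a path-and-size comparison, which
    misses same-size files with different bytes."""
    return file_digest(source_path) == file_digest(dest_path)
```

A same-size corruption that a path & size check would accept is exactly the corner case this catches.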
> Jeff added the following points. 1) If you think of it instead as "change
> the backup/restore mechanism to be able to safely restore from a running
> instance", you may end up with a cleaner abstraction that is easier to
> think about (and may also be easier to generalize in clouds where you have
> other tools available). 2) "Ensure the original source node isn't
> running", "migrate the config", "choose and copy a snapshot", and maybe
> "forcibly exclude the original instance from the cluster" are all things
> the restore code is going to need to do anyway, and if restore doesn't do
> that today, it seems like we can solve it once. It accomplishes the
> original goal, in largely the same fashion; it just makes the logic
> reusable for other purposes.
>
> Jon and Jordan mentioned that framing the replacement of a node as
> restoring a node and then kicking off a node replacement is an interesting
> perspective. The data copy task mentioned in this CEP can be viewed as a
> restore task/job which treats another running Sidecar as a source. When it
> is generalised to support other sources, like S3 or disk snapshots,
> support for many use cases can be added, such as restoring from S3 or disk
> snapshots.
>
> I have updated CEP-40 with the details of how files get moved and put in
> place, which can be treated as the default implementation for live
> migration. Having a cleaner abstraction/interface for the source has been
> added as one of the goals. The data copy task can be decoupled from live
> migration so that it can be used to copy data from any (remote) source to
> local storage; this way it can be leveraged across different use cases.
> The data copy task endpoint can be tailored to accommodate different
> plugins during implementation. Francisco mentioned that Sidecar now has
> the ability to restore data from S3 (for the Analytics library), and it
> can be extended for live migration, backup and restore, and others.
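One way to picture the source abstraction discussed above, as a sketch only (the interface and method names are hypothetical, not Sidecar's actual API): the data copy task programs against a minimal source interface, and a peer Sidecar, S3, or a snapshot store would each implement it.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List

class DataSource(ABC):
    """Hypothetical interface the data copy task would depend on."""

    @abstractmethod
    def list_files(self) -> List[str]:
        """Relative paths of the sstable files to fetch."""

    @abstractmethod
    def open_stream(self, relative_path: str) -> Iterator[bytes]:
        """Yield the file's contents in chunks."""

class InMemorySource(DataSource):
    """Stand-in for a peer-Sidecar / S3 / snapshot implementation."""

    def __init__(self, files: dict):
        self.files = files

    def list_files(self) -> List[str]:
        return sorted(self.files)

    def open_stream(self, relative_path: str) -> Iterator[bytes]:
        yield self.files[relative_path]

def copy_all(source: DataSource) -> dict:
    """The copy task sees only the interface, so any source can plug in."""
    return {p: b"".join(source.open_stream(p)) for p in source.list_files()}
```

With this shape, restoring from S3 and live-migrating from a running Sidecar become two implementations of the same contract.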
> *Supporting live migration within the Cassandra process instead of Sidecar*
>
> Paulo and Ariel raised a point about supporting migration in the main
> process via entire-sstable streaming, which could also help people who
> aren't running the Sidecar.
>
> Jon, Francisco, Jordan, Scott & Dinesh mentioned the following benefits of
> doing live migration via Sidecar. Sidecar can be used for coordination to
> start and stop instances, or to do things that require something out of
> process. Sidecar would be able to migrate from a Cassandra instance that
> is already dead and cannot recover (not because of disk issues). If we are
> considering the main process, then we have to do some additional work to
> ensure that it doesn't put pressure on the JVM and introduce latency. The
> host replacement process also puts a lot of stress on gossip and is a
> great way to encounter all sorts of painful races if you perform it
> hundreds or thousands of times (but that shouldn't be a problem in the
> TCM world). It is also valuable to have a paved-path implementation of a
> safe migration/forklift state machine for when you're in a bind, or need
> to do this hundreds or thousands of times.
>
> *Migrating a specific keyspace to a dedicated cluster*
>
> Patrick brought up an interesting use case. In his words: "In many cases,
> multiple tenants present cause the cluster to overpressure. The best
> solution in that case is to migrate the largest keyspace to a dedicated
> cluster. Live migration but a bit more complicated. No chance of doing
> this manually without some serious brain surgery on C* and downtime."
>
> With the proposed solution, keyspaces can be copied selectively by
> supplying inclusion/exclusion filters to the data copy task API. If the
> writes for these keyspaces can't be stopped at the source, then the
> destination may not reach the point where 100% of the data of the selected
> keyspaces matches the source. To me, it sounds doable, assuming that the
> writes can be stopped for some time.
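A minimal sketch of how such inclusion/exclusion filtering could behave (parameter names and path layout are illustrative assumptions, not the actual data copy task API): filter candidate data-directory paths by their keyspace component before copying.

```python
from typing import Iterable, List, Optional, Set

def filter_by_keyspace(
    paths: Iterable[str],
    include: Optional[Set[str]] = None,
    exclude: Optional[Set[str]] = None,
) -> List[str]:
    """Keep data-directory paths of the form <keyspace>/<table>/<file>
    whose keyspace passes the include/exclude filters.  `include=None`
    means "everything"; `exclude` always wins over `include`."""
    selected = []
    for path in paths:
        keyspace = path.split("/", 1)[0]
        if include is not None and keyspace not in include:
            continue
        if exclude is not None and keyspace in exclude:
            continue
        selected.append(path)
    return selected
```

For Patrick's use case, an `include` set naming only the largest keyspace would copy just that keyspace's sstables to the dedicated cluster.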
> As Patrick called out, it is a bit more complicated when downtime is not
> acceptable. As Josh mentioned, it can be a stretch goal or a v2 of this
> CEP.
>
> *Live migration + TCM*
>
> Alex raised a point that the cluster may lose both durability and
> availability during the second phase of the data copy. This migration can
> be done in a more durable manner with TCM by leaving the source node as
> both a read and write target, and allowing the new node to be a target
> for writes. It can eliminate the second phase of data copying. Jordan
> also agrees that it is a more durable approach. Alex also felt that it
> would be good to have a good understanding of the availability and
> durability guarantees we want to provide, and to have them stated
> explicitly, for both the "source node down" and "source node up" cases.
> Alex and I had an offline discussion and brought up the point that this
> may require extra care to make sure that the initial data copy sstables
> aren't involved in the regular node sstable lifecycle, since in that case
> we may inadvertently remove or compact them.
>
> Alex is fine with doing it either as part of the current CEP, or as a
> follow-up.
>
> ----
>
> I hope I have covered all the points and addressed them in the CEP.
> Please call out anything I have missed. I feel that we have reached a
> point where we have had a fair amount of discussion and the discussion
> has fairly converged. Is it a good time to call for voting? Do I have to
> wait or do anything else before requesting votes?
>
> Thanks!
> Hari
>
> On Thu, May 30, 2024 at 7:54 PM Alex Petrov <al...@coffeenco.de> wrote:
>
>> Alex, just want to make sure that I understand your point correctly. Are
>> you suggesting this sequence of operations with TCM?
>>
>> * Make config changes
>> * Do the initial data copy
>> * Make destination part of write placements (same as source)
>> * Start destination instance
>> * Decommission the source
>> * Enable reads for destination by making it part of read placements (as
>> source)
>>
>> Almost. I am suggesting reusing the logic we have in TCM and already use
>> for bootstraps and replacements. I think the sequencing will be
>> something like:
>> * Make config changes
>> * Start destination instance
>> * Make destination part of write placements (same as source)
>> * Do the initial data copy
>> * Load sstables from the initial data copy
>> * Enable reads for destination by making it part of read placements
>> * Decommission the source
>>
>> We also had a short discussion offline, and brought up a good point that
>> this may require extra care to make sure that the initial data copy
>> sstables aren't involved in the regular node sstable lifecycle, since in
>> that case we may inadvertently remove or compact them.
>>
>> > It is a fair point. It is good to have an understanding of the
>> availability and durability guarantees during migration. I can create a
>> JIRA for it later.
>>
>> Sounds good. As I mentioned, I'm fine either way: whether we do it as
>> part of this CEP, or as a follow-up.
>>
>> On Sun, May 12, 2024, at 8:18 PM, Venkata Hari Krishna Nukala wrote:
>>
>> Replies from my side for the other points of the discussion:
>>
>> *Managing C* lifecycle with Sidecar*
>>
>> > lifecycle / orchestration portion is the more challenging aspect. It
>> would be nice to address that as well so we don't end up with something
>> like repair where the building blocks are there but the hard parts are
>> left to the operator
>>
>> CEP-1 has lifecycle operations under scope:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*).
>> I think it can be leveraged when implemented as part of CEP-1.
>>
>> *On the backup & restore use case*
>>
>> I see similarities between backup/restore & this migration. But I feel
>> there will be considerable differences while implementing it, and we
>> might need to tailor the API to make it usable for backup & restore. I
>> think making the code/logic reusable can be an implicit goal. Does
>> calling backup & restore a stretch goal, or creating a separate CEP,
>> sound fair?
>>
>> *Migrate the largest keyspace to a dedicated cluster*
>>
>> Patrick, the proposed API can help copy a specific keyspace's data to
>> another cluster. "No chance of doing this manually without some serious
>> brain surgery on c* and downtime" sounds a bit tricky to me. Since the
>> clusters are independent, doing it without any coordination between the
>> clusters and without downtime sounds like a case this CEP is not
>> targeting at the moment.
>>
>> *Live migration + TCM*
>>
>> > We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
>>
>> Alex, just want to make sure that I understand your point correctly. Are
>> you suggesting this sequence of operations with TCM?
>>
>> * Make config changes
>> * Do the initial data copy
>> * Make destination part of write placements (same as source)
>> * Start destination instance
>> * Decommission the source
>> * Enable reads for destination by making it part of read placements (as
>> source)
>>
>> > I am also not against having this done post-factum, after the
>> implementation of the CEP in its current form, but I think it would be
>> good to have a good understanding of the availability and durability
>> guarantees we want to provide with it, and have them stated explicitly,
>> for both the "source node down" and "source node up" cases.
>>
>> It is a fair point. It is good to have an understanding of the
>> availability and durability guarantees during migration. I can create a
>> JIRA for it later.
>>
>> Thanks!
>> Hari
>>
>> On Thu, May 2, 2024 at 12:30 PM Alex Petrov <al...@coffeenco.de> wrote:
>>
>> Thank you for the input!
>>
>> > Would it be possible to create a new type of write target node? The
>> new write target node is notified of writes (like any other write node)
>> but does not participate in the write availability calculation.
>>
>> We could make some kind of optional write, but unfortunately this way we
>> cannot codify our consistency level. Since we already use a notion of
>> pending ranges that requires 1 extra ack, and we as a community are OK
>> with it, I think for simplicity we should stick to the same notion.
>>
>> If there is a lot of interest in this kind of availability/durability
>> tradeoff, we should discuss all the implications in a separate CEP, but
>> then it would probably make sense to make it available for all
>> operations.
>>
>> My personal opinion is that if we can't guarantee/rely on the number of
>> acks, this may accidentally mislead people, as they would expect it to
>> work, and lead to surprises when it does not.
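Alex's TCM-style sequencing from earlier in the thread can be written down as a toy ordered-step sketch (purely illustrative; none of these names exist in Cassandra or Sidecar): the key invariant is that the source is decommissioned only after the destination has become a read target.

```python
from enum import Enum, auto

class Step(Enum):
    MAKE_CONFIG_CHANGES = auto()
    START_DESTINATION = auto()
    ADD_DEST_TO_WRITE_PLACEMENTS = auto()
    INITIAL_DATA_COPY = auto()
    LOAD_COPIED_SSTABLES = auto()
    ADD_DEST_TO_READ_PLACEMENTS = auto()
    DECOMMISSION_SOURCE = auto()

# TCM-style ordering: writes are mirrored to the destination before the
# bulk copy starts, which is what removes the need for a second diff
# phase, and the source stays a read/write target until the very end.
TCM_SEQUENCE = [
    Step.MAKE_CONFIG_CHANGES,
    Step.START_DESTINATION,
    Step.ADD_DEST_TO_WRITE_PLACEMENTS,
    Step.INITIAL_DATA_COPY,
    Step.LOAD_COPIED_SSTABLES,
    Step.ADD_DEST_TO_READ_PLACEMENTS,
    Step.DECOMMISSION_SOURCE,
]

def source_decommissioned_last(sequence) -> bool:
    """Durability invariant: never drop the source before the
    destination is serving reads."""
    return sequence.index(Step.DECOMMISSION_SOURCE) > sequence.index(
        Step.ADD_DEST_TO_READ_PLACEMENTS
    )
```

The earlier sequence in the thread, which decommissioned the source before enabling reads on the destination, would fail this invariant.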
>>
>> On Wed, May 1, 2024, at 4:38 PM, Claude Warren, Jr via dev wrote:
>>
>> Alex,
>>
>> you write:
>>
>> We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
>>
>> Would it be possible to create a new type of write target node? The new
>> write target node is notified of writes (like any other write node) but
>> does not participate in the write availability calculation. In this way,
>> a node that is being migrated to could receive writes and have minimal
>> impact on the current operation of the cluster.
>>
>> Claude
>>
>> On Wed, May 1, 2024 at 12:33 PM Alex Petrov <al...@coffeenco.de> wrote:
>>
>> Thank you for submitting this CEP!
>>
>> Wanted to discuss this point from the description:
>>
>> > How to bring up/down Cassandra/Sidecar instances or making/applying
>> config changes are outside the scope of this document.
>>
>> One advantage of doing migration via Sidecar is the fact that we can
>> stream sstables to the target node from the source node while the source
>> node is down. Also, if the source node is down, it does not matter that
>> we can't use it as a write target. However, if we are replacing a live
>> node, we do lose both durability and availability during the second copy
>> phase. There are copious other advantages described by others in the
>> thread above.
>>
>> For example, we have three adjacent nodes A, B, C and simple RF 3. C
>> (source) is up and is being replaced with live-migrated D (destination).
>> According to the process described in CEP-40, we perform streaming in
>> two phases: the first is a full copy (similar to bootstrap/replacement
>> in Cassandra), and the second is just a diff. The second phase is still
>> going to take a non-trivial amount of time, and is likely to last at the
>> very least minutes. During this time, we only have nodes A and B as both
>> read and write targets, with no alternatives: we have to have both of
>> them present for any operation, and losing either one of them leaves us
>> with only one copy of the data.
>>
>> To contrast this, the TCM bootstrap process is 4-step: between the old
>> owner being phased out and the new owner brought in, we always ensure
>> r/w quorum consistency and liveness of at least 2 nodes for the read
>> quorum, with 3 nodes available for reads in the best case, and 2+1
>> pending replicas for the write quorum, with 4 nodes (3 existing owners +
>> 1 pending) being available for writes in the best case. Replacement in
>> TCM is implemented similarly, with the old node remaining an
>> (unavailable) read target, but the new node already being the target for
>> (pending) writes.
>>
>> We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
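A quick sanity check of the quorum arithmetic above, as a toy calculation (not Cassandra code): with RF=3 a QUORUM write needs 2 acks, and each pending replica adds one mandatory ack, which is where "2+1 instead of just 2" comes from.

```python
def write_acks_required(rf: int, pending_replicas: int = 0) -> int:
    """QUORUM needs a majority of the RF natural replicas, plus one extra
    mandatory ack per pending replica (the pending-range rule referred to
    in the thread)."""
    return rf // 2 + 1 + pending_replicas

# RF=3, steady state: any 2 of {A, B, C} must ack a quorum write.
assert write_acks_required(3) == 2
# RF=3 with D as a pending write target: 2+1 acks are required, so write
# availability drops, but every acknowledged write also lands on D.
assert write_acks_required(3, pending_replicas=1) == 3
```

This is the durability-for-availability trade described above: the extra mandatory ack shrinks the set of tolerable failures but guarantees the migrating node never misses a write.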
>>
>> I think if we want to call this feature "live migration", we may want
>> to provide similar guarantees, since this term is used in the hypervisor
>> community to describe an instant and uninterrupted instance migration
>> from one host to another, without the guest instance being able to
>> notice much more than a time jump.
>>
>> I am also not against having this done post-factum, after the
>> implementation of the CEP in its current form, but I think it would be
>> good to have a good understanding of the availability and durability
>> guarantees we want to provide with it, and have them stated explicitly,
>> for both the "source node down" and "source node up" cases. That said,
>> since we will have to integrate CEP-40 with TCM, and will have to ensure
>> the correctness of sstable diffing for the second phase, it might make
>> sense to consider reusing some of the existing replacement logic from
>> TCM. Just to make sure this is mentioned explicitly: my proposal is only
>> concerned with the second copy phase, without any implications about the
>> first.
>>
>> Thank you,
>> --Alex
>>
>> On Fri, Apr 5, 2024, at 12:46 PM, Venkata Hari Krishna Nukala wrote:
>>
>> Hi all,
>>
>> I have filed CEP-40 [1] for live-migrating Cassandra instances using the
>> Cassandra Sidecar.
>>
>> When someone needs to move all or a portion of the Cassandra nodes
>> belonging to a cluster to different hosts, the traditional approach of
>> Cassandra node replacement can be time-consuming due to repairs and the
>> bootstrapping of new nodes. Depending on the volume of the storage
>> service load, replacements (repair + bootstrap) may take anywhere from a
>> few hours to days.
>>
>> I am proposing a Sidecar-based solution to address these challenges.
>> This solution proposes transferring data from the old host (source) to
>> the new host (destination) and then bringing up the Cassandra process at
>> the destination, to enable fast instance migration.
>> This approach would help to minimise node downtime, as it is based on a
>> Sidecar solution for data transfer and avoids repairs and bootstrap.
>>
>> Looking forward to the discussions.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-40%3A+Data+Transfer+Using+Cassandra+Sidecar+for+Live+Migrating+Instances
>>
>> Thanks!
>> Hari