Hi all, I have not heard anything in the last 10+ days. I am taking that as a positive sign and proceeding to the voting stage for this CEP.
Thanks!
Hari

On Fri, Jun 7, 2024 at 10:26 PM Venkata Hari Krishna Nukala <
n.v.harikrishna.apa...@gmail.com> wrote:

> Summarizing the discussion that has happened so far.
>
> *Data copy using rsync vs Sidecar*
>
> Data copy via rsync is an incomplete solution and has to be executed
> outside of the Cassandra ecosystem. Data copy via Sidecar is valuable for
> Cassandra to have an ecosystem-native approach outside the streaming path,
> which excludes repairs, decommissions and bootstraps. The proposed solution
> poses fewer security concerns than rsync. An ecosystem-native approach is
> more instrumentable and measurable than rsync, and tooling can be built on
> top of it.
>
> *File digest/checksum*
>
> The initial proposal mentioned that the combination of file path and size
> is used to verify that destination and source have the same set of data.
> Scott, Jon and Dinesh expressed concerns about hitting corner cases where
> just verifying path & size is not good enough. I have updated CEP-40 to
> include binary-level file verification using a digest algorithm.
>
> *Managing C* lifecycle with Sidecar*
>
> The proposed migration process requires bringing the Cassandra instances
> up and down. This CEP called out that bringing the instances up/down is
> not in scope. Jon and Jordan expressed that adding this ability, to make
> the entire workflow self-managed, is the biggest win.
>
> Managing the C* lifecycle (safely start, stop & restart) is already
> considered in scope for CEP-1
> <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*)>.
> It can be leveraged when implemented as part of CEP-1.
>
> *Abstraction of how files get moved, backup and restore*
>
> Jordan & German mentioned that having an abstraction of how files get
> moved / put in place would allow others to plug in alternative means of
> data movement, like pulling down from backups/S3/any other source.
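To illustrate the binary-level verification idea, here is a minimal sketch (the function names and choice of SHA-256 are illustrative assumptions, not part of CEP-40): digest each file on both sides and compare, instead of relying on path & size alone.

```python
import hashlib

def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Stream the file through a digest so large sstables are not read into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_transfer(source_path: str, dest_path: str) -> bool:
    """Binary-level check: equal digests imply equal contents (with
    overwhelming probability), unlike a path-and-size comparison, which
    misses same-size files with different bytes."""
    return file_digest(source_path) == file_digest(dest_path)
```

A same-size corruption that a path & size check would accept is exactly the corner case this catches.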
> Jeff added the following points. 1) If you think of it instead as "change
> the backup/restore mechanism to be able to safely restore from a running
> instance", you may end up with a cleaner abstraction that is easier to
> think about (and may also be easier to generalize in clouds where you have
> other tools available). 2) "Ensure the original source node isn't
> running", "migrate the config", "choose and copy a snapshot", and maybe
> "forcibly exclude the original instance from the cluster" are all things
> the restore code is going to need to do anyway, and if restore doesn't do
> that today, it seems like we can solve it once. It accomplishes the
> original goal, in largely the same fashion; it just makes the logic
> reusable for other purposes.
>
> Jon and Jordan mentioned that framing the replacement of a node as
> restoring a node and then kicking off a node replacement is an interesting
> perspective. The data copy task mentioned in this CEP can be viewed as a
> restore task/job which treats another running Sidecar as a source. When it
> is generalised to support other sources, like S3 or disk snapshots,
> support for many use cases can be added, such as restoring from S3 or disk
> snapshots.
>
> I have updated CEP-40 with the details of how files get moved and put in
> place, which can be treated as the default implementation for live
> migration. Having a cleaner abstraction/interface for the source has been
> added as one of the goals. The data copy task can be decoupled from live
> migration so that it can be used to copy data from any (remote) source to
> local storage; this way it can be leveraged across different use cases.
> The data copy task endpoint can be tailored to accommodate different
> plugins during implementation. Francisco mentioned that Sidecar now has
> the ability to restore data from S3 (for the Analytics library), and it
> can be extended for live migration, backup and restore, and others.
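One way to picture the source abstraction discussed above, as a sketch only (the interface and method names are hypothetical, not Sidecar's actual API): the data copy task programs against a minimal source interface, and a peer Sidecar, S3, or a snapshot store would each implement it.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List

class DataSource(ABC):
    """Hypothetical interface the data copy task would depend on."""

    @abstractmethod
    def list_files(self) -> List[str]:
        """Relative paths of the sstable files to fetch."""

    @abstractmethod
    def open_stream(self, relative_path: str) -> Iterator[bytes]:
        """Yield the file's contents in chunks."""

class InMemorySource(DataSource):
    """Stand-in for a peer-Sidecar / S3 / snapshot implementation."""

    def __init__(self, files: dict):
        self.files = files

    def list_files(self) -> List[str]:
        return sorted(self.files)

    def open_stream(self, relative_path: str) -> Iterator[bytes]:
        yield self.files[relative_path]

def copy_all(source: DataSource) -> dict:
    """The copy task sees only the interface, so any source can plug in."""
    return {p: b"".join(source.open_stream(p)) for p in source.list_files()}
```

With this shape, restoring from S3 and live-migrating from a running Sidecar become two implementations of the same contract.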
> *Supporting live migration within the Cassandra process instead of Sidecar*
>
> Paulo and Ariel raised a point about supporting migration in the main
> process via entire-sstable streaming, which could also help people who
> aren't running the Sidecar.
>
> Jon, Francisco, Jordan, Scott & Dinesh mentioned the following benefits of
> doing live migration via Sidecar. Sidecar can be used for coordination to
> start and stop instances, or to do things that require something out of
> process. Sidecar would be able to migrate from a Cassandra instance that
> is already dead and cannot recover (not because of disk issues). If we are
> considering the main process, then we have to do some additional work to
> ensure that it doesn't put pressure on the JVM and introduce latency. The
> host replacement process also puts a lot of stress on gossip and is a
> great way to encounter all sorts of painful races if you perform it
> hundreds or thousands of times (but that shouldn't be a problem in the
> TCM world). It is also valuable to have a paved-path implementation of a
> safe migration/forklift state machine for when you're in a bind, or need
> to do this hundreds or thousands of times.
>
> *Migrating a specific keyspace to a dedicated cluster*
>
> Patrick brought up an interesting use case. In his words: "In many cases,
> multiple tenants present cause the cluster to overpressure. The best
> solution in that case is to migrate the largest keyspace to a dedicated
> cluster. Live migration but a bit more complicated. No chance of doing
> this manually without some serious brain surgery on C* and downtime."
>
> With the proposed solution, keyspaces can be copied selectively by
> supplying inclusion/exclusion filters to the data copy task API. If the
> writes for these keyspaces can't be stopped at the source, then the
> destination may not reach the point where 100% of the data of the selected
> keyspaces matches the source. To me, it sounds doable, assuming that the
> writes can be stopped for some time.
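A minimal sketch of how such inclusion/exclusion filtering could behave (parameter names and path layout are illustrative assumptions, not the actual data copy task API): filter candidate data-directory paths by their keyspace component before copying.

```python
from typing import Iterable, List, Optional, Set

def filter_by_keyspace(
    paths: Iterable[str],
    include: Optional[Set[str]] = None,
    exclude: Optional[Set[str]] = None,
) -> List[str]:
    """Keep data-directory paths of the form <keyspace>/<table>/<file>
    whose keyspace passes the include/exclude filters.  `include=None`
    means "everything"; `exclude` always wins over `include`."""
    selected = []
    for path in paths:
        keyspace = path.split("/", 1)[0]
        if include is not None and keyspace not in include:
            continue
        if exclude is not None and keyspace in exclude:
            continue
        selected.append(path)
    return selected
```

For Patrick's use case, an `include` set naming only the largest keyspace would copy just that keyspace's sstables to the dedicated cluster.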
> As Patrick called out, it is a bit more complicated when downtime is not
> acceptable. As Josh mentioned, it can be a stretch goal or a v2 of this
> CEP.
>
> *Live migration + TCM*
>
> Alex raised a point that the cluster may lose both durability and
> availability during the second phase of the data copy. This migration can
> be done in a more durable manner with TCM by leaving the source node as
> both a read and write target, and allowing the new node to be a target
> for writes. It can eliminate the second phase of data copying. Jordan
> also agrees that it is a more durable approach. Alex also felt that it
> would be good to have a good understanding of the availability and
> durability guarantees we want to provide, and to have them stated
> explicitly, for both the "source node down" and "source node up" cases.
> Alex and I had an offline discussion and brought up the point that this
> may require extra care to make sure that the initial data copy sstables
> aren't involved in the regular node sstable lifecycle, since in that case
> we may inadvertently remove or compact them.
>
> Alex is fine with doing it either as part of the current CEP, or as a
> follow-up.
>
> ----
>
> I hope I have covered all the points and addressed them in the CEP.
> Please call out anything I have missed. I feel that we have reached a
> point where we have had a fair amount of discussion and the discussion
> has fairly converged. Is it a good time to call for voting? Do I have to
> wait or do anything else before requesting votes?
>
> Thanks!
> Hari
>
> On Thu, May 30, 2024 at 7:54 PM Alex Petrov <al...@coffeenco.de> wrote:
>
>> Alex, just want to make sure that I understand your point correctly. Are
>> you suggesting this sequence of operations with TCM?
>>
>> * Make config changes
>> * Do the initial data copy
>> * Make destination part of write placements (same as source)
>> * Start destination instance
>> * Decommission the source
>> * Enable reads for destination by making it part of read placements (as
>> source)
>>
>> Almost. I am suggesting reusing the logic we have in TCM and already use
>> for bootstraps and replacements. I think the sequencing will be
>> something like:
>> * Make config changes
>> * Start destination instance
>> * Make destination part of write placements (same as source)
>> * Do the initial data copy
>> * Load sstables from the initial data copy
>> * Enable reads for destination by making it part of read placements
>> * Decommission the source
>>
>> We also had a short discussion offline, and brought up a good point that
>> this may require extra care to make sure that the initial data copy
>> sstables aren't involved in the regular node sstable lifecycle, since in
>> that case we may inadvertently remove or compact them.
>>
>> > It is a fair point. It is good to have an understanding of the
>> availability and durability guarantees during migration. I can create a
>> JIRA for it later.
>>
>> Sounds good. As I mentioned, I'm fine either way: whether we do it as
>> part of this CEP, or as a follow-up.
>>
>> On Sun, May 12, 2024, at 8:18 PM, Venkata Hari Krishna Nukala wrote:
>>
>> Replies from my side for the other points of the discussion:
>>
>> *Managing C* lifecycle with Sidecar*
>>
>> > lifecycle / orchestration portion is the more challenging aspect. It
>> would be nice to address that as well so we don't end up with something
>> like repair where the building blocks are there but the hard parts are
>> left to the operator
>>
>> CEP-1 has lifecycle operations under scope:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95652224#CEP1:ApacheCassandraManagementProcess(es)-3.Lifecycle(safelystart,stop,restartC*).
>> I think it can be leveraged when implemented as part of CEP-1.
>>
>> *On the backup & restore use case*
>>
>> I see similarities between backup/restore & this migration. But I feel
>> there will be considerable differences while implementing it, and we
>> might need to tailor the API to make it usable for backup & restore. I
>> think making the code/logic reusable can be an implicit goal. Does
>> calling backup & restore a stretch goal, or creating a separate CEP,
>> sound fair?
>>
>> *Migrate the largest keyspace to a dedicated cluster*
>>
>> Patrick, the proposed API can help copy a specific keyspace's data to
>> another cluster. "No chance of doing this manually without some serious
>> brain surgery on c* and downtime" sounds a bit tricky to me. Since the
>> clusters are independent, doing it without any coordination between the
>> clusters and without downtime sounds like a case this CEP is not
>> targeting at the moment.
>>
>> *Live migration + TCM*
>>
>> > We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
>>
>> Alex, just want to make sure that I understand your point correctly. Are
>> you suggesting this sequence of operations with TCM?
>>
>> * Make config changes
>> * Do the initial data copy
>> * Make destination part of write placements (same as source)
>> * Start destination instance
>> * Decommission the source
>> * Enable reads for destination by making it part of read placements (as
>> source)
>>
>> > I am also not against having this done post-factum, after the
>> implementation of the CEP in its current form, but I think it would be
>> good to have a good understanding of the availability and durability
>> guarantees we want to provide with it, and have them stated explicitly,
>> for both the "source node down" and "source node up" cases.
>>
>> It is a fair point. It is good to have an understanding of the
>> availability and durability guarantees during migration. I can create a
>> JIRA for it later.
>>
>> Thanks!
>> Hari
>>
>> On Thu, May 2, 2024 at 12:30 PM Alex Petrov <al...@coffeenco.de> wrote:
>>
>> Thank you for the input!
>>
>> > Would it be possible to create a new type of write target node? The
>> new write target node is notified of writes (like any other write node)
>> but does not participate in the write availability calculation.
>>
>> We could make some kind of optional write, but unfortunately this way we
>> cannot codify our consistency level. Since we already use a notion of
>> pending ranges that requires 1 extra ack, and we as a community are OK
>> with it, I think for simplicity we should stick to the same notion.
>>
>> If there is a lot of interest in this kind of availability/durability
>> tradeoff, we should discuss all the implications in a separate CEP, but
>> then it would probably make sense to make it available for all
>> operations.
>>
>> My personal opinion is that if we can't guarantee/rely on the number of
>> acks, this may accidentally mislead people, as they would expect it to
>> work, and lead to surprises when it does not.
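Alex's TCM-style sequencing from earlier in the thread can be written down as a toy ordered-step sketch (purely illustrative; none of these names exist in Cassandra or Sidecar): the key invariant is that the source is decommissioned only after the destination has become a read target.

```python
from enum import Enum, auto

class Step(Enum):
    MAKE_CONFIG_CHANGES = auto()
    START_DESTINATION = auto()
    ADD_DEST_TO_WRITE_PLACEMENTS = auto()
    INITIAL_DATA_COPY = auto()
    LOAD_COPIED_SSTABLES = auto()
    ADD_DEST_TO_READ_PLACEMENTS = auto()
    DECOMMISSION_SOURCE = auto()

# TCM-style ordering: writes are mirrored to the destination before the
# bulk copy starts, which is what removes the need for a second diff
# phase, and the source stays a read/write target until the very end.
TCM_SEQUENCE = [
    Step.MAKE_CONFIG_CHANGES,
    Step.START_DESTINATION,
    Step.ADD_DEST_TO_WRITE_PLACEMENTS,
    Step.INITIAL_DATA_COPY,
    Step.LOAD_COPIED_SSTABLES,
    Step.ADD_DEST_TO_READ_PLACEMENTS,
    Step.DECOMMISSION_SOURCE,
]

def source_decommissioned_last(sequence) -> bool:
    """Durability invariant: never drop the source before the
    destination is serving reads."""
    return sequence.index(Step.DECOMMISSION_SOURCE) > sequence.index(
        Step.ADD_DEST_TO_READ_PLACEMENTS
    )
```

The earlier sequence in the thread, which decommissioned the source before enabling reads on the destination, would fail this invariant.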
>>
>> On Wed, May 1, 2024, at 4:38 PM, Claude Warren, Jr via dev wrote:
>>
>> Alex,
>>
>> you write:
>>
>> We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
>>
>> Would it be possible to create a new type of write target node? The new
>> write target node is notified of writes (like any other write node) but
>> does not participate in the write availability calculation. In this way,
>> a node that is being migrated to could receive writes and have minimal
>> impact on the current operation of the cluster.
>>
>> Claude
>>
>> On Wed, May 1, 2024 at 12:33 PM Alex Petrov <al...@coffeenco.de> wrote:
>>
>> Thank you for submitting this CEP!
>>
>> Wanted to discuss this point from the description:
>>
>> > How to bring up/down Cassandra/Sidecar instances or making/applying
>> config changes are outside the scope of this document.
>>
>> One advantage of doing migration via Sidecar is the fact that we can
>> stream sstables to the target node from the source node while the source
>> node is down. Also, if the source node is down, it does not matter that
>> we can't use it as a write target. However, if we are replacing a live
>> node, we do lose both durability and availability during the second copy
>> phase. There are copious other advantages described by others in the
>> thread above.
>>
>> For example, we have three adjacent nodes A, B, C and simple RF 3. C
>> (source) is up and is being replaced with live-migrated D (destination).
>> According to the process described in CEP-40, we perform streaming in
>> two phases: the first is a full copy (similar to bootstrap/replacement
>> in Cassandra), and the second is just a diff. The second phase is still
>> going to take a non-trivial amount of time, and is likely to last at the
>> very least minutes. During this time, we only have nodes A and B as both
>> read and write targets, with no alternatives: we have to have both of
>> them present for any operation, and losing either one of them leaves us
>> with only one copy of the data.
>>
>> To contrast this, the TCM bootstrap process is 4-step: between the old
>> owner being phased out and the new owner brought in, we always ensure
>> r/w quorum consistency and liveness of at least 2 nodes for the read
>> quorum, with 3 nodes available for reads in the best case, and 2+1
>> pending replicas for the write quorum, with 4 nodes (3 existing owners +
>> 1 pending) being available for writes in the best case. Replacement in
>> TCM is implemented similarly, with the old node remaining an
>> (unavailable) read target, but the new node already being the target for
>> (pending) writes.
>>
>> We can implement CEP-40 using a similar approach: we can leave the
>> source node as both a read and write target, and allow the new node to
>> be a target for (pending) writes. Unfortunately, this does not help with
>> availability (in fact, it decreases write availability, since we will
>> have to collect 2+1 mandatory write responses instead of just 2), but it
>> increases durability, and I think helps to fully eliminate the second
>> phase. This also increases read availability when the source node is up,
>> since we can still use the source node as a part of the read quorum.
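A quick sanity check of the quorum arithmetic above, as a toy calculation (not Cassandra code): with RF=3 a QUORUM write needs 2 acks, and each pending replica adds one mandatory ack, which is where "2+1 instead of just 2" comes from.

```python
def write_acks_required(rf: int, pending_replicas: int = 0) -> int:
    """QUORUM needs a majority of the RF natural replicas, plus one extra
    mandatory ack per pending replica (the pending-range rule referred to
    in the thread)."""
    return rf // 2 + 1 + pending_replicas

# RF=3, steady state: any 2 of {A, B, C} must ack a quorum write.
assert write_acks_required(3) == 2
# RF=3 with D as a pending write target: 2+1 acks are required, so write
# availability drops, but every acknowledged write also lands on D.
assert write_acks_required(3, pending_replicas=1) == 3
```

This is the durability-for-availability trade described above: the extra mandatory ack shrinks the set of tolerable failures but guarantees the migrating node never misses a write.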
>>
>> I think if we want to call this feature "live migration", we may want
>> to provide similar guarantees, since this term is used in the hypervisor
>> community to describe an instant and uninterrupted instance migration
>> from one host to another, without the guest instance being able to
>> notice much more than a time jump.
>>
>> I am also not against having this done post-factum, after the
>> implementation of the CEP in its current form, but I think it would be
>> good to have a good understanding of the availability and durability
>> guarantees we want to provide with it, and have them stated explicitly,
>> for both the "source node down" and "source node up" cases. That said,
>> since we will have to integrate CEP-40 with TCM, and will have to ensure
>> the correctness of sstable diffing for the second phase, it might make
>> sense to consider reusing some of the existing replacement logic from
>> TCM. Just to make sure this is mentioned explicitly: my proposal is only
>> concerned with the second copy phase, without any implications about the
>> first.
>>
>> Thank you,
>> --Alex
>>
>> On Fri, Apr 5, 2024, at 12:46 PM, Venkata Hari Krishna Nukala wrote:
>>
>> Hi all,
>>
>> I have filed CEP-40 [1] for live-migrating Cassandra instances using the
>> Cassandra Sidecar.
>>
>> When someone needs to move all or a portion of the Cassandra nodes
>> belonging to a cluster to different hosts, the traditional approach of
>> Cassandra node replacement can be time-consuming due to repairs and the
>> bootstrapping of new nodes. Depending on the volume of the storage
>> service load, replacements (repair + bootstrap) may take anywhere from a
>> few hours to days.
>>
>> I am proposing a Sidecar-based solution to address these challenges.
>> This solution proposes transferring data from the old host (source) to
>> the new host (destination) and then bringing up the Cassandra process at
>> the destination, to enable fast instance migration.
>> This approach would help to minimise node downtime, as it is based on a
>> Sidecar solution for data transfer and avoids repairs and bootstrap.
>>
>> Looking forward to the discussions.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-40%3A+Data+Transfer+Using+Cassandra+Sidecar+for+Live+Migrating+Instances
>>
>> Thanks!
>> Hari