JoaoJandre opened a new issue, #9524:
URL: https://github.com/apache/cloudstack/issues/9524

   ##### ISSUE TYPE
    * Feature Idea
   
   ##### COMPONENT NAME
   ~~~
   VM Snapshot
   ~~~
   
   ##### CLOUDSTACK VERSION
   
   ~~~
   4.20/main
   ~~~
   
   ##### CONFIGURATION
   
   ##### OS / ENVIRONMENT
   
   KVM, file storage (NFS, Shared mountpoint, local storage)
   
   ##### SUMMARY
   
   This spec addresses an update to the disk-only VM snapshot feature on the KVM hypervisor.
   
   # 1. Problem Description
   
   Currently, when using KVM as the hypervisor, CloudStack does not support disk-only snapshots of VMs with volumes on NFS or local storage. CloudStack also does not support VM snapshots of stopped VMs. This means that if users need some sort of snapshot of their volumes, they must use the volume snapshot/backup feature. Furthermore, the current implementation relies on the same workflow as volume snapshots/backups: 
   
   1. The VM will be frozen (ignoring the quiesce parameter);
   2. Each volume will be processed individually using the volume snapshot 
workflow;
   3. Once all the snapshots are done, the VM will be resumed. 
   
   However, this approach is flawed: we not only create the snapshots, but also copy all of them to another directory while the VM is frozen. This causes significant downtime, which might be extremely long if the volumes are big.
   
   Moreover, as the snapshots are copied to another directory in the primary storage, reverting also takes some time, as we need to copy the snapshot back.
   
   ### 1.1 Basic Definitions
   
   Here are some basic definitions that will be used throughout this spec:
   
   - Backing file: a read-only file that will be read when data is not found in the top file.
   - Delta: a file that stores data that was changed in comparison to its 
backing file(s). When a snapshot is taken, the current delta/file will become 
the backing file of a new delta, thus preserving the data on the file being 
snapshotted.
   - Backing chain: a chain of backing files.
   - Current delta/snapshot: the current delta that is being written to. It may 
have a parent snapshot and siblings.
   - Parent snapshot: the immediate predecessor of a given snapshot/delta in 
its backing chain. A snapshot/delta can only have one parent.
   - Child snapshot: the immediate successor of a given snapshot in the backing 
chain. A snapshot might have multiple children. The current delta cannot have 
children.
   - Sibling snapshot: a snapshot that is the child of a given snapshot/delta's 
parent.
   
   
   # 2. Proposed Changes
   
   To address the described problems, we propose to extend the VM snapshot 
feature on KVM to allow disk-only VM snapshots for NFS and local storage; other 
types of storage, such as shared-mount-point, already support disk-only VM 
snapshot. Furthermore, we intend to change the disk-only VM snapshot process 
for all other file-based storages (local, NFS and shared-mount-point): 
   
   1. We will take all the snapshots at the same time, instead of one at a time.
   2. Unlike volume snapshots, the disk-only VM snapshots will not be copied to another directory; they will stay where they are after being taken and become part of the volumes' backing chains. This makes reverting a snapshot much faster, as we only have to change the paths pointed to in the VM's domain XML.
   3. The VM will only be frozen if the `quiesceVM` parameter is `true`.
   
   ### 2.0.2. Limitations
   
   - This proposal will not change the process for non-file based storages, 
such as LVM and RBD (Ceph). 
   - The snapshots will be external, that is, each snapshot will generate a new file when created, instead of internal, where the snapshots are stored inside the single volume file. This is a limitation of QEMU: active QEMU domains require external disk snapshots when the VM's state is not being saved.
   - As Libvirt does not support reverting external snapshots, the reversion 
will have to be done by ACS while the VM is stopped. 
   - To avoid adding complexity to volume migration, we will not allow migrating a volume while it has disk-only VM snapshots; this limitation is already present in the current implementation for VM snapshots with memory, although for different reasons.
   - After taking a disk-only VM snapshot, attaching/detaching volumes will not 
be permitted. This limitation already exists for other VM snapshot types.
   - This feature will be mutually exclusive with volume snapshots. Taking a volume snapshot of a volume that is attached to a VM with disk-only VM snapshots will fail; furthermore, the user will not be able to create snapshot policies for the volume. Conversely, if one of the VM's volumes has a snapshot or snapshot policy, the user will not be able to create a disk-only VM snapshot. There are multiple reasons to limit the interaction between these features:
       1. When reverting volume snapshots, we have two choices: create a new tree for that volume with the reverted snapshot as its base, or invalidate the VM snapshots. The first option adds complexity to the implementation by adding lots of trees that have to be taken care of, while the second option makes the user lose data.
       2. The KVM incremental volume snapshot feature uses an API that is not compatible with `virDomainSnapshotCreateXML` (the API used for this feature). With this in mind, allowing volume and disk-only VM snapshots to coexist would create edge cases for failure; for example, if the user has a snapshot policy and the `kvm.incremental.snapshot` setting is changed to `true`, the volume snapshots will suddenly begin to fail.
   - This feature will also be mutually exclusive with VM snapshots with memory. When reverting an external snapshot, all internal snapshots that were taken after the creation of the snapshot being reverted will be lost, causing data loss. As VM snapshots with memory use internal snapshots and this feature uses external snapshots, we will not allow both to coexist, to prevent data loss.
   
   # 2.1. Disk-only VM Snapshot Creation
   <section id="snap-creation"></section>
   
   The proposed disk-only VM snapshot creation workflow is summarized in the 
following diagram.
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715878023/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_creation_2_tftl1d.png"
       alt="create-snapshot"
       style="width: 100%;
       height: auto;">
   
   * If the VM is running, we will call the `virDomainSnapshotCreateXML` API, informing all the VM's volumes with the `snapshot` key and the `external` value, and using the following flags:
        1. `VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC`: to make the snapshot atomic across all the volumes;
        2. `VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY`: to make the snapshot disk-only;
        3. `VIR_DOMAIN_SNAPSHOT_CREATE_NO_METADATA`: to tell Libvirt not to save any metadata for the snapshot. This flag will be used because we do not need Libvirt to save any metadata; all the other processes regarding the VM snapshots will be done manually using `qemu-img`.
        4. `VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE`: if `quiesceVM` is `true`, this flag will be passed as well, to keep the VM frozen during the snapshot process; once the snapshot is done, the VM will already be thawed.
   * Otherwise, we will call `qemu-img create` for every volume of the VM to create a delta on top of each current file; the old file becomes our snapshot.
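
   As an illustration of the running-VM path, the snapshot XML passed to `virDomainSnapshotCreateXML` could look like the sketch below. The disk targets (`vda`, `vdb`) and file paths are hypothetical; the real XML will be generated by ACS for the VM's actual volumes:

```xml
<domainsnapshot>
  <disks>
    <!-- every volume is informed with snapshot='external',
         so QEMU creates a new delta file for each one -->
    <disk name='vda' snapshot='external'>
      <source file='/mnt/primary/snap-delta-vda.qcow2'/>
    </disk>
    <disk name='vdb' snapshot='external'>
      <source file='/mnt/primary/snap-delta-vdb.qcow2'/>
    </disk>
  </disks>
</domainsnapshot>
```

   For the stopped-VM path, the per-volume equivalent would be along the lines of `qemu-img create -f qcow2 -b <current-file> -F qcow2 <new-delta>`, which creates the new delta with the old file as its backing file.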
   
   Unlike volume snapshots, disk-only VM snapshots are not designed to be backups; thus, we will not copy them to another directory or storage. We want the disk-only snapshots to be fast to revert whenever needed, and keeping them in the volumes' backing chains is the best way to achieve this.
   
   Currently, the VM is always frozen and resumed during the snapshot process, regardless of what is informed in the `quiesceVM` parameter. This process will be changed: the VM will only be frozen if `quiesceVM` is set. Furthermore, the downtime of the proposed process will be orders of magnitude smaller than in the current implementation, as no copy will be made while the VM is frozen.
   
   During the VM snapshot process, the snapshot job is queued alongside the 
other VM jobs; therefore, we do not have to worry about the VM being 
stopped/started during the snapshot, as each job is processed sequentially for 
each given VM. Furthermore, after creating the VM snapshot, ACS already forbids 
detaching volumes from the VM, so we do not need to worry about this case as 
well.
   
   # 2.2. VM Snapshot Reversion
   <section id="snap-reversion"></section>
   
   The proposed disk-only VM snapshot restore process is summarized in the 
diagram below. The process will be repeated for all the VM's volumes.
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715707430/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_reversion_1_iw5dej.png"
       alt="revert-snapshot"
       style="width: 100%;
       height: auto;">
   
   1. If the VM is running, we throw an exception, otherwise, we continue;
   2. If the current delta's parent is dead, we:
       1. Merge our sibling and our parent;
       2. Rebase our sibling's children (if any) to point to the merged 
snapshot;
       3. Remove our sibling's old file, as it is now empty;
       4. Change our sibling's path in the DB so that it points to the merged 
file;
   3. Delete the current delta that is being written to. It only contains 
changes on the disk that will be reverted as part of the process;
   4. Create a new delta on top of the snapshot that is being reverted to, so 
that we do not write directly into it and are able to return to it again later; 
   5. Update the volume path to point to the newly created delta.
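
   Steps 3 to 5 above can be sketched as an ordered plan of storage operations. This is a minimal illustration only; the function and file names are hypothetical, not ACS code, and step 2 (merging a dead parent) is omitted:

```python
# Minimal sketch of the disk-only revert plan for one volume of a stopped VM.
# All names and paths are illustrative; this is not the actual ACS workflow.

def plan_revert(snapshot_path: str, current_delta_path: str, new_delta_path: str):
    """Return the ordered operations to revert a volume to `snapshot_path`."""
    return [
        # Step 3: the current delta only holds changes being discarded.
        ("delete-file", current_delta_path),
        # Step 4: create a new delta on top of the snapshot being reverted to,
        # so the snapshot stays read-only and can be reverted to again later.
        ("run", f"qemu-img create -f qcow2 -b {snapshot_path} -F qcow2 {new_delta_path}"),
        # Step 5: point the volume at the new delta in the DB.
        ("update-volume-path", new_delta_path),
    ]

ops = plan_revert("snap2.qcow2", "current.qcow2", "new-delta.qcow2")
assert ops[0] == ("delete-file", "current.qcow2")
```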
   
   The proposed process will allow us to go back and forth between snapshots if need be. Furthermore, this process will be much faster than reverting a volume snapshot, as the bottleneck here is deleting the top delta that will no longer be used, which should be much faster than copying a volume snapshot from another storage and replacing the old volume. 
   
   The process done in step 2 was added to cover an edge case where dead 
snapshots would be left in the storage until the VM was expunged. Here's a 
simple example of why it's needed:
   
   - Let's imagine that the image below represents the current state of the snapshot tree, with `Current` being the current delta that is being written to, and `Snap 1` the parent of `Current` and `Snap 2`. If we delete `Snap 1`, following the diagram in the <a href="#snap-deletion" class="internal-link">snapshot deletion section</a>, we can see that it will be marked as destroyed, but will not be deleted nor merged, as neither of these operations can be done in this situation.
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784150/specs/cloudstack/disk-only-vm-snapshot/Drawing_1_gsu9vr.png"
       alt="revert-snapshot-ex1"
       style="width: 10%;
       display: block;
       margin-left: auto;
       margin-right: auto;
       height: auto;">
   
   - Then, we decide to revert to `Snap 2`; thus, the old `Current` will be deleted and a new delta will be created on top of `Snap 2`. The problem is that now we have a dead snapshot that will not be removed by any other process: the user will not see it, and none of the snapshot deletion branches will delete it:
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784167/specs/cloudstack/disk-only-vm-snapshot/Drawing_2_vmmjhl.png"
       alt="revert-snapshot-ex2"
       style="width: 5%;
       display: block;
       margin-left: auto;
       margin-right: auto;
       height: auto;">
   
   - Now adding step 2 of the proposed reversion workflow: during reversion we 
will merge `Snap 1` and `Snap 2` (using `qemu-img commit`) and be left with 
only `Snap 2`, and our current delta:
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715784171/specs/cloudstack/disk-only-vm-snapshot/Drawing_3_jlf7xx.png"
       alt="revert-snapshot-ex3"
       style="width: 5%;
       display: block;
       margin-left: auto;
       margin-right: auto;
       height: auto;">
       
   # 2.3. VM Snapshot Deletion
   <section id="snap-deletion"></section>
   
   In order to keep the snapshot tree consistent and with the least amount of dead nodes, the snapshot deletion process will always try to manipulate the snapshot tree to remove any unneeded nodes while keeping the ones that are still needed, even if they were removed by the user; in these cases, they'll be marked as deleted in the DB, but will remain on the primary storage until they can be merged with another snapshot. The diagram below summarizes the snapshot deletion process, which will be repeated for all the VM's volumes:
   
   <img 
src="https://res.cloudinary.com/sc-clouds/image/upload/v1715869488/specs/cloudstack/disk-only-vm-snapshot/vm_snapshot_deletion_6_iobpxf.png"
       alt="snapshot-deletion"
       style="width: 100%;
       height: auto;">
   
   As this diagram has several branches, each branch will be explained 
separately:
   
   * If our snapshot has a single child: 
       1. we merge it with its child, committing the child into it: if the VM is running, we use Libvirt's `virDomainBlockCommit`; otherwise, we use `qemu-img commit`;
       2. rebase its grandchildren, if any, to the merged file. If the VM is running, the `virDomainBlockCommit` API already does this for us;
       3. remove the child's old file;
       4. change the child's path in the DB to the merged file.
   * If our snapshot has more than one child:
       1. we mark it as removed in the DB, but keep it in storage until it can be deleted.
   * If our snapshot has no children:
       1. we delete the snapshot file;
       2. if the snapshot has a dead parent and only one sibling:
           1. we merge the sibling with its parent, committing the sibling into the parent: if the VM is running, we use Libvirt's `virDomainBlockCommit`; otherwise, we use `qemu-img commit`;
           2. rebase the sibling's children to the merged file. If the VM is running, the `virDomainBlockCommit` API already does this for us;
           3. remove the sibling's old file;
           4. change the sibling's path in the DB to the merged file.
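
   The branches above can be modeled on a small in-memory tree. This is a sketch only: `Node`, `delete_snapshot`, and the helper names are hypothetical, not CloudStack classes, and the real file commits and rebases (`virDomainBlockCommit` / `qemu-img commit`) are reduced to pointer updates:

```python
# Illustrative model of the snapshot deletion branches; not ACS code.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One snapshot/delta in a volume's backing chain."""
    name: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    dead: bool = False  # removed in the DB but still on primary storage

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

def _commit_into_parent(node: Node) -> None:
    # Commit `node` into its parent's file; `node` takes over the merged
    # file, and the parent disappears from the chain.
    parent = node.parent
    node.parent = parent.parent
    if parent.parent is not None:
        parent.parent.children.remove(parent)
        parent.parent.children.append(node)

def delete_snapshot(snap: Node) -> str:
    if len(snap.children) == 1:
        # Branch 1: single child -> merge the child into the snapshot.
        _commit_into_parent(snap.children[0])
        return "merged"
    if len(snap.children) > 1:
        # Branch 2: multiple children -> only mark as removed in the DB.
        snap.dead = True
        return "marked-dead"
    # Branch 3: no children -> the snapshot file can be deleted outright.
    parent = snap.parent
    if parent is not None:
        parent.children.remove(snap)
        if parent.dead and len(parent.children) == 1:
            # Dead parent left with a single sibling: merge them as well.
            _commit_into_parent(parent.children[0])
    return "deleted"

# Example: Snap 1 has two children (Snap 2 and the current delta).
snap1 = Node("snap1")
snap2 = snap1.add_child(Node("snap2"))
current = snap1.add_child(Node("current"))
assert delete_snapshot(snap1) == "marked-dead"  # kept as a dead node
assert delete_snapshot(current) == "deleted"    # also merges snap2 into snap1
assert snap2.parent is None                     # snap2 is now the chain's base
```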
   
   The proposed deletion process leaves room for one edge case, which could lead to a dead node that would only be removed when the volume was deleted: if we revert to a snapshot that has one other child and then delete that snapshot, following the above algorithm, it will end up only marked as removed in the DB, since it now has two children. If we then revert to another snapshot, a dead node (the previously deleted snapshot) would be left on the tree and never removed. To solve this edge case, when this specific situation happens, we will do as explained in the <a href="#snap-reversion" class="internal-link">snapshot reversion section</a> and merge the dead node with its remaining child.
   
   # 2.4. Template Creation from Volume
   
   The current process of creating a template from a volume does not need to be 
changed. We already convert the volume when creating a template, so the 
volume's backing chain will be merged when creating a template.

