Re: [PR] feat(backup): incremental NAS backup support for KVM [cloudstack]

via GitHub Wed, 01 Jul 2026 11:24:05 -0700


jmsperu commented on code in PR #13074:
URL: https://github.com/apache/cloudstack/pull/13074#discussion_r3508213029



##########
scripts/vm/hypervisor/kvm/nasbackup.sh:
##########
@@ -113,20 +122,104 @@ backup_running_vm() {
   mount_operation
   mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 
1; }
 
+  # Determine effective mode for this run.
+  # Legacy callers (no -M argument) get the original full-only behavior with 
no checkpoint.
+  # The Java wrapper (LibvirtTakeBackupCommandWrapper) pre-validates required 
args before
+  # invoking the script; the case below is a defensive fallback for direct 
invocations.
+  local effective_mode="${MODE:-legacy-full}"
+  local make_checkpoint=0
+  case "$effective_mode" in
+    incremental|full)
+      make_checkpoint=1
+      ;;
+    legacy-full)
+      make_checkpoint=0
+      ;;
+    *)
+      echo "Unknown mode: $effective_mode"
+      cleanup
+      exit 1
+      ;;
+  esac
+
+  # When incremental, make sure the parent checkpoint is registered with 
libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory 
checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it 
on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already 
exists" — so the
+  # parent must be re-registered with --redefine. libvirt only needs the 
checkpoint name and a
+  # creationTime for a redefine (the value need not be accurate — checkpoints 
are ephemeral),
+  # so we synthesize a minimal XML on the fly instead of persisting the full 
checkpoint dump.
+  #
+  # First verify the parent bitmap actually exists on the running qcow2 — it 
can be absent after
+  # a migration even though the orchestrator's active_checkpoint says it 
should be there. If it
+  # is gone, fall back to a full backup rather than letting backup-begin fail 
below.
+  if [[ "$effective_mode" == "incremental" ]]; then
+    # The parent bitmap must be present on EVERY disk's qcow2, not just one of 
them. A volume
+    # snapshot restore (or a partial migration) can wipe the bitmap on some 
disks while leaving
+    # it on others; a plain "is the name anywhere in query-block" check passes 
in that case and
+    # backup-begin then fails on the disk that is missing the bitmap. Require 
the bitmap on all
+    # disks: compare the disk count to the number of disks reporting the 
bitmap (tests 17/19).
+    disk_count=$(virsh -c qemu:///system domblklist "$VM" --details 
2>/dev/null | awk '$2=="disk"{c++} END{print c+0}')
+    bitmap_count=$(virsh -c qemu:///system qemu-monitor-command "$VM" 
'{"execute":"query-block"}' 2>/dev/null | grep -o "\"$BITMAP_PARENT\"" | wc -l 
| tr -d ' ')

Review Comment:
   Good catch, fixed in ae2a6b2. The probe no longer aborts on a no-match: I 
replaced the `grep | wc` count with a per-device parse that returns 0 when 
nothing matches and ends in `|| echo 0`, so a snapshot restore that clears the 
bitmap on all disks now falls through to the full-backup fallback instead of 
exiting under `set -eo pipefail`. Verified tests 17 and 18.



##########
scripts/vm/hypervisor/kvm/nasbackup.sh:
##########
@@ -113,20 +122,104 @@ backup_running_vm() {
   mount_operation
   mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 
1; }
 
+  # Determine effective mode for this run.
+  # Legacy callers (no -M argument) get the original full-only behavior with 
no checkpoint.
+  # The Java wrapper (LibvirtTakeBackupCommandWrapper) pre-validates required 
args before
+  # invoking the script; the case below is a defensive fallback for direct 
invocations.
+  local effective_mode="${MODE:-legacy-full}"
+  local make_checkpoint=0
+  case "$effective_mode" in
+    incremental|full)
+      make_checkpoint=1
+      ;;
+    legacy-full)
+      make_checkpoint=0
+      ;;
+    *)
+      echo "Unknown mode: $effective_mode"
+      cleanup
+      exit 1
+      ;;
+  esac
+
+  # When incremental, make sure the parent checkpoint is registered with 
libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory 
checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it 
on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already 
exists" — so the
+  # parent must be re-registered with --redefine. libvirt only needs the 
checkpoint name and a
+  # creationTime for a redefine (the value need not be accurate — checkpoints 
are ephemeral),
+  # so we synthesize a minimal XML on the fly instead of persisting the full 
checkpoint dump.
+  #
+  # First verify the parent bitmap actually exists on the running qcow2 — it 
can be absent after
+  # a migration even though the orchestrator's active_checkpoint says it 
should be there. If it
+  # is gone, fall back to a full backup rather than letting backup-begin fail 
below.
+  if [[ "$effective_mode" == "incremental" ]]; then
+    # The parent bitmap must be present on EVERY disk's qcow2, not just one of 
them. A volume
+    # snapshot restore (or a partial migration) can wipe the bitmap on some 
disks while leaving
+    # it on others; a plain "is the name anywhere in query-block" check passes 
in that case and
+    # backup-begin then fails on the disk that is missing the bitmap. Require 
the bitmap on all
+    # disks: compare the disk count to the number of disks reporting the 
bitmap (tests 17/19).
+    disk_count=$(virsh -c qemu:///system domblklist "$VM" --details 
2>/dev/null | awk '$2=="disk"{c++} END{print c+0}')
+    bitmap_count=$(virsh -c qemu:///system qemu-monitor-command "$VM" 
'{"execute":"query-block"}' 2>/dev/null | grep -o "\"$BITMAP_PARENT\"" | wc -l 
| tr -d ' ')

Review Comment:
   Right, `query-block` lists the bitmap under more than one node per disk, so 
`grep -o name | wc -l` double-counted and the 1-of-2 case read as 2. ae2a6b2 
now counts one per distinct `inserted.file` whose `dirty-bitmaps` holds the 
parent, so that case returns 1 < 2 disks and correctly falls back to full. 
Verified test 19.



##########
scripts/vm/hypervisor/kvm/nasbackup.sh:
##########
@@ -113,20 +122,104 @@ backup_running_vm() {
   mount_operation
   mkdir -p "$dest" || { echo "Failed to create backup directory $dest"; exit 
1; }
 
+  # Determine effective mode for this run.
+  # Legacy callers (no -M argument) get the original full-only behavior with 
no checkpoint.
+  # The Java wrapper (LibvirtTakeBackupCommandWrapper) pre-validates required 
args before
+  # invoking the script; the case below is a defensive fallback for direct 
invocations.
+  local effective_mode="${MODE:-legacy-full}"
+  local make_checkpoint=0
+  case "$effective_mode" in
+    incremental|full)
+      make_checkpoint=1
+      ;;
+    legacy-full)
+      make_checkpoint=0
+      ;;
+    *)
+      echo "Unknown mode: $effective_mode"
+      cleanup
+      exit 1
+      ;;
+  esac
+
+  # When incremental, make sure the parent checkpoint is registered with 
libvirt. CloudStack
+  # rebuilds the domain XML on every VM start, which wipes libvirt's in-memory 
checkpoint
+  # registry, while the dirty bitmap persists on the qcow2 (QEMU re-loads it 
on start). A
+  # fresh checkpoint-create cannot be used then — QEMU reports "Bitmap already 
exists" — so the
+  # parent must be re-registered with --redefine. libvirt only needs the 
checkpoint name and a
+  # creationTime for a redefine (the value need not be accurate — checkpoints 
are ephemeral),
+  # so we synthesize a minimal XML on the fly instead of persisting the full 
checkpoint dump.
+  #
+  # First verify the parent bitmap actually exists on the running qcow2 — it 
can be absent after
+  # a migration even though the orchestrator's active_checkpoint says it 
should be there. If it
+  # is gone, fall back to a full backup rather than letting backup-begin fail 
below.
+  if [[ "$effective_mode" == "incremental" ]]; then
+    # The parent bitmap must be present on EVERY disk's qcow2, not just one of 
them. A volume
+    # snapshot restore (or a partial migration) can wipe the bitmap on some 
disks while leaving
+    # it on others; a plain "is the name anywhere in query-block" check passes 
in that case and
+    # backup-begin then fails on the disk that is missing the bitmap. Require 
the bitmap on all
+    # disks: compare the disk count to the number of disks reporting the 
bitmap (tests 17/19).
+    disk_count=$(virsh -c qemu:///system domblklist "$VM" --details 
2>/dev/null | awk '$2=="disk"{c++} END{print c+0}')
+    bitmap_count=$(virsh -c qemu:///system qemu-monitor-command "$VM" 
'{"execute":"query-block"}' 2>/dev/null | grep -o "\"$BITMAP_PARENT\"" | wc -l 
| tr -d ' ')

Review Comment:
   Done. The in-script probe now mirrors `getVmDiskPathHasFromCheckpointMap()`: 
it parses `query-block` per device (`inserted.file` + `inserted.dirty-bitmaps`) 
instead of counting raw name matches, so the push-mode path uses the same logic 
as the pull-mode wrapper. ae2a6b2.



##########
server/src/main/java/org/apache/cloudstack/backup/BackupManagerImpl.java:
##########
@@ -1701,6 +1701,14 @@ private boolean deleteCheckedBackup(Boolean forced, 
BackupProvider backupProvide
                      reservationDao, resourceLimitMgr)) {
             boolean result = backupProvider.deleteBackup(backup, forced);
             if (result) {
+                // Chain-aware providers (e.g. NAS) physically remove several 
backups per call
+                // (leaf + swept delete-pending ancestors) and decrement 
resource count/usage and
+                // remove each DB row themselves, exactly once per removed 
backup. Decrementing or
+                // removing again here would double-handle and destroy 
delete-pending tombstones,
+                // so defer entirely to the provider for those.
+                if (backupProvider.handlesChainDeleteResourceAccounting()) {
+                    return true;

Review Comment:
   Addressed in ae2a6b2, see the per-device count and set-e-safe fallback 
described in the replies above.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(backup): incremental NAS backup support for KVM [cloudstack]

Reply via email to