And one more thought, by the way. There's a cool new feature - asynchronous backup (https://cwiki.apache.org/confluence/display/CLOUDSTACK/Separate+creation+and+backup+operations+for+a+volume+snapshot). It allows creating a snapshot at one moment and backing it up at another. It would be amazing if it also made it possible to perform the snapshot deletion procedure (I mean deletion from the primary storage) as a separate operation. Then I could check whether I/O activity is low before _deleting_ a snapshot from the primary storage, not only before _creating_ it - that could be a nice workaround.
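Just to illustrate the idea, here is a rough sketch of how I'd do such a "low-I/O deletion" step by hand today. Everything in it - the domain name and snapshot tag arguments, the 5% iowait threshold - is a made-up example, not anything that exists in ACS, and it assumes sysstat (iostat) is installed on the host:

#!/bin/bash
# Sketch: delete an internal qcow2 snapshot from primary storage only when
# the host's I/O wait is low, pausing the VM around the deletion step
# (along the lines of the Red Hat recommendation quoted below).

VM_NAME="$1"      # libvirt domain name (example argument)
SNAP_TAG="$2"     # snapshot tag as listed by 'qemu-img snapshot -l'
MAX_IOWAIT=5      # example threshold, in percent

# Take %iowait (4th column) from the last CPU report of 'iostat -c 1 2';
# a quick-and-dirty parse that assumes the usual sysstat output layout.
IOWAIT=$(iostat -c 1 2 | awk '/^ /{v=$4} END{print v}')

if awk -v io="$IOWAIT" -v max="$MAX_IOWAIT" 'BEGIN{exit !(io > max)}'; then
    echo "I/O wait is ${IOWAIT}%, postponing snapshot deletion" >&2
    exit 1
fi

# Suspend the guest, drop the internal snapshot via the QEMU monitor,
# then resume - the same suspend/delvm/resume sequence as in the
# Red Hat bug report quoted below.
virsh suspend "$VM_NAME"
virsh qemu-monitor-command --hmp "$VM_NAME" "delvm $SNAP_TAG"
virsh resume "$VM_NAME"

(For a stopped VM the suspend/resume part wouldn't be needed at all - a plain 'qemu-img snapshot -d "$SNAP_TAG" /path/to/image.qcow2' would do.) If ACS exposed the deletion from primary storage as its own operation, a check like the iowait one above could simply gate it.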
Dear colleagues, what do you think - is something like this doable? Thank you!

Best regards,
a big CloudStack fan :)

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, 4 February 2019 07:46, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> By the way, RedHat recommended suspending a VM before deleting a snapshot too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
>
>> 1. Pause the VM
>> 2. Take an internal snapshot with the 'savevm' command of the qemu monitor of the running VM, not with an external qemu-img process. virsh may or may not provide an interface for this.
>> 3. You can resume the VM now
>> 4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
>> 5. Pause the VM again
>> 6. 'delvm' in the qemu monitor
>> 7. Resume the VM
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Monday, 4 February 2019 07:36, cloudstack-fan <cloudstack-...@protonmail.com> wrote:
>
>> I'd also like to add another detail, if no one minds.
>>
>> Sometimes one can run into this issue without shutting down a VM. The disaster might occur right after a snapshot is copied to the secondary storage and deleted from the VM's image on the primary storage. I saw it a couple of times when it happened to VMs that were being monitored. The monitoring suite showed that these VMs were working fine right up until the final phase (apart from a short pause during the snapshot creation stage).
>>
>> I also noticed that a VM is always suspended while a snapshot is being created - `virsh list` shows it in the "paused" state - but when a snapshot is being deleted from the image the same command always shows the "running" state, although the VM doesn't respond to anything during the snapshot deletion phase.
>>
>> It seems to be a bug in KVM/QEMU itself, I think. Proxmox users also face the same issue (see https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/, https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/ and other similar threads), but it would also be great to implement some workaround in ACS. Maybe, just as you proposed, it would be wise to suspend the VM before snapshot deletion and resume it afterwards. It would give ACS a serious advantage over other orchestration systems. :-)
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <kudryavtsev...@bw-sw.com> wrote:
>>
>>> Yes, only after the VM shutdown, the image is corrupted.
>>>
>>> Fri, 1 Feb 2019, 15:01, Sean Lair sl...@ippathways.com:
>>>
>>>> Hello,
>>>>
>>>> We are using NFS storage. It is actually native NFS mounts on a NetApp storage system. We haven't seen those log entries, but we also don't always know when a VM gets corrupted... When we finally get a call that a VM is having issues, we've found that it was corrupted a while ago.
>>>>
>>>> -----Original Message-----
>>>> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
>>>> Sent: Sunday, January 27, 2019 1:45 PM
>>>> To: us...@cloudstack.apache.org
>>>> Cc: dev@cloudstack.apache.org
>>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>>
>>>> Hello Sean,
>>>>
>>>> It seems that you've encountered the same issue that I've been facing for the last 5-6 years of using ACS with KVM hosts (see this thread if you're interested in additional details: https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>>
>>>> I'd like to state that creating snapshots of a running virtual machine is a bit risky. I've implemented some workarounds in my environment, but I'm still not sure that they are 100% effective.
>>>>
>>>> I have a couple of questions, if you don't mind. What kind of storage do you use, if it's not a secret? Does your storage use XFS as a filesystem? Did you see something like this in your log-files?
>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>> Did you see any unusual messages in your log-file when the disaster happened?
>>>>
>>>> I hope things will be well. Wish you good luck and all the best!
>>>>
>>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>>> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We had some instances where VM disks became corrupted when using KVM snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>>
>>>>> The first time was when someone mass-enabled scheduled snapshots on a large number of VMs and secondary storage filled up. We had to restore all those VM disks... but believed it was just our fault for letting secondary storage fill up.
>>>>>
>>>>> Today we had an instance where a snapshot failed and now the disk image is corrupted and the VM can't boot. Here is the output of some commands:
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> We tried restoring to before the snapshot failure, but still have strange errors:
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> file format: qcow2
>>>>> virtual size: 50G (53687091200 bytes)
>>>>> disk size: 73G
>>>>> cluster_size: 65536
>>>>> Snapshot list:
>>>>> ID    TAG                                     VM SIZE     DATE                 VM CLOCK
>>>>> 1     a8fdf99f-8219-4032-a9c8-87a6e09e7f95    3.7G        2018-12-23 11:01:43  3099:35:55.242
>>>>> 2     b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd    3.8G        2019-01-06 11:03:16  3431:52:23.942
>>>>> Format specific information:
>>>>>     compat: 1.1
>>>>>     lazy refcounts: false
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
>>>>> No errors were found on the image.
>>>>>
>>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>> Snapshot list:
>>>>> ID    TAG                                     VM SIZE     DATE                 VM CLOCK
>>>>> 1     a8fdf99f-8219-4032-a9c8-87a6e09e7f95    3.7G        2018-12-23 11:01:43  3099:35:55.242
>>>>> 2     b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd    3.8G        2019-01-06 11:03:16  3431:52:23.942
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>> Everyone is now extremely hesitant to use snapshots in KVM... We tried deleting the snapshots in the restored disk image, but it errors out...
>>>>>
>>>>> Does anyone else have issues with KVM snapshots? We are considering just disabling this functionality now...
>>>>>
>>>>> Thanks
>>>>> Sean
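P.S. Sean mentioned earlier in this thread that the corruption is often discovered long after it actually happened. In case it helps anyone, here is a rough sketch of a periodic check I'd run from cron to notice it earlier (the /mnt/primary path is just a made-up example of a primary-storage mount point). Note that running 'qemu-img check' against a live, actively written image can report spurious problems, so this sketch only looks at volumes that no running process has open:

#!/bin/bash
# Sketch: periodically run 'qemu-img check' on qcow2 volumes that are not
# currently opened by any process (i.e. their VMs are not running) and
# log anything suspicious, so corruption is noticed earlier.

PRIMARY_STORAGE="/mnt/primary"   # hypothetical primary-storage mount point

for image in "$PRIMARY_STORAGE"/*; do
    # skip anything a running qemu process still has open
    if fuser -s "$image" 2>/dev/null; then
        continue
    fi
    # only inspect qcow2 volumes
    if ! qemu-img info "$image" 2>/dev/null | grep -q '^file format: qcow2'; then
        continue
    fi
    # a non-zero exit status means errors were found or the image couldn't be read
    if ! qemu-img check "$image" >/dev/null 2>&1; then
        logger -t qcow2-check "possible corruption in $image"
    fi
done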