By the way, Red Hat recommended suspending a VM before deleting a snapshot too: https://bugzilla.redhat.com/show_bug.cgi?id=920020. I'll quote it here:
> 1. Pause the VM
> 2. Take an internal snapshot with the 'savevm' command of the qemu monitor
>    of the running VM, not with an external qemu-img process. virsh may or
>    may not provide an interface for this.
> 3. You can resume the VM now
> 4. qemu-img convert -f qcow2 -O qcow2 -s "$SNAPDATE" $i $i-snapshot
> 5. Pause the VM again
> 6. 'delvm' in the qemu monitor
> 7. Resume the VM
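Driven through virsh, that sequence would look roughly like the sketch below. It's a sketch only: the domain name "myvm", the snapshot tag and the image paths are placeholders, and passing 'savevm'/'delvm' via `virsh qemu-monitor-command --hmp` assumes your libvirt allows raw monitor access (it will mark the domain as tainted).

  SNAPTAG="snap-$(date +%Y%m%d-%H%M%S)"

  virsh suspend myvm                                        # 1. pause the VM
  virsh qemu-monitor-command --hmp myvm "savevm $SNAPTAG"   # 2. internal snapshot via the qemu monitor
  virsh resume myvm                                         # 3. the VM can run again while we copy
  qemu-img convert -f qcow2 -O qcow2 -s "$SNAPTAG" \
    /path/to/disk.qcow2 /path/to/disk-"$SNAPTAG".qcow2      # 4. extract the snapshot to its own file
  virsh suspend myvm                                        # 5. pause again
  virsh qemu-monitor-command --hmp myvm "delvm $SNAPTAG"    # 6. delete the internal snapshot
  virsh resume myvm                                         # 7. resume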
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Monday, 4 February 2019 07:36, cloudstack-fan <cloudstack-...@protonmail.com> wrote:

> I'd also like to add another detail, if no one minds.
>
> Sometimes one can run into this issue without shutting down a VM. The
> disaster might occur right after a snapshot is copied to the secondary
> storage and deleted from the VM's image on the primary storage. I saw it a
> couple of times, when it happened to VMs that were being monitored. The
> monitoring suite showed that these VMs were working fine right until the
> final phase (apart from a short pause at the snapshot-creation stage).
>
> I also noticed that a VM is always suspended while a snapshot is being
> created and `virsh list` shows it in the "paused" state, but when a snapshot
> is being deleted from the image the same command always shows the "running"
> state, although the VM doesn't respond to anything during the snapshot
> deletion phase.
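That state change is easy to watch from the host while a snapshot job runs; a trivial loop like this is enough to see it (the domain name "i-2-345-VM" is just a placeholder):

  while true; do
    printf '%s ' "$(date +%T)"
    virsh domstate i-2-345-VM   # shows "paused" during creation, "running" during deletion (per the observation above)
    sleep 1
  done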
> It seems to be a bug of KVM/QEMU itself, I think. Proxmox users also face
> the same issue (see
> https://forum.proxmox.com/threads/qcow2-corruption-after-snapshot-or-heavy-disk-i-o.32865/,
> https://forum.proxmox.com/threads/proxmox-3-4-11qcow2-image-is-corrupt.25953/
> and other similar threads), but it would also be great to implement a
> workaround in ACS. Maybe, just as you proposed, it would be wise to suspend
> the VM before snapshot deletion and resume it afterwards. It would give ACS
> a serious advantage over other orchestration systems. :-)
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Friday, 1 February 2019 22:25, Ivan Kudryavtsev <kudryavtsev...@bw-sw.com> wrote:
>
>> Yes, only after the VM shutdown, the image is corrupted.
>>
>> Fri, 1 Feb 2019, 15:01 Sean Lair sl...@ippathways.com:
>>
>>> Hello,
>>>
>>> We are using NFS storage. It is actually native NFS mounts on a NetApp
>>> storage system. We haven't seen those log entries, but we also don't
>>> always know when a VM gets corrupted... When we finally get a call that a
>>> VM is having issues, we've found that it was corrupted a while ago.
>>>
>>> -----Original Message-----
>>> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
>>> Sent: Sunday, January 27, 2019 1:45 PM
>>> To: users@cloudstack.apache.org
>>> Cc: d...@cloudstack.apache.org
>>> Subject: Re: Snapshots on KVM corrupting disk images
>>>
>>> Hello Sean,
>>>
>>> It seems that you've encountered the same issue that I've been facing
>>> during the last 5-6 years of using ACS with KVM hosts (see this thread,
>>> if you're interested in additional details:
>>> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>>>
>>> I'd like to state that creating snapshots of a running virtual machine
>>> is a bit risky. I've implemented some workarounds in my environment, but
>>> I'm still not sure that they are 100% effective.
>>>
>>> I have a couple of questions, if you don't mind. What kind of storage do
>>> you use, if it's not a secret? Does your storage use XFS as a filesystem?
>>> Did you see something like this in your log files?
>>>
>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>>>
>>> Did you see any unusual messages in your log files when the disaster
>>> happened?
>>>
>>> I hope things will be well. Wish you good luck and all the best!
>>>
>>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>>> On Tuesday, 22 January 2019 18:30, Sean Lair <sl...@ippathways.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We had some instances where VM disks became corrupted when using KVM
>>>> snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
>>>>
>>>> The first time was when someone mass-enabled scheduled snapshots on a
>>>> large number of VMs and secondary storage filled up. We had to restore
>>>> all those VM disks... but believed it was just our fault for letting
>>>> secondary storage fill up.
>>>>
>>>> Today we had an instance where a snapshot failed and now the disk image
>>>> is corrupted and the VM can't boot. Here is the output of some commands:
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>>> Could not read snapshots: File too large
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80':
>>>> Could not read snapshots: File too large
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> We tried restoring to before the snapshot failure, but still have
>>>> strange errors:
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> file format: qcow2
>>>> virtual size: 50G (53687091200 bytes)
>>>> disk size: 73G
>>>> cluster_size: 65536
>>>> Snapshot list:
>>>> ID   TAG                                    VM SIZE   DATE                 VM CLOCK
>>>> 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43  3099:35:55.242
>>>> 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16  3431:52:23.942
>>>> Format specific information:
>>>>     compat: 1.1
>>>>     lazy refcounts: false
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3
>>>> 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541
>>>> 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05
>>>> 0x55d16ddd9f7d
>>>> No errors were found on the image.
>>>>
>>>> [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
>>>> Snapshot list:
>>>> ID   TAG                                    VM SIZE   DATE                 VM CLOCK
>>>> 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43  3099:35:55.242
>>>> 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16  3431:52:23.942
>>>>
>>>> --------------------------------------------------------------------------------
>>>>
>>>> Everyone is now extremely hesitant to use snapshots in KVM... We tried
>>>> deleting the snapshots in the restored disk image, but it errors out...
>>>>
>>>> Does anyone else have issues with KVM snapshots? We are considering just
>>>> disabling this functionality now...
>>>>
>>>> Thanks
>>>> Sean
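And for anyone who ends up with an image in the state Sean describes: before giving up, it may be worth trying qemu-img's own repair mode and a rewrite-convert, on a copy of the file and with the VM stopped. This is only a sketch (the paths are placeholders, neither step is guaranteed to help, and the convert requires that the snapshot table can still be read at all):

  cp ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80 /backup/disk-copy.qcow2   # never touch the only copy

  # let qemu-img try to repair refcount and leak errors in place
  qemu-img check -r all /backup/disk-copy.qcow2

  # if the image still opens, a full rewrite produces a clean file with
  # no internal snapshot table at all
  qemu-img convert -f qcow2 -O qcow2 /backup/disk-copy.qcow2 /backup/disk-flat.qcow2
  qemu-img check /backup/disk-flat.qcow2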