Yes, that's the scariest thing: you never find out on the same day that an image has been corrupted. Usually a week or a fortnight passes before anyone notices the problem (and by that time all the old snapshots have already been removed).
Some time ago I implemented a simple script that ran `qemu-img check` against each image on a daily basis, but I had to give that idea up: `qemu-img check` usually reports a lot of errors on a running instance's volume, so its output is only trustworthy when the instance is stopped. :-(

Here is a bit of advice.

1. First of all, never take a snapshot while the VM shows high I/O activity. I implemented an SNMP agent that exposes the I/O activity of all VMs under a certain MIB, plus another application that manages snapshots and only creates a new one when it's reasonably sure the VM isn't writing a lot of data to the storage. I'd gladly share it, but implementing all of this is a bit tricky and I need some time to document it first. Of course, you can always implement your own solution for that (there's a rough sketch of the general idea below). Maybe it would even be a nice idea to implement this in ACS itself. :)

2. Consider dropping the caches every hour (`/bin/echo 1 > /proc/sys/vm/drop_caches`). I've found some correlation between image corruption and an overflowing page cache; see the second sketch below for one way to schedule it with cron.

I'm still not 100% sure these measures guarantee calm sleep at night, but my statistics (~600 VMs on different hosts, clusters, pods and zones) show that implementing them was a step in the right direction (knocking on wood, spitting over the left shoulder, etc.).
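For the first point, this is not my actual SNMP-based tool, just a minimal standalone sketch of the same idea using virsh: sample how many bytes a VM writes to its disk over a short window and only allow the snapshot if the amount stays under a threshold. The domain name, device and limits are placeholders you would have to adapt.

#!/bin/sh
# Minimal sketch (not the SNMP-based tool mentioned above): decide whether a VM
# is "quiet enough" to snapshot by sampling its block-device write counters.
# Usage: ./quiet-enough.sh <domain> [device], e.g. ./quiet-enough.sh i-2-123-VM vda

DOMAIN="$1"
DEVICE="${2:-vda}"          # guest disk to watch (placeholder)
WINDOW=30                   # observation window in seconds
LIMIT=$((10 * 1024 * 1024)) # allow at most ~10 MiB written during the window

wr_bytes() {
    # virsh domblkstat prints lines like "vda wr_bytes 29424640"
    virsh domblkstat "$DOMAIN" "$DEVICE" | awk '$2 == "wr_bytes" { print $3 }'
}

BEFORE=$(wr_bytes)
sleep "$WINDOW"
AFTER=$(wr_bytes)
WRITTEN=$((AFTER - BEFORE))

if [ "$WRITTEN" -lt "$LIMIT" ]; then
    echo "$DOMAIN wrote $WRITTEN bytes in ${WINDOW}s, looks quiet - OK to snapshot"
    exit 0
else
    echo "$DOMAIN wrote $WRITTEN bytes in ${WINDOW}s, too busy - postpone the snapshot"
    exit 1
fi

A real tool would retry later instead of just exiting, but this shows the check itself.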
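And for the second point, a minimal sketch of the hourly cache drop as a cron job (the file name is just an example; adjust the schedule to your taste):

# /etc/cron.d/drop-caches  (example file name)
# Flush dirty data first, then drop the page cache at the top of every hour.
# "1" drops only the page cache; "2" would drop dentries/inodes, "3" both.
0 * * * * root /bin/sync && /bin/echo 1 > /proc/sys/vm/drop_caches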
Good luck!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, 1 February 2019 22:01, Sean Lair <sl...@ippathways.com> wrote:

> Hello,
>
> We are using NFS storage. It is actually native NFS mounts on a NetApp
> storage system. We haven't seen those log entries, but we also don't always
> know when a VM gets corrupted... When we finally get a call that a VM is
> having issues, we've found that it was corrupted a while ago.
>
> -----Original Message-----
> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
> Sent: Sunday, January 27, 2019 1:45 PM
> To: users@cloudstack.apache.org
> Cc: d...@cloudstack.apache.org
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing during
> the last 5-6 years of using ACS with KVM hosts (see this thread, if you're
> interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is a
> bit risky. I've implemented some workarounds in my environment, but I'm still
> not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do you
> use, if it's not a secret? Does your storage use XFS as a filesystem? Did you
> see something like this in your log-files?
>
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope things will be well. Wish you good luck and all the best!
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair sl...@ippathways.com wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using KVM
> > snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a large
> > number of VMs and secondary storage filled up. We had to restore all those
> > VM disks... But believed it was just our fault with letting secondary
> > storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image is
> > corrupted and the VM can't boot. Here is the output of some commands:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID   TAG                                    VM SIZE   DATE                  VM CLOCK
> > 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43   3099:35:55.242
> > 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16   3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID   TAG                                    VM SIZE   DATE                  VM CLOCK
> > 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43   3099:35:55.242
> > 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16   3431:52:23.942
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> > Does anyone else have issues with KVM snapshots? We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean