Yes, that's the scariest thing: you never find out on the same day that an image has been corrupted. Usually a week or a fortnight passes before anyone notices the problem (and by that time all the old snapshots have already been removed).
Some time ago I implemented a simple script that ran `qemu-img check` against each image on a daily basis, but I had to give that idea up: `qemu-img check` usually reports a lot of errors on a running instance's volume, so its output is only trustworthy when the instance is stopped. :-(

Here is a bit of advice.

1. First of all, never take a snapshot while the VM shows high I/O activity. I implemented an SNMP agent that exposes the I/O activity of all VMs under a certain MIB, plus another application that manages snapshots and only creates a new one when it's reasonably sure the VM isn't writing a lot of data to the storage. I'd gladly share it, but implementing all of this is a bit tricky and I need some time to document it first. Of course, you can always implement your own solution for that (there's a rough sketch of the general idea below). Maybe it would even be a nice idea to implement this in ACS itself. :)

2. Consider dropping the caches every hour (`/bin/echo 1 > /proc/sys/vm/drop_caches`). I've found some correlation between image corruption and an overflowing page cache; see the second sketch below for one way to schedule it with cron.

I'm still not 100% sure these measures guarantee calm sleep at night, but my statistics (~600 VMs on different hosts, clusters, pods and zones) show that implementing them was a step in the right direction (knocking on wood, spitting over the left shoulder, etc.).
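For the first point, this is not my actual SNMP-based tool, just a minimal standalone sketch of the same idea using virsh: sample how many bytes a VM writes to its disk over a short window and only allow the snapshot if the amount stays under a threshold. The domain name, device and limits are placeholders you would have to adapt.

#!/bin/sh
# Minimal sketch (not the SNMP-based tool mentioned above): decide whether a VM
# is "quiet enough" to snapshot by sampling its block-device write counters.
# Usage: ./quiet-enough.sh <domain> [device], e.g. ./quiet-enough.sh i-2-123-VM vda

DOMAIN="$1"
DEVICE="${2:-vda}"          # guest disk to watch (placeholder)
WINDOW=30                   # observation window in seconds
LIMIT=$((10 * 1024 * 1024)) # allow at most ~10 MiB written during the window

wr_bytes() {
    # virsh domblkstat prints lines like "vda wr_bytes 29424640"
    virsh domblkstat "$DOMAIN" "$DEVICE" | awk '$2 == "wr_bytes" { print $3 }'
}

BEFORE=$(wr_bytes)
sleep "$WINDOW"
AFTER=$(wr_bytes)
WRITTEN=$((AFTER - BEFORE))

if [ "$WRITTEN" -lt "$LIMIT" ]; then
    echo "$DOMAIN wrote $WRITTEN bytes in ${WINDOW}s, looks quiet - OK to snapshot"
    exit 0
else
    echo "$DOMAIN wrote $WRITTEN bytes in ${WINDOW}s, too busy - postpone the snapshot"
    exit 1
fi

A real tool would retry later instead of just exiting, but this shows the check itself.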
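And for the second point, a minimal sketch of the hourly cache drop as a cron job (the file name is just an example; adjust the schedule to your taste):

# /etc/cron.d/drop-caches  (example file name)
# Flush dirty data first, then drop the page cache at the top of every hour.
# "1" drops only the page cache; "2" would drop dentries/inodes, "3" both.
0 * * * * root /bin/sync && /bin/echo 1 > /proc/sys/vm/drop_caches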
Good luck!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Friday, 1 February 2019 22:01, Sean Lair <sl...@ippathways.com> wrote:

> Hello,
>
> We are using NFS storage. It is actually native NFS mounts on a NetApp
> storage system. We haven't seen those log entries, but we also don't always
> know when a VM gets corrupted... When we finally get a call that a VM is
> having issues, we've found that it was corrupted a while ago.
>
> -----Original Message-----
> From: cloudstack-fan [mailto:cloudstack-...@protonmail.com.INVALID]
> Sent: Sunday, January 27, 2019 1:45 PM
> To: users@cloudstack.apache.org
> Cc: d...@cloudstack.apache.org
> Subject: Re: Snapshots on KVM corrupting disk images
>
> Hello Sean,
>
> It seems that you've encountered the same issue that I've been facing during
> the last 5-6 years of using ACS with KVM hosts (see this thread, if you're
> interested in additional details:
> https://mail-archives.apache.org/mod_mbox/cloudstack-users/201807.mbox/browser).
>
> I'd like to state that creating snapshots of a running virtual machine is a
> bit risky. I've implemented some workarounds in my environment, but I'm still
> not sure that they are 100% effective.
>
> I have a couple of questions, if you don't mind. What kind of storage do you
> use, if it's not a secret? Does your storage use XFS as a filesystem? Did you
> see something like this in your log-files?
>
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
> [***.***] XFS: qemu-kvm(***) possible memory allocation deadlock size 65552 in kmem_realloc (mode:0x250)
>
> Did you see any unusual messages in your log-file when the disaster happened?
>
> I hope things will be well. Wish you good luck and all the best!
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, 22 January 2019 18:30, Sean Lair sl...@ippathways.com wrote:
>
> > Hi all,
> >
> > We had some instances where VM disks are becoming corrupted when using KVM
> > snapshots. We are running CloudStack 4.9.3 with KVM on CentOS 7.
> >
> > The first time was when someone mass-enabled scheduled snapshots on a large
> > number of VMs and secondary storage filled up. We had to restore all those
> > VM disks... But believed it was just our fault with letting secondary
> > storage fill up.
> >
> > Today we had an instance where a snapshot failed and now the disk image is
> > corrupted and the VM can't boot. Here is the output of some commands:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > qemu-img: Could not open './184aa458-9d4b-4c1b-a3c6-23d28ea28e80': Could not read snapshots: File too large
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> >
> > We tried restoring to before the snapshot failure, but still have strange errors:
> >
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# ls -lh ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > -rw-r--r--. 1 root root 73G Jan 22 11:04 ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img info ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > image: ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > file format: qcow2
> > virtual size: 50G (53687091200 bytes)
> > disk size: 73G
> > cluster_size: 65536
> > Snapshot list:
> > ID   TAG                                    VM SIZE   DATE                  VM CLOCK
> > 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43   3099:35:55.242
> > 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16   3431:52:23.942
> > Format specific information:
> >     compat: 1.1
> >     lazy refcounts: false
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img check ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > tcmalloc: large alloc 1539750010880 bytes == (nil) @ 0x7fb9cbbf7bf3 0x7fb9cbc19488 0x7fb9cb71dc56 0x55d16ddf1c77 0x55d16ddf1edc 0x55d16ddf2541 0x55d16ddf465e 0x55d16ddf8ad1 0x55d16de336db 0x55d16de373e6 0x7fb9c63a3c05 0x55d16ddd9f7d
> > No errors were found on the image.
> > [root@cloudkvm02 c3be0ae5-2248-3ed6-a0c7-acffe25cc8d3]# qemu-img snapshot -l ./184aa458-9d4b-4c1b-a3c6-23d28ea28e80
> > Snapshot list:
> > ID   TAG                                    VM SIZE   DATE                  VM CLOCK
> > 1    a8fdf99f-8219-4032-a9c8-87a6e09e7f95      3.7G   2018-12-23 11:01:43   3099:35:55.242
> > 2    b4d74338-b0e3-4eeb-8bf8-41f6f75d9abd      3.8G   2019-01-06 11:03:16   3431:52:23.942
> >
> > Everyone is now extremely hesitant to use snapshots in KVM.... We tried
> > deleting the snapshots in the restored disk image, but it errors out...
> > Does anyone else have issues with KVM snapshots? We are considering just
> > disabling this functionality now...
> >
> > Thanks
> > Sean