Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
> Again, thanks to everyone.

Did you reach a conclusion or is there still a problem that might be a bug in KVM?

Stefan

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
The conclusion is that the backing file, stored on NFS and shared by all 950 hosts / 10500 guests, was deleted, which immediately caused a read-only filesystem on the guests. It seems there is no way to avoid that.

We developed a script to recover from that scenario if the same thing happens again. Basically it does:

* virsh stop
* qemu-nbd connect
* fsck
* qemu-nbd disconnect
* virsh start

Best regards.

Alejandro Comisario
MercadoLibre Cloud Services
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443

On Tue, Apr 1, 2014 at 10:52 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
> On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
>> Again, thanks to everyone.
>
> Did you reach a conclusion or is there still a problem that might be a bug in KVM?
>
> Stefan
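A minimal sketch of the recovery sequence listed in the message above, not the script that was actually used. The function name `recovery_plan`, the guest name, the image path, and the `/dev/nbd0p1` root partition are all assumptions made for illustration. It prints `virsh shutdown` because virsh has no `stop` subcommand, and it only prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Print (do not run) the recovery steps for one guest.
# recovery_plan, the guest name, image path and partition are hypothetical.
recovery_plan() {
    guest=$1                # libvirt domain name
    image=$2                # path to the guest's qcow2 image
    nbd=${3:-/dev/nbd0}     # NBD device; the nbd kernel module must be loaded
    part=${4:-${nbd}p1}     # assumed root partition inside the image

    # shutdown is asynchronous: the domain must be fully off before fsck
    echo "virsh shutdown $guest"
    echo "qemu-nbd --connect=$nbd $image"
    echo "fsck -y $part"
    echo "qemu-nbd --disconnect $nbd"
    echo "virsh start $guest"
}

recovery_plan guest01 /var/lib/libvirt/images/guest01.qcow2
```

Printing the plan first is a deliberate choice for a fleet of this size: the output can be inspected for one guest and then piped to `sh` once it looks right.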
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
Thanks Stefan, and thanks Michael too.

The situation on IRC was awkward. I didn't want to tell Michael "hey, everyone on the mailing list got it, and I'm here chatting with you and you didn't", so I assumed IRC was 9 times more pro than the mailing list, kept my head down, and assumed the communication error was on my side.

Still, IMHO, a user who is willing to give KVM enough of a chance to ask a question on IRC might come away feeling they are not geek enough to be there, and I don't mean on IRC specifically, but using the community for support while trying KVM.

In my case, while it was very important to understand what my options were regarding this issue, I knew I would find my answer no matter what, because I was determined to find it; I could have gone mad with 10.5K guests on my back. Yes, my experience was more on the "virsh stop; virsh start" side, but I still felt I needed you guys to figure this out.

Again, thanks to everyone.

Best.
Alejandro Comisario

On Fri, Mar 28, 2014 at 5:47 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
> Thanks for the update and for taking time to help on IRC.
>
> I feel you're being harsh on Alejandro though. Improving the quality of bug reports is important but it shouldn't be at the expense of quality of communication.
>
> We can't assume that everyone is an expert in troubleshooting KVM or Linux. Therefore we can't blame them, which will only drive people away and detract from the community.
>
> TL;DR post logs and error messages +1, berate him -1
>
> Stefan
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
27.03.2014 20:14, Alejandro Comisario wrote:
> Seems like virtio (kvm 1.0) doesn't expose timeout on the guest side (ubuntu 12.04 on host and guest). So, how can I adjust the timeout on the guest?

After a bit more talk on IRC yesterday, it turned out that the situation is _much_ more interesting than originally described. The OP claims to have 10500 guests running off an NFS server, and that after NFS server downtime the backing files disappeared (whatever that means), so they had to restore those files. Moreover, the OP didn't even bother to look at the guest's dmesg, being busy rebooting all 10500 guests.

> This solution is the most logical one, but I cannot apply it! Thanks for all the responses!

I suggested that the OP describe the _real_ situation instead of giving random half-pictures, take a look at the actual problem as reported in various places (most importantly the guest kernel log), and report _those_ hints to the list.

I also mentioned that, at least for some NFS servers, if a client has a file open on the server and this file is deleted, the server will report an error to the client when the client tries to access that file, and this has nothing at all to do with timeouts of any kind.

Thanks,

/mjt
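The guest-side evidence asked for above (mount state and kernel log) could be collected with something like the following. This is a sketch only: `ro_mounts` is a hypothetical helper name, and its optional file argument exists just so it can be tried against a saved copy of `/proc/mounts`.

```shell
#!/bin/sh
# List mount points whose option list contains the exact option "ro".
# Input format (as in /proc/mounts): device mountpoint fstype options dump pass
# ro_mounts is a hypothetical helper for this sketch.
ro_mounts() {
    awk '{
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}
```

Inside an affected guest one would run `ro_mounts` and then search the kernel log, for example with `dmesg | grep -i 'I/O error'`, and post both outputs to the list.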
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
> After a bit more talk on IRC yesterday, it turned out that the situation is _much_ more interesting than originally described. The OP claims to have 10500 guests running off an NFS server, and that after NFS server downtime the backing files disappeared (whatever that means), so they had to restore those files. Moreover, the OP didn't even bother to look at the guest's dmesg, being busy rebooting all 10500 guests.
>
> I suggested that the OP describe the _real_ situation instead of giving random half-pictures, take a look at the actual problem as reported in various places (most importantly the guest kernel log), and report _those_ hints to the list.
>
> I also mentioned that, at least for some NFS servers, if a client has a file open on the server and this file is deleted, the server will report an error to the client when the client tries to access that file, and this has nothing at all to do with timeouts of any kind.

Thanks for the update and for taking time to help on IRC.

I feel you're being harsh on Alejandro though. Improving the quality of bug reports is important but it shouldn't be at the expense of quality of communication.

We can't assume that everyone is an expert in troubleshooting KVM or Linux. Therefore we can't blame them, which will only drive people away and detract from the community.

TL;DR post logs and error messages +1, berate him -1

Stefan
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
Michael S. Tsirkin m...@redhat.com writes:

> On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
>> Hi List!
>>
>> Hope someone can help me. We had a big issue in our cloud the other day: a couple of our openstack regions (+2000 kvm guests with qcow2) went to a read-only filesystem on the guest side because the backing files directory (the openstack _base directory) was compromised and the data was lost. When we realized the data was lost, it took us 5 mins to restore the backup of the backing files, but by that time all the kvm guests had received some kind of IO error from the hypervisor layer and gone read-only on the root filesystem.
>>
>> My question would be: is there a way to hold the IO operations against the backing files (I thought that would be 99% READ operations) for a little longer, in a case where the backing files are missing (no IO possible) but recoverable within minutes? I'm asking because I don't quite understand what the process is and when it raises the error.
>>
>> Any tip on how to achieve this if possible, or information about how backing files work on kvm, would be amazing.
>>
>> Waiting for feedback! Kindest regards.
>> Alejandro Comisario
>
> I'm guessing this is what happened: guests timed out meanwhile. You can increase the timeout within the guest:
>
>     echo 600 > /sys/block/sda/device/timeout
>
> to time out after 10 minutes.
>
> If you have installed qemu guest agent on your system, you can do this from the host. Unfortunately, by default its memory can be pushed out to swap, and then on a disk error access there might fail :( Maybe we should consider mlock on all its memory, at least as an option.
>
> You could pause your guests and restart them after the issue is resolved, and we could I guess add functionality to pause VM on disk errors automatically. Stefan?

Would -drive rerror=stop do?
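The guest-side timeout change suggested above can be applied to every disk that exposes the attribute, rather than to `sda` alone. A minimal sketch, with assumptions: `set_disk_timeout` and the `SYSFS_ROOT` override are hypothetical names introduced here (the override only exists so the loop can be tried against a scratch directory), and the value written does not survive a reboot.

```shell
#!/bin/sh
# Write a timeout (in seconds) to every block device exposing device/timeout.
# set_disk_timeout and SYSFS_ROOT are hypothetical names for this sketch.
set_disk_timeout() {
    secs=$1
    root=${SYSFS_ROOT:-/sys}
    for f in "$root"/block/*/device/timeout; do
        [ -w "$f" ] || continue      # skip devices without the attribute
        echo "$secs" > "$f"
        echo "$f -> $(cat "$f")"     # read back what the kernel accepted
    done
}
```

Run as root inside the guest, for example `set_disk_timeout 600` for the 10-minute value mentioned in the thread.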
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
> Would -drive rerror=stop do?

I think it will. It's a pity it doesn't appear in --help output - would make it easier to find.

--
MST
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
On Thu, Mar 27, 2014 at 10:10:40AM +0200, Michael S. Tsirkin wrote:
> On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
>> Would -drive rerror=stop do?
>
> I think it will. It's a pity it doesn't appear in --help output - would make it easier to find.

It is documented on the man page. I'll send a patch to document it in the --help output too.

But there's still a problem because the guest can have a shorter timeout or the image may be NFS mounted on the host. In that case the guest may give up on the request before the host. Then there is nothing QEMU can do to avoid an error being returned to the application or the guest file system going into read-only mode.

So make sure the timeout inside the guest is high.

Stefan
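On the host side, the `rerror=stop` suggestion amounts to adding an error policy to the drive definition. The following is a sketch under assumptions: `drive_opts` is a hypothetical helper, the image path is made up, and `werror=stop` is included as an extra assumption since the thread only discusses read errors. With this policy QEMU pauses the guest on an I/O error instead of passing the error to the guest, and the guest can be resumed (for example with the monitor's `cont` command) once the storage is back.

```shell
#!/bin/sh
# Build a -drive argument that pauses the VM on read and write errors.
# drive_opts is a hypothetical helper; the image path below is made up.
drive_opts() {
    image=$1
    printf 'file=%s,if=virtio,format=qcow2,rerror=stop,werror=stop\n' "$image"
}

# Show the option as it would appear on the QEMU command line.
echo "-drive $(drive_opts /var/lib/libvirt/images/guest01.qcow2)"
```

As noted above, this only helps if the guest does not give up on the request first, so it complements a high guest-side timeout rather than replacing it.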
Re: [Qemu-devel] Massive read only kvm guests when backing file was missing
Seems like virtio (kvm 1.0) doesn't expose a timeout on the guest side (ubuntu 12.04 on host and guest). So, how can I adjust the timeout on the guest?

This solution is the most logical one, but I cannot apply it!

Thanks for all the responses!

Regards,

Alejandro Comisario
MercadoLibre Cloud Services
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443

On Thu, Mar 27, 2014 at 5:53 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
> It is documented on the man page. I'll send a patch to document it in the --help output too.
>
> But there's still a problem because the guest can have a shorter timeout or the image may be NFS mounted on the host. In that case the guest may give up on the request before the host. Then there is nothing QEMU can do to avoid an error being returned to the application or the guest file system going into read-only mode.
>
> So make sure the timeout inside the guest is high.
>
> Stefan
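Whether a guest disk has a tunable timeout depends on how it is attached: SCSI-type disks (sdX) expose `device/timeout` in sysfs, while virtio-blk disks (vdX) typically do not, which matches the observation in the message above. A small sketch to check this inside a guest; `list_disk_timeouts` and the `SYSFS_ROOT` override are hypothetical names added for illustration.

```shell
#!/bin/sh
# Report, per block device, whether a timeout attribute is exposed.
# list_disk_timeouts and SYSFS_ROOT are hypothetical names for this sketch.
list_disk_timeouts() {
    root=${SYSFS_ROOT:-/sys}
    for d in "$root"/block/*; do
        [ -d "$d" ] || continue
        name=${d##*/}                        # e.g. sda, vda
        if [ -r "$d/device/timeout" ]; then
            echo "$name timeout=$(cat "$d/device/timeout")"
        else
            echo "$name no-timeout-attribute"
        fi
    done
}
```

For disks reported without the attribute, the guest-side timeout cannot be raised this way, so the host-side error policy discussed earlier in the thread is the remaining option.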