Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-04-01 Thread Stefan Hajnoczi
On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
 Again, thanks to everyone.

Did you reach a conclusion or is there still a problem that might be a
bug in KVM?

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-04-01 Thread Alejandro Comisario
The conclusion is that the backing file stored on NFS, which is shared by
all 950 hosts / 10500 guests, was deleted, and that immediately caused a
read-only filesystem on the guests. It seems there is no way to avoid
that.

We developed a script to recover from that scenario if it happens again.
Basically doing:

* virsh stop
* qemu-nbd connect
* fsck
* qemu-nbd disconnect
* virsh start
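
The steps above could be scripted roughly as follows. This is a minimal
sketch, not the script actually used: the guest name, image path, NBD
device and partition number are hypothetical placeholders, "virsh stop" is
read here as "virsh destroy" (a hard stop), and the nbd kernel module is
assumed to be loaded already (for example with "modprobe nbd max_part=8").

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_guest GUEST IMAGE [NBD_DEVICE]
# All three arguments are illustrative; adjust to the real environment.
recover_guest() {
    guest=$1                 # libvirt domain name (hypothetical)
    image=$2                 # guest disk image (hypothetical path)
    nbd=${3:-/dev/nbd0}      # NBD device to attach the image to

    run virsh destroy "$guest" || return 1
    run qemu-nbd --connect="$nbd" "$image" || return 1

    # Assumes the root filesystem is the first partition of the image.
    # fsck exits 0 (clean) or 1 (errors corrected) on success.
    run fsck -y "${nbd}p1"
    fsck_rc=$?

    # Always detach the image, even if fsck failed.
    run qemu-nbd --disconnect "$nbd"
    [ "$fsck_rc" -le 1 ] || return 1

    run virsh start "$guest"
}
```

Running it with DRY_RUN=1 set first, e.g. "recover_guest vm1
/path/to/disk.qcow2", only prints the commands, which is a cheap way to
check them before touching a real guest.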

best regards.


Alejandro Comisario
MercadoLibre Cloud Services
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Tue, Apr 1, 2014 at 10:52 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Mon, Mar 31, 2014 at 09:51:23PM -0300, Alejandro Comisario wrote:
 Again, thanks to everyone.

 Did you reach a conclusion or is there still a problem that might be a
 bug in KVM?

 Stefan


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-31 Thread Alejandro Comisario
Thanks Stefan and thanks Michael also.

The situation on IRC was very particular: I didn't want to tell Michael
"hey, everyone on the mailing list got it, and I'm here chatting with you
and you didn't", so I assumed IRC was 9 times more pro than the mailing
list, decided to keep my head down, and assumed the communication error
was on my side.

Still, IMHO, I really believe that if you are a user willing to give
KVM enough of a chance to ask a question on IRC, you might feel you
are not geek enough to be there, and I don't mean being there on IRC, but
trying to use the community for support while you try KVM.

In my case, while it was very important to understand what my options
were regarding this issue, I knew I would find my answer no matter
what, because I was determined to find it. I could go mad with 10.5K
guests on my back; yes, my experience was more on the "virsh stop;
virsh start" side, but I still felt I needed you guys to help find
this out.

Again, thanks to everyone.

best.
Alejandro Comisario


On Fri, Mar 28, 2014 at 5:47 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
 27.03.2014 20:14, Alejandro Comisario wrote:
  Seems like virtio (kvm 1.0) doesnt expose timeout on the guest side
  (ubuntu 12.04 on host and guest).
  So, how can I adjust the timeout on the guest?

 After a bit more talks on IRC yesterday, it turned out that the situation
 is _much_ more interesting than originally described.  The OP claims to
 have 10500 guests running off an NFS server, and that after NFS server
 downtime, the backing files were disappeared (whatever it means), so
 they had to restore those files.  More, the OP didn't even bother to look
 at the guest's dmesg, being busy rebooting all 10500 guests.

  This solution is the most logical one, but i cannot apply it!
  thanks for all the responses!

 I suggested the OP to actually describe the _real_ situation, instead of
 giving random half-pictures, and actually take a look at the actual problem
 as reported in various places (most importantly the guest kernel log), and
 report _those_ hints to the list.  I also mentioned that, at least for some
 NFS servers, if a client has a file open on the server, and this file is
 deleted, the server will report error to the client when client tries to
 access that file, and this has nothing at all to do with timeouts of any
 kind.

 Thanks for the update and for taking time to help on IRC.  I feel you're
 being harsh on Alejandro though.

 Improving the quality of bug reports is important but it shouldn't be at
 the expense of quality of communication.  We can't assume that everyone
 is an expert in troubleshooting KVM or Linux.  Therefore we can't blame
 them, which will only drive people away and detract from the community.

 TL;DR post logs and error messages +1, berate him -1

 Stefan


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-28 Thread Michael Tokarev
27.03.2014 20:14, Alejandro Comisario wrote:
 Seems like virtio (kvm 1.0) doesnt expose timeout on the guest side
 (ubuntu 12.04 on host and guest).
 So, how can I adjust the timeout on the guest?

After some more discussion on IRC yesterday, it turned out that the situation
is _much_ more interesting than originally described.  The OP claims to
have 10500 guests running off an NFS server, and that after NFS server
downtime, the backing files had disappeared (whatever that means), so
they had to restore those files.  Moreover, the OP didn't even bother to look
at the guest's dmesg, being busy rebooting all 10500 guests.

 This solution is the most logical one, but i cannot apply it!
 thanks for all the responses!

I suggested that the OP actually describe the _real_ situation, instead of
giving random half-pictures, and take a look at the actual problem
as reported in various places (most importantly the guest kernel log), and
report _those_ hints to the list.  I also mentioned that, at least for some
NFS servers, if a client has a file open on the server, and this file is
deleted, the server will report an error to the client when the client tries to
access that file, and this has nothing at all to do with timeouts of any
kind.
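
As an illustration of checking the guest kernel log, here is a small
sketch of a filter for dmesg output. The search patterns are assumptions
about typical Linux kernel wording for block I/O errors and read-only
remounts, not messages quoted from this incident, so they may need
adjusting for a given kernel or filesystem.

```shell
#!/bin/sh
# Filter a kernel log on stdin for signs of block I/O errors and of a
# filesystem being remounted read-only. The patterns are assumptions about
# typical kernel messages, not taken from the thread.
scan_kernel_log() {
    grep -E -i 'I/O error|remounting filesystem read-only|EXT[234]-fs error'
}
```

Inside the guest this would be used as "dmesg | scan_kernel_log"; the
matching lines are the kind of hints worth posting to the list.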

Thanks,

/mjt


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-28 Thread Stefan Hajnoczi
On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
 27.03.2014 20:14, Alejandro Comisario wrote:
  Seems like virtio (kvm 1.0) doesnt expose timeout on the guest side
  (ubuntu 12.04 on host and guest).
  So, how can I adjust the timeout on the guest?
 
 After a bit more talks on IRC yesterday, it turned out that the situation
 is _much_ more interesting than originally described.  The OP claims to
 have 10500 guests running off an NFS server, and that after NFS server
 downtime, the backing files were disappeared (whatever it means), so
 they had to restore those files.  More, the OP didn't even bother to look
 at the guest's dmesg, being busy rebooting all 10500 guests.
 
  This solution is the most logical one, but i cannot apply it!
  thanks for all the responses!
 
 I suggested the OP to actually describe the _real_ situation, instead of
 giving random half-pictures, and actually take a look at the actual problem
 as reported in various places (most importantly the guest kernel log), and
 report _those_ hints to the list.  I also mentioned that, at least for some
 NFS servers, if a client has a file open on the server, and this file is
 deleted, the server will report error to the client when client tries to
 access that file, and this has nothing at all to do with timeouts of any
 kind.

Thanks for the update and for taking time to help on IRC.  I feel you're
being harsh on Alejandro though.

Improving the quality of bug reports is important but it shouldn't be at
the expense of quality of communication.  We can't assume that everyone
is an expert in troubleshooting KVM or Linux.  Therefore we can't blame
them, which will only drive people away and detract from the community.

TL;DR post logs and error messages +1, berate him -1

Stefan


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-27 Thread Markus Armbruster
Michael S. Tsirkin m...@redhat.com writes:

 On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
 Hi List!
 Hope some one can help me, we had a big issue in our cloud the other
 day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
 went read only filesystem from the guest side because the backing
 files directory (the openstack _base directory) was compromised and
 the data was lost, when we realized the data was lost, it took us 5
 mins to restore the backup of the backing files, but by that time all
 the kvm guests received some kind of IO error from the hypervisor
 layer, and went read only on root filesystem.
 
 My question would be, is there a way to hold the IO operations against
 the backing files ( i thought that would be 99% READ operations ) for
 a little longer ( im asking this because i dont quite understand what
 is the process and when it raises the error ) in a case the backing
 files are missing (no IO possible) but is recoverable within minutes ?
 
 Any tip  on how to achieve this if possible, or information about how
 backing files works on kvm, will be amazing.
 Waiting for feedback!
 
 kindest regards.
 Alejandro Comisario


 I'm guessing this is what happened: guests timed out meanwhile.
 You can increase the timeout within the guest:
 echo 600 > /sys/block/sda/device/timeout
 to time out after 10 minutes.

 If you have installed qemu guest agent on your system, you can do this
 from the host. Unfortunately by default its memory can be pushed out to swap
 and then, on a disk error, access to it might fail :(
 Maybe we should consider mlock on all its memory at least as an option.

 You could pause your guests, restart them after the issue is resolved,
 and we could I guess add functionality to pause VM on disk errors
 automatically.
 Stefan?

Would -drive rerror=stop do?
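
For illustration, a sketch of how such a -drive option might be put
together. The image path, if=virtio and format=qcow2 are assumptions, and
werror=stop is added here alongside the rerror=stop suggested above so that
write errors pause the guest as well.

```shell
#!/bin/sh
# Build a -drive option string that pauses the VM on read and write errors
# instead of reporting them to the guest. The interface and format values
# are assumptions for this example.
drive_opt() {
    image=$1
    echo "file=${image},if=virtio,format=qcow2,rerror=stop,werror=stop"
}

# Example invocation (not executed here; path and memory size are made up):
#   qemu-system-x86_64 -enable-kvm -m 1024 -drive "$(drive_opt /path/to/guest.qcow2)"
```

With this policy the guest is paused when an I/O error occurs; once the
storage is back it can be resumed with "cont" in the QEMU monitor, or with
"virsh resume" when the guest is managed by libvirt.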


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-27 Thread Michael S. Tsirkin
On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
 Michael S. Tsirkin m...@redhat.com writes:
 
  On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
  Hi List!
  Hope some one can help me, we had a big issue in our cloud the other
  day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
  went read only filesystem from the guest side because the backing
  files directory (the openstack _base directory) was compromised and
  the data was lost, when we realized the data was lost, it took us 5
  mins to restore the backup of the backing files, but by that time all
  the kvm guests received some kind of IO error from the hypervisor
  layer, and went read only on root filesystem.
  
  My question would be, is there a way to hold the IO operations against
  the backing files ( i thought that would be 99% READ operations ) for
  a little longer ( im asking this because i dont quite understand what
  is the process and when it raises the error ) in a case the backing
  files are missing (no IO possible) but is recoverable within minutes ?
  
  Any tip  on how to achieve this if possible, or information about how
  backing files works on kvm, will be amazing.
  Waiting for feedback!
  
  kindest regards.
  Alejandro Comisario
 
 
  I'm guessing this is what happened: guests timed out meanwhile.
  You can increase the timeout within the guest:
  echo 600 > /sys/block/sda/device/timeout
  to time out after 10 minutes.
 
  If you have installed qemu guest agent on your system, you can do this
  from the host. Unfortunately by default its memory can be pushed out to swap
  and then, on a disk error, access to it might fail :(
  Maybe we should consider mlock on all its memory at least as an option.
 
  You could pause your guests, restart them after the issue is resolved,
  and we could I guess add functionality to pause VM on disk errors
  automatically.
  Stefan?
 
 Would -drive rerror=stop do?

I think it will. It's a pity it doesn't appear in --help output -
would make it easier to find.

-- 
MST


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-27 Thread Stefan Hajnoczi
On Thu, Mar 27, 2014 at 10:10:40AM +0200, Michael S. Tsirkin wrote:
 On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
  Michael S. Tsirkin m...@redhat.com writes:
  
   On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
   Hi List!
   Hope some one can help me, we had a big issue in our cloud the other
   day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
   went read only filesystem from the guest side because the backing
   files directory (the openstack _base directory) was compromised and
   the data was lost, when we realized the data was lost, it took us 5
   mins to restore the backup of the backing files, but by that time all
   the kvm guests received some kind of IO error from the hypervisor
   layer, and went read only on root filesystem.
   
   My question would be, is there a way to hold the IO operations against
   the backing files ( i thought that would be 99% READ operations ) for
   a little longer ( im asking this because i dont quite understand what
   is the process and when it raises the error ) in a case the backing
   files are missing (no IO possible) but is recoverable within minutes ?
   
   Any tip  on how to achieve this if possible, or information about how
   backing files works on kvm, will be amazing.
   Waiting for feedback!
   
   kindest regards.
   Alejandro Comisario
  
  
   I'm guessing this is what happened: guests timed out meanwhile.
   You can increase the timeout within the guest:
   echo 600 > /sys/block/sda/device/timeout
   to time out after 10 minutes.
  
   If you have installed qemu guest agent on your system, you can do this
   from the host. Unfortunately by default its memory can be pushed out to swap
   and then, on a disk error, access to it might fail :(
   Maybe we should consider mlock on all its memory at least as an option.
  
   You could pause your guests, restart them after the issue is resolved,
   and we could I guess add functionality to pause VM on disk errors
   automatically.
   Stefan?
  
  Would -drive rerror=stop do?
 
 I think it will. It's a pity it doesn't appear in --help output -
 would make it easier to find.

It is documented on the man page.  I'll send a patch to document it in
the --help output too.

But there's still a problem because the guest can have a shorter timeout
or the image may be NFS mounted on the host.  In that case the guest may
give up on the request before the host.  Then there is nothing QEMU can
do to avoid an error being returned to the application or the guest file
system going into read-only mode.

So make sure the timeout inside the guest is high.
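
A minimal sketch of raising the timeout for every disk that exposes one,
to be run as root inside the guest. The function name and the optional
second argument (an alternative sysfs root, handy for trying it out) are
made up for this example. Disks without a device/timeout file, which as far
as I know includes virtio-blk disks, are simply skipped.

```shell
#!/bin/sh
# set_disk_timeouts SECONDS [SYSFS_ROOT]
# Write SECONDS into every <root>/<disk>/device/timeout file that exists
# and is writable. SYSFS_ROOT defaults to /sys/block.
set_disk_timeouts() {
    secs=$1
    root=${2:-/sys/block}
    for f in "$root"/*/device/timeout; do
        # Skips disks without a timeout file, and the unmatched glob itself.
        [ -w "$f" ] || continue
        echo "$secs" > "$f" && echo "set $f to $secs"
    done
}
```

For example, "set_disk_timeouts 600" would apply a 10 minute timeout. The
setting does not survive a reboot, so it would need to be reapplied at boot
(for instance from a udev rule or an init script).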

Stefan


Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-27 Thread Alejandro Comisario
Seems like virtio (kvm 1.0) doesn't expose a timeout on the guest side
(Ubuntu 12.04 on host and guest).
So, how can I adjust the timeout on the guest?

This solution is the most logical one, but I cannot apply it!
Thanks for all the responses!

regards


Alejandro Comisario
MercadoLibre Cloud Services
Arias 3751, Piso 7 (C1430CRG)
Ciudad de Buenos Aires - Argentina
Cel: +549(11) 15-3770-1857
Tel : +54(11) 4640-8443


On Thu, Mar 27, 2014 at 5:53 AM, Stefan Hajnoczi stefa...@gmail.com wrote:
 On Thu, Mar 27, 2014 at 10:10:40AM +0200, Michael S. Tsirkin wrote:
 On Thu, Mar 27, 2014 at 08:36:57AM +0100, Markus Armbruster wrote:
  Michael S. Tsirkin m...@redhat.com writes:
 
   On Wed, Mar 26, 2014 at 11:08:03PM -0300, Alejandro Comisario wrote:
   Hi List!
   Hope some one can help me, we had a big issue in our cloud the other
   day, a couple of our openstack regions ( +2000 kvm guests with qcow2 )
   went read only filesystem from the guest side because the backing
   files directory (the openstack _base directory) was compromised and
   the data was lost, when we realized the data was lost, it took us 5
   mins to restore the backup of the backing files, but by that time all
   the kvm guests received some kind of IO error from the hypervisor
   layer, and went read only on root filesystem.
  
   My question would be, is there a way to hold the IO operations against
   the backing files ( i thought that would be 99% READ operations ) for
   a little longer ( im asking this because i dont quite understand what
   is the process and when it raises the error ) in a case the backing
   files are missing (no IO possible) but is recoverable within minutes ?
  
   Any tip  on how to achieve this if possible, or information about how
   backing files works on kvm, will be amazing.
   Waiting for feedback!
  
   kindest regards.
   Alejandro Comisario
  
  
   I'm guessing this is what happened: guests timed out meanwhile.
   You can increase the timeout within the guest:
   echo 600 > /sys/block/sda/device/timeout
   to time out after 10 minutes.
  
   If you have installed qemu guest agent on your system, you can do this
   from the host. Unfortunately by default its memory can be pushed out to swap
   and then, on a disk error, access to it might fail :(
   Maybe we should consider mlock on all its memory at least as an option.
  
   You could pause your guests, restart them after the issue is resolved,
   and we could I guess add functionality to pause VM on disk errors
   automatically.
   Stefan?
 
  Would -drive rerror=stop do?

 I think it will. It's a pity it doesn't appear in --help output -
 would make it easier to find.

 It is documented on the man page.  I'll send a patch to document it in
 the --help output too.

 But there's still a problem because the guest can have a shorter timeout
 or the image may be NFS mounted on the host.  In that case the guest may
 give up on the request before the host.  Then there is nothing QEMU can
 do to avoid an error being returned to the application or the guest file
 system going into read-only mode.

 So make sure the timeout inside the guest is high.

 Stefan