Re: writes to a virtio block device hungs

2008-09-26 Thread Michael Tokarev

Marcelo Tosatti wrote:

On Tue, Sep 23, 2008 at 11:06:11AM +0400, Michael Tokarev wrote:

(both host and guests are linux machines), I placed
one virtual machine into production use, and almost
immediately come... issues.  Here's what it looks like
from the guest:

Sep 21 10:35:52 hobbit kernel: INFO: task cleanup:20535 blocked for more than 120 seconds.
Sep 21 10:35:52 hobbit kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 10:35:52 hobbit kernel: cleanup   D  0 20535   1570
Sep 21 10:35:52 hobbit kernel:f73b39c0 00200086   c3a2ba48  f7022e00
Sep 21 10:35:52 hobbit kernel:dbc48ed4 f789c000 c0399080 c0157e48 000e  d05e1b80 d05e1ce4
Sep 21 10:35:52 hobbit kernel:0002 00200286 c01322f7 d05e1ce4 c0131ef0 dbc48ec8 00200286 c0132486
Sep 21 10:35:52 hobbit kernel: Call Trace:
Sep 21 10:35:52 hobbit kernel:  [] find_get_pages_tag+0x38/0x80
Sep 21 10:35:52 hobbit kernel:  [] lock_timer_base+0x27/0x60
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] __mod_timer+0x86/0xa0
Sep 21 10:35:52 hobbit kernel:  [] schedule_timeout+0x58/0xb0
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] journal_stop+0xa4/0x1b0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] journal_start+0x88/0xc0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] __writeback_single_inode+0x282/0x390
Sep 21 10:35:52 hobbit kernel:  [] generic_writepages+0x20/0x30
Sep 21 10:35:52 hobbit kernel:  [] do_writepages+0x49/0x50
Sep 21 10:35:52 hobbit kernel:  [] __filemap_fdatawrite_range+0x71/0x90
Sep 21 10:35:52 hobbit kernel:  [] sync_inode+0x21/0x40
Sep 21 10:35:52 hobbit kernel:  [] ext3_sync_file+0x9e/0xc0 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] do_fsync+0x6e/0xb0
Sep 21 10:35:52 hobbit kernel:  [] __do_fsync+0x27/0x50
Sep 21 10:35:52 hobbit kernel:  [] sysenter_past_esp+0x78/0xb1
Sep 21 10:35:52 hobbit kernel:  ===

It's almost always after fsync, but I guess it's due to the fact that
cleanup (from Postfix) process is the one who does that most often.


I'm waiting for an opportunity to install a new kernel with a new kvm...
still hoping.


Meanwhile I installed kvm-75, which did NOT change anything: the system
still hangs.  What really changed things was switching the guest to a single
processor (it had 2 before, out of the 4-core Phenom).


Are you using ext3 in the host as the filesystem to back the guest
image? If so, try writeback instead of ordered mode:
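
The command that followed the colon above is not preserved in this
archive.  As a rough sketch only, and not what was originally suggested:
ext3 selects its journaling mode with the data= mount option
(data=ordered is the default, data=writeback is the alternative
mentioned here).  The Python snippet below merely reports which mode
each ext3 filesystem shows in /proc/mounts-style text.  The function
name and the sample devices and mount points are made up for
illustration, and whether the kernel lists data= there at all depends
on the kernel version.

```python
def ext3_data_modes(mounts_text):
    """Map mount point -> data= mode (None if not listed) for ext3 mounts."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "ext3":
            continue
        mode = None
        for opt in fields[3].split(","):
            if opt.startswith("data="):
                mode = opt[len("data="):]
        modes[fields[1]] = mode
    return modes


# Hypothetical sample input; on a real host, read /proc/mounts instead.
sample = (
    "/dev/sda1 / ext3 rw,errors=remount-ro,data=ordered 0 0\n"
    "/dev/sda2 /images ext3 rw,data=writeback 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(ext3_data_modes(sample))
```

On a real host the input would come from open("/proc/mounts").read().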


On the host there's an MD device (raid1) that holds the complete "raw" disk
image for the guest.  It was in my email:

>> The device in question is a virtio block device (vda), which is on top
>> of a raid1 device on the host (/dev/md_d5, partitioned).  [...]

I'm trying to set up a test system to debug the case further,
because it's impossible to do that on the production machine.

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: writes to a virtio block device hungs

2008-09-23 Thread Michael Tokarev

[Replying to my own email...]

Michael Tokarev wrote:

Hello!  It's my first email to this list.. ;)

After experimenting for some time with KVM on linux
(both host and guests are linux machines), I placed
one virtual machine into production use, and almost
immediately come... issues.  Here's what it looks like
from the guest:

Sep 21 10:35:52 hobbit kernel: INFO: task cleanup:20535 blocked for more than 120 seconds.
Sep 21 10:35:52 hobbit kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 10:35:52 hobbit kernel: cleanup   D  0 20535   1570
Sep 21 10:35:52 hobbit kernel:f73b39c0 00200086   c3a2ba48  f7022e00
Sep 21 10:35:52 hobbit kernel:dbc48ed4 f789c000 c0399080 c0157e48 000e  d05e1b80 d05e1ce4
Sep 21 10:35:52 hobbit kernel:0002 00200286 c01322f7 d05e1ce4 c0131ef0 dbc48ec8 00200286 c0132486
Sep 21 10:35:52 hobbit kernel: Call Trace:
Sep 21 10:35:52 hobbit kernel:  [] find_get_pages_tag+0x38/0x80
Sep 21 10:35:52 hobbit kernel:  [] lock_timer_base+0x27/0x60
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] __mod_timer+0x86/0xa0
Sep 21 10:35:52 hobbit kernel:  [] schedule_timeout+0x58/0xb0
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] journal_stop+0xa4/0x1b0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] journal_start+0x88/0xc0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] __writeback_single_inode+0x282/0x390
Sep 21 10:35:52 hobbit kernel:  [] generic_writepages+0x20/0x30
Sep 21 10:35:52 hobbit kernel:  [] do_writepages+0x49/0x50
Sep 21 10:35:52 hobbit kernel:  [] __filemap_fdatawrite_range+0x71/0x90
Sep 21 10:35:52 hobbit kernel:  [] sync_inode+0x21/0x40
Sep 21 10:35:52 hobbit kernel:  [] ext3_sync_file+0x9e/0xc0 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] do_fsync+0x6e/0xb0
Sep 21 10:35:52 hobbit kernel:  [] __do_fsync+0x27/0x50
Sep 21 10:35:52 hobbit kernel:  [] sysenter_past_esp+0x78/0xb1
Sep 21 10:35:52 hobbit kernel:  ===

It's almost always after fsync, but I guess it's due to the fact that
cleanup (from Postfix) process is the one who does that most often.

After first such message (after which corresponding process will sleep
forever), no write to the corresponding device will succeed - all will
stall the same way.  It looks like kvm just "forgets" about each and
every write, effectively turning the device into a black hole -- but
only writes, reads are all ok.

Obviously the system will not reboot in that state, only force-reboot
(echo b > /proc/sysrq-trigger), or a "power-off" from the guest will
help.

The device in question is a virtio block device (vda), which is on top
of a raid1 device on the host (/dev/md_d5, partitioned).  The problem
happens after some up-time, from several hours to 2 days, usually under
heavy load.

The system is an Asus M3A-H/HDMI motherboard (AMD 780G/SB700 chipset),
with an AMD Phenom 9750 CPU and 8GB ECC memory.  Stock 2.6.26.5 kernel,
with KVM optimizations (KVM_TIME etc) turned on in guest.  kvm-72.

I'm running it with IDE emulation right now, to see if it will change
something or not.


With IDE (as opposed to virtio) the situation is exactly the same,
switching if=virtio to if=ide didn't change anything at all.


The question is -- should I try with a later kvm (kernel and userspace)
first?  The thing is that it's a production machine so any downtime is
not good...


I'm waiting for an opportunity to install a new kernel with a new kvm...
still hoping.


Thank you!

/mjt



writes to a virtio block device hungs

2008-09-22 Thread Michael Tokarev

Hello!  It's my first email to this list.. ;)

After experimenting for some time with KVM on linux
(both host and guests are linux machines), I placed
one virtual machine into production use, and almost
immediately come... issues.  Here's what it looks like
from the guest:

Sep 21 10:35:52 hobbit kernel: INFO: task cleanup:20535 blocked for more than 120 seconds.
Sep 21 10:35:52 hobbit kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 21 10:35:52 hobbit kernel: cleanup   D  0 20535   1570
Sep 21 10:35:52 hobbit kernel:f73b39c0 00200086   c3a2ba48  f7022e00
Sep 21 10:35:52 hobbit kernel:dbc48ed4 f789c000 c0399080 c0157e48 000e  d05e1b80 d05e1ce4
Sep 21 10:35:52 hobbit kernel:0002 00200286 c01322f7 d05e1ce4 c0131ef0 dbc48ec8 00200286 c0132486
Sep 21 10:35:52 hobbit kernel: Call Trace:
Sep 21 10:35:52 hobbit kernel:  [] find_get_pages_tag+0x38/0x80
Sep 21 10:35:52 hobbit kernel:  [] lock_timer_base+0x27/0x60
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] __mod_timer+0x86/0xa0
Sep 21 10:35:52 hobbit kernel:  [] schedule_timeout+0x58/0xb0
Sep 21 10:35:52 hobbit kernel:  [] process_timeout+0x0/0x10
Sep 21 10:35:52 hobbit kernel:  [] journal_stop+0xa4/0x1b0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] journal_start+0x88/0xc0 [jbd]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] ext3_write_inode+0x0/0x40 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] __writeback_single_inode+0x282/0x390
Sep 21 10:35:52 hobbit kernel:  [] generic_writepages+0x20/0x30
Sep 21 10:35:52 hobbit kernel:  [] do_writepages+0x49/0x50
Sep 21 10:35:52 hobbit kernel:  [] __filemap_fdatawrite_range+0x71/0x90
Sep 21 10:35:52 hobbit kernel:  [] sync_inode+0x21/0x40
Sep 21 10:35:52 hobbit kernel:  [] ext3_sync_file+0x9e/0xc0 [ext3]
Sep 21 10:35:52 hobbit kernel:  [] do_fsync+0x6e/0xb0
Sep 21 10:35:52 hobbit kernel:  [] __do_fsync+0x27/0x50
Sep 21 10:35:52 hobbit kernel:  [] sysenter_past_esp+0x78/0xb1
Sep 21 10:35:52 hobbit kernel:  ===

It's almost always after fsync, but I guess it's due to the fact that
cleanup (from Postfix) process is the one who does that most often.
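
To see which processes get reported and how often, the hung-task lines
can be pulled out of the guest's syslog.  The following is only a small
Python sketch: the function name is made up, and it assumes each report
sits on a single line, as syslog writes it.

```python
import re

# Matches the hung-task watchdog report, e.g.
# "INFO: task cleanup:20535 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for every hung-task report."""
    hits = []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits


sample = [
    "Sep 21 10:35:52 hobbit kernel: INFO: task cleanup:20535 "
    "blocked for more than 120 seconds.",
    "Sep 21 10:35:52 hobbit kernel: Call Trace:",
]
print(find_hung_tasks(sample))  # [('cleanup', 20535, 120)]
```

In practice the lines would come from the guest's kernel log file
rather than from an inline sample.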

After first such message (after which corresponding process will sleep
forever), no write to the corresponding device will succeed - all will
stall the same way.  It looks like kvm just "forgets" about each and
every write, effectively turning the device into a black hole -- but
only writes, reads are all ok.
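
One way to notice this state before the 120-second watchdog message is
a small write probe run from cron or a monitoring script.  What follows
is a hypothetical Python sketch, not something used on the machine
described here: it writes and fsyncs a tiny file in a child process and
gives up after a timeout.  The function name, the probe file name and
the 30-second default are all assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Child script: write a small file and fsync it, the call seen in the trace.
CHILD = """
import os, sys
fd = os.open(sys.argv[1], os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"probe\\n")
os.fsync(fd)
os.close(fd)
"""


def write_probe(directory, timeout=30.0):
    """Return True if a write+fsync in `directory` finishes within `timeout` seconds."""
    path = os.path.join(directory, ".write_probe")
    child = subprocess.Popen([sys.executable, "-c", CHILD, path])
    try:
        ok = child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A child stuck in uninterruptible sleep (state D) cannot be killed,
        # so only report the stall and leave the child behind.
        return False
    if ok and os.path.exists(path):
        os.remove(path)
    return ok


if __name__ == "__main__":
    print(write_probe(tempfile.gettempdir()))
```

The write runs in a separate process because a task blocked in fsync
cannot be interrupted; the parent only waits with a timeout and so
stays responsive even when the device has stalled.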

Obviously the system will not reboot in that state, only force-reboot
(echo b > /proc/sysrq-trigger), or a "power-off" from the guest will
help.

The device in question is a virtio block device (vda), which is on top
of a raid1 device on the host (/dev/md_d5, partitioned).  The problem
happens after some up-time, from several hours to 2 days, usually under
heavy load.

The system is an Asus M3A-H/HDMI motherboard (AMD 780G/SB700 chipset),
with an AMD Phenom 9750 CPU and 8GB ECC memory.  Stock 2.6.26.5 kernel,
with KVM optimizations (KVM_TIME etc) turned on in guest.  kvm-72.

I'm running it with IDE emulation right now, to see if it will change
something or not.

The question is -- should I try with a later kvm (kernel and userspace)
first?  The thing is that it's a production machine so any downtime is
not good...

Thank you!

/mjt