Bug#671860: marked as done (Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for more than 120 seconds)

Debian Bug Tracking System Tue, 19 Jun 2012 18:45:23 -0700

Your message dated Wed, 20 Jun 2012 02:42:19 +0100
with message-id <[email protected]>
and subject line Re: Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for 
more than 120 seconds
has caused the Debian Bug report #671860,
regarding Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for more than 
120 seconds
to be marked as done.


This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact [email protected]
immediately.)


-- 
671860: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671860
Debian Bug Tracking System
Contact [email protected] with problems

--- Begin Message ---

Package: linux-image-3.2.0-0.bpo.1-amd64
Version: 3.2.1-2~bpo60+1
Severity: grave

Actually, all kernels from 2.6.32 to (at least) 3.2.1 seem to be affected.

I run two identical machines:
Debian6 AMD64, raid1 software mirror for data disks, on /dev/mdx is lvm2
configured. The /dev/mapper devices are synced to the other machine
using drbd.
These drb-devices are used for virtual machines.

Sometimes, one machine degrades. Apparently the mdX_resync process and
blkback collide. Until now, I only observed this on Windows HVM used
disk devices.

Details:
A DomU running Windows Server 2003 (usually) or 2008R2 Server (rare)
(both with GPLPV, but different versions) get stuck accessing their
harddisk provided by xen from the drb device. Other linux domUs on the
same machine continue to work. The dom0 kern.log shows that the blkback
driver is blocked for more than 120 seconds.
The windows domU can't be stopped regularly with xm shutdown or xm
destroy. Graceful shutdown of the xen host isn't possible either; only
"echo b > /proc/sysrq-trigger" is possible.
This problem reappears every few weeks, usually in the night Saturday to
Sunday.
                
The Windows domUs worked previously on older xen hosts with no problem,
then were moved to the current machine pair with Debian 6 with Xen; all
versions from debian-stable i.e. kernel 2.6.32 and xen 4.0. After the
problem arose, I gradually updated the versions (xen 4.0->4.1, then
kernel 2.6.32->3.1->3.2). Currently, both machines run
linux-image-3.2.0-0.bpo.1-amd64 and xen-hypervisor-4.1-amd64 4.1.1-1
incl xen-tools-4.1; problem persists.
                
Today I realized that the main trigger for the failure seems the
periodic check from /etc/cron.d/mdadm, which checks on the first day of
a month at 0:57.
Last month, the check didn't trigger the error, but yesterday it hit me
again after 58d uptime. For now, I hot-fixed this by setting
/etc/default/mdadm to autocheck=false, but other access patterns seem to
trigger the failure too.

Here's a snippet from the kern.log; this pattern will reapperar every 2
minutes until hard-resetted.

May  6 01:06:24 lady kernel: [4979042.044157] INFO: task blkback.12.hdb:16636 
blocked for more than 120 seconds.
May  6 01:06:24 lady kernel: [4979042.044255] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 01:06:24 lady kernel: [4979042.044345] blkback.12.hdb  D 
ffff880002c817e0     0 16636      2 0x00000000
May  6 01:06:24 lady kernel: [4979042.044355]  ffff880002c817e0 
0000000000000246 ffff880000000000 ffffffff8160d020
May  6 01:06:24 lady kernel: [4979042.044366]  0000000000013540 
ffff880007dbffd8 ffff880007dbffd8 0000000000013540
May  6 01:06:24 lady kernel: [4979042.044375]  ffff880002c817e0 
ffff880007dbe010 ffffffff81013949 0000000107cacc78
May  6 01:06:24 lady kernel: [4979042.044385] Call Trace:
May  6 01:06:24 lady kernel: [4979042.044398]  [<ffffffff81013949>] ? 
sched_clock+0x5/0x8
May  6 01:06:24 lady kernel: [4979042.044423]  [<ffffffffa0103673>] ? 
wait_barrier+0x94/0xcd [raid1]
May  6 01:06:24 lady kernel: [4979042.044432]  [<ffffffff81045e84>] ? 
try_to_wake_up+0x190/0x190
May  6 01:06:24 lady kernel: [4979042.044441]  [<ffffffffa0104a94>] ? 
make_request+0x11d/0x1689 [raid1]
May  6 01:06:24 lady kernel: [4979042.044454]  [<ffffffffa010cc7b>] ? 
__split_and_process_bio+0x520/0x532 [dm_mod]
May  6 01:06:24 lady kernel: [4979042.044463]  [<ffffffff810068e5>] ? 
xen_force_evtchn_callback+0x9/0xa
May  6 01:06:24 lady kernel: [4979042.044469]  [<ffffffff81006f92>] ? 
check_events+0x12/0x20
May  6 01:06:24 lady kernel: [4979042.044481]  [<ffffffffa00d7a9b>] ? 
md_make_request+0xbe/0x1b1 [md_mod]
May  6 01:06:24 lady kernel: [4979042.044490]  [<ffffffff8135ab25>] ? 
_raw_spin_unlock_irqrestore+0x10/0x11
May  6 01:06:24 lady kernel: [4979042.044498]  [<ffffffff811a807e>] ? 
generic_make_request+0x8e/0xcd
May  6 01:06:24 lady kernel: [4979042.044504]  [<ffffffff811a8196>] ? 
submit_bio+0xd9/0xf7
May  6 01:06:24 lady kernel: [4979042.044512]  [<ffffffff8112caf5>] ? 
bio_alloc_bioset+0x44/0xb3
May  6 01:06:24 lady kernel: [4979042.044520]  [<ffffffffa0410a5b>] ? 
dispatch_rw_block_io+0x49a/0x546 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044527]  [<ffffffff810068e5>] ? 
xen_force_evtchn_callback+0x9/0xa
May  6 01:06:24 lady kernel: [4979042.044533]  [<ffffffff81006f92>] ? 
check_events+0x12/0x20
May  6 01:06:24 lady kernel: [4979042.044539]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044546]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044552]  [<ffffffff81004299>] ? 
xen_mc_flush+0x12b/0x158
May  6 01:06:24 lady kernel: [4979042.044558]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044565]  [<ffffffffa0410d57>] ? 
__do_block_io_op+0x250/0x276 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044573]  [<ffffffffa041112a>] ? 
xen_blkif_schedule+0x302/0x3d0 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044580]  [<ffffffff810636c5>] ? 
wake_up_bit+0x20/0x20
May  6 01:06:24 lady kernel: [4979042.044587]  [<ffffffffa0410e28>] ? 
print_stats+0x95/0x95 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044593]  [<ffffffff81063289>] ? 
kthread+0x7a/0x82
May  6 01:06:24 lady kernel: [4979042.044600]  [<ffffffff81362274>] ? 
kernel_thread_helper+0x4/0x10
May  6 01:06:24 lady kernel: [4979042.044607]  [<ffffffff81360333>] ? 
int_ret_from_sys_call+0x7/0x1b
May  6 01:06:24 lady kernel: [4979042.044613]  [<ffffffff8135aebc>] ? 
retint_restore_args+0x5/0x6
May  6 01:06:24 lady kernel: [4979042.044618]  [<ffffffff81362270>] ? 
gs_change+0x13/0x13
May  6 01:06:24 lady kernel: [4979042.044623] INFO: task blkback.12.hdd:16637 
blocked for more than 120 seconds.
May  6 01:06:24 lady kernel: [4979042.044714] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 01:06:24 lady kernel: [4979042.044804] blkback.12.hdd  D 
ffff880005384300     0 16637      2 0x00000000
May  6 01:06:24 lady kernel: [4979042.044811]  ffff880005384300 
0000000000000246 ffff880000000000 ffffffff8160d020
May  6 01:06:24 lady kernel: [4979042.044821]  0000000000013540 
ffff8800061f7fd8 ffff8800061f7fd8 0000000000013540
May  6 01:06:24 lady kernel: [4979042.044830]  ffff880005384300 
ffff8800061f6010 ffff8800053846b8 00000001061f6000
May  6 01:06:24 lady kernel: [4979042.044840] Call Trace:
May  6 01:06:24 lady kernel: [4979042.044848]  [<ffffffffa0103673>] ? 
wait_barrier+0x94/0xcd [raid1]
May  6 01:06:24 lady kernel: [4979042.044854]  [<ffffffff81045e84>] ? 
try_to_wake_up+0x190/0x190
May  6 01:06:24 lady kernel: [4979042.044862]  [<ffffffffa0104a94>] ? 
make_request+0x11d/0x1689 [raid1]
May  6 01:06:24 lady kernel: [4979042.044869]  [<ffffffff8112caf5>] ? 
bio_alloc_bioset+0x44/0xb3
May  6 01:06:24 lady kernel: [4979042.044879]  [<ffffffffa010cc7b>] ? 
__split_and_process_bio+0x520/0x532 [dm_mod]
May  6 01:06:24 lady kernel: [4979042.044885]  [<ffffffff811a7ffe>] ? 
generic_make_request+0xe/0xcd
May  6 01:06:24 lady kernel: [4979042.044895]  [<ffffffffa00d7a9b>] ? 
md_make_request+0xbe/0x1b1 [md_mod]
May  6 01:06:24 lady kernel: [4979042.044901]  [<ffffffff811a807e>] ? 
generic_make_request+0x8e/0xcd
May  6 01:06:24 lady kernel: [4979042.044907]  [<ffffffff811a8196>] ? 
submit_bio+0xd9/0xf7
May  6 01:06:24 lady kernel: [4979042.044913]  [<ffffffff8112caf5>] ? 
bio_alloc_bioset+0x44/0xb3
May  6 01:06:24 lady kernel: [4979042.044920]  [<ffffffffa0410a5b>] ? 
dispatch_rw_block_io+0x49a/0x546 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044927]  [<ffffffff8136239b>] ? 
xen_hypervisor_callback+0x1b/0x20
May  6 01:06:24 lady kernel: [4979042.044933]  [<ffffffff810068e5>] ? 
xen_force_evtchn_callback+0x9/0xa
May  6 01:06:24 lady kernel: [4979042.044939]  [<ffffffff81006f92>] ? 
check_events+0x12/0x20
May  6 01:06:24 lady kernel: [4979042.044945]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044952]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044957]  [<ffffffff81004299>] ? 
xen_mc_flush+0x12b/0x158
May  6 01:06:24 lady kernel: [4979042.044963]  [<ffffffff81006f7f>] ? 
xen_restore_fl_direct_reloc+0x4/0x4
May  6 01:06:24 lady kernel: [4979042.044971]  [<ffffffffa0410d57>] ? 
__do_block_io_op+0x250/0x276 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044978]  [<ffffffffa041112a>] ? 
xen_blkif_schedule+0x302/0x3d0 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044984]  [<ffffffff810636c5>] ? 
wake_up_bit+0x20/0x20
May  6 01:06:24 lady kernel: [4979042.044991]  [<ffffffffa0410e28>] ? 
print_stats+0x95/0x95 [xen_blkback]
May  6 01:06:24 lady kernel: [4979042.044997]  [<ffffffff81063289>] ? 
kthread+0x7a/0x82
May  6 01:06:24 lady kernel: [4979042.045002]  [<ffffffff81362274>] ? 
kernel_thread_helper+0x4/0x10
May  6 01:06:24 lady kernel: [4979042.045008]  [<ffffffff81360333>] ? 
int_ret_from_sys_call+0x7/0x1b
May  6 01:06:24 lady kernel: [4979042.045014]  [<ffffffff8135aebc>] ? 
retint_restore_args+0x5/0x6
May  6 01:06:24 lady kernel: [4979042.045020]  [<ffffffff81362270>] ? 
gs_change+0x13/0x13
May  6 01:06:24 lady kernel: [4979042.045031] INFO: task md4_resync:8724 
blocked for more than 120 seconds.
May  6 01:06:24 lady kernel: [4979042.045102] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
May  6 01:06:24 lady kernel: [4979042.045195] md4_resync      D 
ffff8800088f69a0     0  8724      2 0x00000000
May  6 01:06:24 lady kernel: [4979042.045203]  ffff8800088f69a0 
0000000000000246 ffff880000000000 ffff880014cbf560
May  6 01:06:24 lady kernel: [4979042.045213]  0000000000013540 
ffff880001a9dfd8 ffff880001a9dfd8 0000000000013540
May  6 01:06:24 lady kernel: [4979042.045222]  ffff8800088f69a0 
ffff880001a9c010 dead000000100100 0000000100200200
May  6 01:06:24 lady kernel: [4979042.045231] Call Trace:
May  6 01:06:24 lady kernel: [4979042.045239]  [<ffffffffa010356e>] ? 
raise_barrier+0x126/0x15b [raid1]
May  6 01:06:24 lady kernel: [4979042.045245]  [<ffffffff81045e84>] ? 
try_to_wake_up+0x190/0x190
May  6 01:06:24 lady kernel: [4979042.045253]  [<ffffffffa0103c15>] ? 
sync_request+0x19b/0x71c [raid1]
May  6 01:06:24 lady kernel: [4979042.045265]  [<ffffffffa00d8e74>] ? 
md_do_sync+0x78a/0xb98 [md_mod]
May  6 01:06:24 lady kernel: [4979042.045272]  [<ffffffff810636c5>] ? 
wake_up_bit+0x20/0x20
May  6 01:06:24 lady kernel: [4979042.045282]  [<ffffffffa00d9507>] ? 
md_thread+0x105/0x123 [md_mod]
May  6 01:06:24 lady kernel: [4979042.045292]  [<ffffffffa00d9402>] ? 
md_rdev_init+0xea/0xea [md_mod]
May  6 01:06:24 lady kernel: [4979042.045298]  [<ffffffff81063289>] ? 
kthread+0x7a/0x82
May  6 01:06:24 lady kernel: [4979042.045303]  [<ffffffff81362274>] ? 
kernel_thread_helper+0x4/0x10
May  6 01:06:24 lady kernel: [4979042.045310]  [<ffffffff81360333>] ? 
int_ret_from_sys_call+0x7/0x1b
May  6 01:06:24 lady kernel: [4979042.045315]  [<ffffffff8135aebc>] ? 
retint_restore_args+0x5/0x6
May  6 01:06:24 lady kernel: [4979042.045321]  [<ffffffff81362270>] ? 
gs_change+0x13/0x13

--- End Message ---

--- Begin Message ---

Version: 3.2.17-1

On Sun, 2012-06-17 at 11:20 +0200, Andreas Pflug wrote:
> I've been upgrading the machines to 3.2.0-0.bpo.2 (3.2.17), and
> rescheduled the checkarray to occur every weekend. So far, after two
> weekends, no more freeze occurrences. Previously, every 1-2 checkarray
> runs the machine used to stall. Seems the bug is fixed now.

OK, closing the bug.

If you want this fixed in the standard stable kernel package as well,
you (or someone else) will need to test the patch as I requested in
<http://bugs.debian.org/671860#18>.

Ben.

-- 
Ben Hutchings
It is easier to change the specification to fit the program than vice versa.

signature.asc
Description: This is a digitally signed message part

--- End Message ---

Bug#671860: marked as done (Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for more than 120 seconds)

Reply via email to