Your message dated Wed, 20 Jun 2012 02:42:19 +0100 with message-id <1340156539.6871.95.ca...@deadeye.wl.decadent.org.uk> and subject line Re: Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for more than 120 seconds has caused the Debian Bug report #671860, regarding Disk freeze on 2.6.32 and 3.x kernels: blkback blocked for more than 120 seconds to be marked as done.
This means that you claim that the problem has been dealt with. If this is not the case it is now your responsibility to reopen the Bug report if necessary, and/or fix the problem forthwith. (NB: If you are a system administrator and have no idea what this message is talking about, this may indicate a serious mail system misconfiguration somewhere. Please contact ow...@bugs.debian.org immediately.) -- 671860: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=671860 Debian Bug Tracking System Contact ow...@bugs.debian.org with problems
--- Begin Message ---Package: linux-image-3.2.0-0.bpo.1-amd64 Version: 3.2.1-2~bpo60+1 Severity: grave Actually, all kernels from 2.6.32 to (at least) 3.2.1 seem to be affected. I run two identical machines: Debian6 AMD64, raid1 software mirror for data disks, on /dev/mdx is lvm2 configured. The /dev/mapper devices are synced to the other machine using drbd. These drb-devices are used for virtual machines. Sometimes, one machine degrades. Apparently the mdX_resync process and blkback collide. Until now, I only observed this on Windows HVM used disk devices. Details: A DomU running Windows Server 2003 (usually) or 2008R2 Server (rare) (both with GPLPV, but different versions) get stuck accessing their harddisk provided by xen from the drb device. Other linux domUs on the same machine continue to work. The dom0 kern.log shows that the blkback driver is blocked for more than 120 seconds. The windows domU can't be stopped regularly with xm shutdown or xm destroy. Graceful shutdown of the xen host isn't possible either; only "echo b > /proc/sysrq-trigger" is possible. This problem reappears every few weeks, usually in the night Saturday to Sunday. The Windows domUs worked previously on older xen hosts with no problem, then were moved to the current machine pair with Debian 6 with Xen; all versions from debian-stable i.e. kernel 2.6.32 and xen 4.0. After the problem arose, I gradually updated the versions (xen 4.0->4.1, then kernel 2.6.32->3.1->3.2). Currently, both machines run linux-image-3.2.0-0.bpo.1-amd64 and xen-hypervisor-4.1-amd64 4.1.1-1 incl xen-tools-4.1; problem persists. Today I realized that the main trigger for the failure seems the periodic check from /etc/cron.d/mdadm, which checks on the first day of a month at 0:57. Last month, the check didn't trigger the error, but yesterday it hit me again after 58d uptime. For now, I hot-fixed this by setting /etc/default/mdadm to autocheck=false, but other access patterns seem to trigger the failure too. Here's a snippet from the kern.log; this pattern will reapperar every 2 minutes until hard-resetted. May 6 01:06:24 lady kernel: [4979042.044157] INFO: task blkback.12.hdb:16636 blocked for more than 120 seconds. May 6 01:06:24 lady kernel: [4979042.044255] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 6 01:06:24 lady kernel: [4979042.044345] blkback.12.hdb D ffff880002c817e0 0 16636 2 0x00000000 May 6 01:06:24 lady kernel: [4979042.044355] ffff880002c817e0 0000000000000246 ffff880000000000 ffffffff8160d020 May 6 01:06:24 lady kernel: [4979042.044366] 0000000000013540 ffff880007dbffd8 ffff880007dbffd8 0000000000013540 May 6 01:06:24 lady kernel: [4979042.044375] ffff880002c817e0 ffff880007dbe010 ffffffff81013949 0000000107cacc78 May 6 01:06:24 lady kernel: [4979042.044385] Call Trace: May 6 01:06:24 lady kernel: [4979042.044398] [<ffffffff81013949>] ? sched_clock+0x5/0x8 May 6 01:06:24 lady kernel: [4979042.044423] [<ffffffffa0103673>] ? wait_barrier+0x94/0xcd [raid1] May 6 01:06:24 lady kernel: [4979042.044432] [<ffffffff81045e84>] ? try_to_wake_up+0x190/0x190 May 6 01:06:24 lady kernel: [4979042.044441] [<ffffffffa0104a94>] ? make_request+0x11d/0x1689 [raid1] May 6 01:06:24 lady kernel: [4979042.044454] [<ffffffffa010cc7b>] ? __split_and_process_bio+0x520/0x532 [dm_mod] May 6 01:06:24 lady kernel: [4979042.044463] [<ffffffff810068e5>] ? xen_force_evtchn_callback+0x9/0xa May 6 01:06:24 lady kernel: [4979042.044469] [<ffffffff81006f92>] ? check_events+0x12/0x20 May 6 01:06:24 lady kernel: [4979042.044481] [<ffffffffa00d7a9b>] ? md_make_request+0xbe/0x1b1 [md_mod] May 6 01:06:24 lady kernel: [4979042.044490] [<ffffffff8135ab25>] ? _raw_spin_unlock_irqrestore+0x10/0x11 May 6 01:06:24 lady kernel: [4979042.044498] [<ffffffff811a807e>] ? generic_make_request+0x8e/0xcd May 6 01:06:24 lady kernel: [4979042.044504] [<ffffffff811a8196>] ? submit_bio+0xd9/0xf7 May 6 01:06:24 lady kernel: [4979042.044512] [<ffffffff8112caf5>] ? bio_alloc_bioset+0x44/0xb3 May 6 01:06:24 lady kernel: [4979042.044520] [<ffffffffa0410a5b>] ? dispatch_rw_block_io+0x49a/0x546 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044527] [<ffffffff810068e5>] ? xen_force_evtchn_callback+0x9/0xa May 6 01:06:24 lady kernel: [4979042.044533] [<ffffffff81006f92>] ? check_events+0x12/0x20 May 6 01:06:24 lady kernel: [4979042.044539] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044546] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044552] [<ffffffff81004299>] ? xen_mc_flush+0x12b/0x158 May 6 01:06:24 lady kernel: [4979042.044558] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044565] [<ffffffffa0410d57>] ? __do_block_io_op+0x250/0x276 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044573] [<ffffffffa041112a>] ? xen_blkif_schedule+0x302/0x3d0 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044580] [<ffffffff810636c5>] ? wake_up_bit+0x20/0x20 May 6 01:06:24 lady kernel: [4979042.044587] [<ffffffffa0410e28>] ? print_stats+0x95/0x95 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044593] [<ffffffff81063289>] ? kthread+0x7a/0x82 May 6 01:06:24 lady kernel: [4979042.044600] [<ffffffff81362274>] ? kernel_thread_helper+0x4/0x10 May 6 01:06:24 lady kernel: [4979042.044607] [<ffffffff81360333>] ? int_ret_from_sys_call+0x7/0x1b May 6 01:06:24 lady kernel: [4979042.044613] [<ffffffff8135aebc>] ? retint_restore_args+0x5/0x6 May 6 01:06:24 lady kernel: [4979042.044618] [<ffffffff81362270>] ? gs_change+0x13/0x13 May 6 01:06:24 lady kernel: [4979042.044623] INFO: task blkback.12.hdd:16637 blocked for more than 120 seconds. May 6 01:06:24 lady kernel: [4979042.044714] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 6 01:06:24 lady kernel: [4979042.044804] blkback.12.hdd D ffff880005384300 0 16637 2 0x00000000 May 6 01:06:24 lady kernel: [4979042.044811] ffff880005384300 0000000000000246 ffff880000000000 ffffffff8160d020 May 6 01:06:24 lady kernel: [4979042.044821] 0000000000013540 ffff8800061f7fd8 ffff8800061f7fd8 0000000000013540 May 6 01:06:24 lady kernel: [4979042.044830] ffff880005384300 ffff8800061f6010 ffff8800053846b8 00000001061f6000 May 6 01:06:24 lady kernel: [4979042.044840] Call Trace: May 6 01:06:24 lady kernel: [4979042.044848] [<ffffffffa0103673>] ? wait_barrier+0x94/0xcd [raid1] May 6 01:06:24 lady kernel: [4979042.044854] [<ffffffff81045e84>] ? try_to_wake_up+0x190/0x190 May 6 01:06:24 lady kernel: [4979042.044862] [<ffffffffa0104a94>] ? make_request+0x11d/0x1689 [raid1] May 6 01:06:24 lady kernel: [4979042.044869] [<ffffffff8112caf5>] ? bio_alloc_bioset+0x44/0xb3 May 6 01:06:24 lady kernel: [4979042.044879] [<ffffffffa010cc7b>] ? __split_and_process_bio+0x520/0x532 [dm_mod] May 6 01:06:24 lady kernel: [4979042.044885] [<ffffffff811a7ffe>] ? generic_make_request+0xe/0xcd May 6 01:06:24 lady kernel: [4979042.044895] [<ffffffffa00d7a9b>] ? md_make_request+0xbe/0x1b1 [md_mod] May 6 01:06:24 lady kernel: [4979042.044901] [<ffffffff811a807e>] ? generic_make_request+0x8e/0xcd May 6 01:06:24 lady kernel: [4979042.044907] [<ffffffff811a8196>] ? submit_bio+0xd9/0xf7 May 6 01:06:24 lady kernel: [4979042.044913] [<ffffffff8112caf5>] ? bio_alloc_bioset+0x44/0xb3 May 6 01:06:24 lady kernel: [4979042.044920] [<ffffffffa0410a5b>] ? dispatch_rw_block_io+0x49a/0x546 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044927] [<ffffffff8136239b>] ? xen_hypervisor_callback+0x1b/0x20 May 6 01:06:24 lady kernel: [4979042.044933] [<ffffffff810068e5>] ? xen_force_evtchn_callback+0x9/0xa May 6 01:06:24 lady kernel: [4979042.044939] [<ffffffff81006f92>] ? check_events+0x12/0x20 May 6 01:06:24 lady kernel: [4979042.044945] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044952] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044957] [<ffffffff81004299>] ? xen_mc_flush+0x12b/0x158 May 6 01:06:24 lady kernel: [4979042.044963] [<ffffffff81006f7f>] ? xen_restore_fl_direct_reloc+0x4/0x4 May 6 01:06:24 lady kernel: [4979042.044971] [<ffffffffa0410d57>] ? __do_block_io_op+0x250/0x276 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044978] [<ffffffffa041112a>] ? xen_blkif_schedule+0x302/0x3d0 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044984] [<ffffffff810636c5>] ? wake_up_bit+0x20/0x20 May 6 01:06:24 lady kernel: [4979042.044991] [<ffffffffa0410e28>] ? print_stats+0x95/0x95 [xen_blkback] May 6 01:06:24 lady kernel: [4979042.044997] [<ffffffff81063289>] ? kthread+0x7a/0x82 May 6 01:06:24 lady kernel: [4979042.045002] [<ffffffff81362274>] ? kernel_thread_helper+0x4/0x10 May 6 01:06:24 lady kernel: [4979042.045008] [<ffffffff81360333>] ? int_ret_from_sys_call+0x7/0x1b May 6 01:06:24 lady kernel: [4979042.045014] [<ffffffff8135aebc>] ? retint_restore_args+0x5/0x6 May 6 01:06:24 lady kernel: [4979042.045020] [<ffffffff81362270>] ? gs_change+0x13/0x13 May 6 01:06:24 lady kernel: [4979042.045031] INFO: task md4_resync:8724 blocked for more than 120 seconds. May 6 01:06:24 lady kernel: [4979042.045102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. May 6 01:06:24 lady kernel: [4979042.045195] md4_resync D ffff8800088f69a0 0 8724 2 0x00000000 May 6 01:06:24 lady kernel: [4979042.045203] ffff8800088f69a0 0000000000000246 ffff880000000000 ffff880014cbf560 May 6 01:06:24 lady kernel: [4979042.045213] 0000000000013540 ffff880001a9dfd8 ffff880001a9dfd8 0000000000013540 May 6 01:06:24 lady kernel: [4979042.045222] ffff8800088f69a0 ffff880001a9c010 dead000000100100 0000000100200200 May 6 01:06:24 lady kernel: [4979042.045231] Call Trace: May 6 01:06:24 lady kernel: [4979042.045239] [<ffffffffa010356e>] ? raise_barrier+0x126/0x15b [raid1] May 6 01:06:24 lady kernel: [4979042.045245] [<ffffffff81045e84>] ? try_to_wake_up+0x190/0x190 May 6 01:06:24 lady kernel: [4979042.045253] [<ffffffffa0103c15>] ? sync_request+0x19b/0x71c [raid1] May 6 01:06:24 lady kernel: [4979042.045265] [<ffffffffa00d8e74>] ? md_do_sync+0x78a/0xb98 [md_mod] May 6 01:06:24 lady kernel: [4979042.045272] [<ffffffff810636c5>] ? wake_up_bit+0x20/0x20 May 6 01:06:24 lady kernel: [4979042.045282] [<ffffffffa00d9507>] ? md_thread+0x105/0x123 [md_mod] May 6 01:06:24 lady kernel: [4979042.045292] [<ffffffffa00d9402>] ? md_rdev_init+0xea/0xea [md_mod] May 6 01:06:24 lady kernel: [4979042.045298] [<ffffffff81063289>] ? kthread+0x7a/0x82 May 6 01:06:24 lady kernel: [4979042.045303] [<ffffffff81362274>] ? kernel_thread_helper+0x4/0x10 May 6 01:06:24 lady kernel: [4979042.045310] [<ffffffff81360333>] ? int_ret_from_sys_call+0x7/0x1b May 6 01:06:24 lady kernel: [4979042.045315] [<ffffffff8135aebc>] ? retint_restore_args+0x5/0x6 May 6 01:06:24 lady kernel: [4979042.045321] [<ffffffff81362270>] ? gs_change+0x13/0x13
--- End Message ---
--- Begin Message ---Version: 3.2.17-1 On Sun, 2012-06-17 at 11:20 +0200, Andreas Pflug wrote: > I've been upgrading the machines to 3.2.0-0.bpo.2 (3.2.17), and > rescheduled the checkarray to occur every weekend. So far, after two > weekends, no more freeze occurrences. Previously, every 1-2 checkarray > runs the machine used to stall. Seems the bug is fixed now. OK, closing the bug. If you want this fixed in the standard stable kernel package as well, you (or someone else) will need to test the patch as I requested in <http://bugs.debian.org/671860#18>. Ben. -- Ben Hutchings It is easier to change the specification to fit the program than vice versa.signature.asc
Description: This is a digitally signed message part
--- End Message ---