Thanks for your mail, Wladimir - I think you may well be right: it's an old system, but it has had a new lease of life with Linux! I'm also going to try reducing the maximum bandwidth of the md devices to 10 MB/s to see if that helps.
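If I've understood the md tunables correctly, that should just be a matter of lowering the resync/check speed cap - a sketch using the standard knobs (values are in KB/s, so roughly 10 MB/s; the per-array path below assumes the big mirror is md2):

    # system-wide cap for all md resync/check activity (KB/s)
    echo 10000 > /proc/sys/dev/raid/speed_limit_max

    # or per array, e.g. only for the big mirror md2
    echo 10000 > /sys/block/md2/md/sync_speed_max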
I'll let you know if I find an answer, and would be very interested to hear what you find if you do try building the PIII system.

Regards,
David

>----Original Message----
>From: [EMAIL PROTECTED]
>Date: 07/04/2008 6:56
>To: <[EMAIL PROTECTED]>
>Subj: [Bug 212684] Re: RAID1 data-checks cause CPU soft lockups
>
>My systems do not freeze or get stuck. The check proceeds at normal speed (taking about 1 hour for 200 GB, 1.5 hours for 320 GB, and 2.5 hours for 500 GB) and logs its normal completion in the kernel log. It is just these soft lockups that bother me.
>Your system may simply be old and show some faults under load. Soon I'll set up md RAID1 on 2 IDE disks on a similarly old system (P3TDDE board, 2x PIII-1133 MHz) and tell you how it goes in my case.
>
>--
>RAID1 data-checks cause CPU soft lockups
>https://bugs.launchpad.net/bugs/212684
>You received this bug notification because you are a direct subscriber of the bug.
>
>Status in Source Package "linux" in Ubuntu: New
>
>Bug description:
>Binary package hint: linux-image-2.6.24-15-generic
>
>I track Hardy development packages on some of my systems.
>They share some common hardware and configuration features, in particular:
>a Pentium 4 CPU with Hyper-Threading turned on (so that 2 logical cores are visible); 1 or 2 GB RAM;
>an Intel chipset with an ICH5/6/7[R] SATA controller, with RAID turned off;
>two SATA disks (Seagate, WD or Samsung) of 200, 320 or 500 GB each.
>On each of these disks, two 0xfd partitions (Linux RAID autodetect) are allocated: the first of 100 MB, and the second taking the rest of the disk.
>They are assembled by mdadm into RAID1 (mirrored) md arrays:
>the first, of 100 MB, holds the /boot filesystem,
>and the second, the big one, holds an LVM2 PV with a volume group containing a couple of LVs with filesystems, including the root FS (all ext3), as well as a swap LV.
>The boot loader is LILO. The systems boot and run well, providing appropriate fault tolerance for the disks when needed.
>
>However, the mdadm package provides a 'checkarray' script which is run by cron on the first Sunday of each month to check RAID array integrity.
>The action performed by the script is in effect
>'echo check > $i' for i in /sys/block/*/md/sync_action
>This integrity check gives the following messages in the kernel log:
>
>Apr 6 01:06:02 hostname kernel: [ 9859.807932] md: data-check of RAID array md0
>Apr 6 01:06:02 hostname kernel: [ 9859.808090] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>Apr 6 01:06:02 hostname kernel: [ 9859.808222] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>Apr 6 01:06:02 hostname kernel: [ 9859.808422] md: using 128k window, over a total of 104320 blocks.
>Apr 6 01:06:02 hostname kernel: [ 9859.886364] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
>Apr 6 01:06:04 hostname kernel: [ 9862.098900] md: md0: data-check done.
>Apr 6 01:06:04 hostname kernel: [ 9862.137205] md: data-check of RAID array md2
>Apr 6 01:06:04 hostname kernel: [ 9862.137238] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>Apr 6 01:06:04 hostname kernel: [ 9862.137272] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>Apr 6 01:06:04 hostname kernel: [ 9862.137327] md: using 128k window, over a total of 312464128 blocks.
>Apr 6 01:06:04 hostname kernel: [ 9862.189968] RAID1 conf printout:
>Apr 6 01:06:04 hostname kernel: [ 9862.190003] --- wd:2 rd:2
>Apr 6 01:06:04 hostname kernel: [ 9862.190035] disk 0, wo:0, o:1, dev:sdb1
>Apr 6 01:06:04 hostname kernel: [ 9862.190062] disk 1, wo:0, o:1, dev:sda1
>
> ... 13 seconds later:
>
>Apr 6 01:06:17 hostname kernel: [ 9875.118427] BUG: soft lockup - CPU#0 stuck for 11s! [md2_raid1:2378]
>Apr 6 01:06:17 hostname kernel: [ 9875.118581]
>Apr 6 01:06:17 hostname kernel: [ 9875.118671] Pid: 2378, comm: md2_raid1 Not tainted (2.6.24-15-generic #1)
>Apr 6 01:06:17 hostname kernel: [ 9875.118811] EIP: 0060: [<f887c9b0>] EFLAGS: 00010282 CPU: 0
>Apr 6 01:06:17 hostname kernel: [ 9875.119048] EIP is at raid1d+0x770/0xff0 [raid1]
>Apr 6 01:06:17 hostname kernel: [ 9875.119159] EAX: e7ffb000 EBX: c14fff60 ECX: 00000f24 EDX: f4b41800
>Apr 6 01:06:17 hostname kernel: [ 9875.119284] ESI: e7ffb0dc EDI: e807e0dc EBP: df92fe40 ESP: f7495e9c
>Apr 6 01:06:17 hostname kernel: [ 9875.119448] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>Apr 6 01:06:17 hostname kernel: [ 9875.119567] CR0: 8005003b CR2: b7f32480 CR3: 374de000 CR4: 000006d0
>Apr 6 01:06:17 hostname kernel: [ 9875.119696] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>Apr 6 01:06:17 hostname kernel: [ 9875.119833] DR6: ffff0ff0 DR7: 00000400
>Apr 6 01:06:17 hostname kernel: [ 9875.120537] [jbd: schedule+0x20a/0x650] schedule+0x20a/0x600
>Apr 6 01:06:17 hostname kernel: [ 9875.121051] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.121319] [shpchp: schedule_timeout+0x76/0x2d0] schedule_timeout+0x76/0xd0
>Apr 6 01:06:17 hostname kernel: [ 9875.121494] [apic_timer_interrupt+0x28/0x30] apic_timer_interrupt+0x28/0x30
>Apr 6 01:06:17 hostname kernel: [ 9875.121751] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122050] [<f887007b>] mirror_status+0x19b/0x250 [dm_mirror]
>Apr 6 01:06:17 hostname kernel: [ 9875.122333] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122590] [<f88b5ef3>] md_thread+0x23/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122854] [<c0141b70>] autoremove_wake_function+0x0/0x40
>Apr 6 01:06:17 hostname kernel: [ 9875.123122] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.123356] [kthread+0x42/0x70] kthread+0x42/0x70
>Apr 6 01:06:17 hostname kernel: [ 9875.123497] [kthread+0x0/0x70] kthread+0x0/0x70
>Apr 6 01:06:17 hostname kernel: [ 9875.123673] [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10
>Apr 6 01:06:17 hostname kernel: [ 9875.123937] =======================
>
>... Then these soft lockups repeat roughly every 13-20 seconds. Their stacks are not identical, but they share a common spot in the 'raid1d' function/thread. The most frequent offset is raid1d+0x770; sometimes it is +0x17b or a nearby value. Here is a sample distribution:
>
> 1 raid1d+0x174/0xff0
> 5 raid1d+0x17b/0xff0
> 1 raid1d+0x18d/0xff0
> 1 raid1d+0x669/0xff0
> 1 raid1d+0x75f/0xff0
> 86 raid1d+0x770/0xff0
> 1 raid1d+0x772/0xff0
>
>During the check of these two RAID1 arrays, 100 MB and 320 GB in total, the lockups happened more than 90 times. In wall-clock time, the check took 1 hour 37 minutes.
>
>I don't know whether I should ignore these lockups or ask you, dear maintainers, to look into the problem. I just decided to inform you. I'll provide any further details on request.
>I just listed the common traits of my three systems (with identical sets of Hardy packages) where these lockups are reproducible.
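For reference, if you want to kick off a check by hand rather than wait for the monthly cron job, this is roughly what checkarray boils down to (a sketch; md2 below is just the array name from your logs):

    # start a data-check on every md array (what checkarray effectively does)
    for f in /sys/block/*/md/sync_action; do
        echo check > "$f"
    done

    # watch progress
    watch -n 5 cat /proc/mdstat

    # abort a running check on one array if it misbehaves
    echo idle > /sys/block/md2/md/sync_action

md itself serializes checks of arrays that share physical disks, which is the "delaying data-check of md2 until md0 has finished" message in your log.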