Thanks for your mail, Wladimir - I think you may well be right: it's an old system, but it has had a new lease of life with Linux! I'm also going to try reducing the maximum bandwidth of the md devices to 10 MB/s to see if that helps.
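If I've understood the md tunables correctly, that should just be a matter of lowering the resync/check speed cap - a sketch using the standard knobs (values are in KB/s, so roughly 10 MB/s; the per-array path below assumes the big mirror is md2):

    # system-wide cap for all md resync/check activity (KB/s)
    echo 10000 > /proc/sys/dev/raid/speed_limit_max

    # or per array, e.g. only for the big mirror md2
    echo 10000 > /sys/block/md2/md/sync_speed_max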
I'll let you know if I find an answer, and would be very interested to hear what you find if you do try building the PIII system.

Regards,
David

>----Original Message----
>From: [EMAIL PROTECTED]
>Date: 07/04/2008 6:56
>To: <[EMAIL PROTECTED]>
>Subj: [Bug 212684] Re: RAID1 data-checks cause CPU soft lockups
>
>My systems do not freeze or get stuck. The check proceeds at normal speed (taking about 1 hour for 200 GB, 1.5 hours for 320 GB, and 2.5 hours for 500 GB) and logs its normal completion in the kernel log. It is just these soft lockups that bother me.
>Your system may simply be old and show some faults under load. Soon I'll set up md RAID1 on 2 IDE disks on a similarly old system (P3TDDE board, 2x PIII-1133 MHz) and tell you how it goes in my case.
>
>--
>RAID1 data-checks cause CPU soft lockups
>https://bugs.launchpad.net/bugs/212684
>You received this bug notification because you are a direct subscriber of the bug.
>
>Status in Source Package "linux" in Ubuntu: New
>
>Bug description:
>Binary package hint: linux-image-2.6.24-15-generic
>
>I track Hardy development packages on some of my systems.
>They share some common hardware and configuration features, in particular:
>a Pentium 4 CPU with Hyper-Threading turned on (so that 2 logical cores are visible); 1 or 2 GB RAM;
>an Intel chipset with an ICH5/6/7[R] SATA controller, with RAID turned off;
>two SATA disks (Seagate, WD or Samsung) of 200, 320 or 500 GB each.
>On each of these disks, two 0xfd partitions (Linux RAID autodetect) are allocated: the first of 100 MB, and the second taking the rest of the disk.
>They are assembled by mdadm into RAID1 (mirrored) md arrays:
>the first, of 100 MB, holds the /boot filesystem,
>and the second, the big one, holds an LVM2 PV with a volume group containing a couple of LVs with filesystems, including the root FS (all ext3), as well as a swap LV.
>The boot loader is LILO. The systems boot and run well, providing appropriate fault tolerance for the disks when needed.
>
>However, the mdadm package provides a 'checkarray' script which is run by cron on the first Sunday of each month to check RAID array integrity.
>The action performed by the script is in effect
>'echo check > $i' for i in /sys/block/*/md/sync_action
>This integrity check gives the following messages in the kernel log:
>
>Apr 6 01:06:02 hostname kernel: [ 9859.807932] md: data-check of RAID array md0
>Apr 6 01:06:02 hostname kernel: [ 9859.808090] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>Apr 6 01:06:02 hostname kernel: [ 9859.808222] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>Apr 6 01:06:02 hostname kernel: [ 9859.808422] md: using 128k window, over a total of 104320 blocks.
>Apr 6 01:06:02 hostname kernel: [ 9859.886364] md: delaying data-check of md2 until md0 has finished (they share one or more physical units)
>Apr 6 01:06:04 hostname kernel: [ 9862.098900] md: md0: data-check done.
>Apr 6 01:06:04 hostname kernel: [ 9862.137205] md: data-check of RAID array md2
>Apr 6 01:06:04 hostname kernel: [ 9862.137238] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
>Apr 6 01:06:04 hostname kernel: [ 9862.137272] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
>Apr 6 01:06:04 hostname kernel: [ 9862.137327] md: using 128k window, over a total of 312464128 blocks.
>Apr 6 01:06:04 hostname kernel: [ 9862.189968] RAID1 conf printout:
>Apr 6 01:06:04 hostname kernel: [ 9862.190003] --- wd:2 rd:2
>Apr 6 01:06:04 hostname kernel: [ 9862.190035] disk 0, wo:0, o:1, dev:sdb1
>Apr 6 01:06:04 hostname kernel: [ 9862.190062] disk 1, wo:0, o:1, dev:sda1
>
> ... 13 seconds later:
>
>Apr 6 01:06:17 hostname kernel: [ 9875.118427] BUG: soft lockup - CPU#0 stuck for 11s! [md2_raid1:2378]
>Apr 6 01:06:17 hostname kernel: [ 9875.118581]
>Apr 6 01:06:17 hostname kernel: [ 9875.118671] Pid: 2378, comm: md2_raid1 Not tainted (2.6.24-15-generic #1)
>Apr 6 01:06:17 hostname kernel: [ 9875.118811] EIP: 0060: [<f887c9b0>] EFLAGS: 00010282 CPU: 0
>Apr 6 01:06:17 hostname kernel: [ 9875.119048] EIP is at raid1d+0x770/0xff0 [raid1]
>Apr 6 01:06:17 hostname kernel: [ 9875.119159] EAX: e7ffb000 EBX: c14fff60 ECX: 00000f24 EDX: f4b41800
>Apr 6 01:06:17 hostname kernel: [ 9875.119284] ESI: e7ffb0dc EDI: e807e0dc EBP: df92fe40 ESP: f7495e9c
>Apr 6 01:06:17 hostname kernel: [ 9875.119448] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
>Apr 6 01:06:17 hostname kernel: [ 9875.119567] CR0: 8005003b CR2: b7f32480 CR3: 374de000 CR4: 000006d0
>Apr 6 01:06:17 hostname kernel: [ 9875.119696] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
>Apr 6 01:06:17 hostname kernel: [ 9875.119833] DR6: ffff0ff0 DR7: 00000400
>Apr 6 01:06:17 hostname kernel: [ 9875.120537] [jbd: schedule+0x20a/0x650] schedule+0x20a/0x600
>Apr 6 01:06:17 hostname kernel: [ 9875.121051] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.121319] [shpchp: schedule_timeout+0x76/0x2d0] schedule_timeout+0x76/0xd0
>Apr 6 01:06:17 hostname kernel: [ 9875.121494] [apic_timer_interrupt+0x28/0x30] apic_timer_interrupt+0x28/0x30
>Apr 6 01:06:17 hostname kernel: [ 9875.121751] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122050] [<f887007b>] mirror_status+0x19b/0x250 [dm_mirror]
>Apr 6 01:06:17 hostname kernel: [ 9875.122333] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122590] [<f88b5ef3>] md_thread+0x23/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.122854] [<c0141b70>] autoremove_wake_function+0x0/0x40
>Apr 6 01:06:17 hostname kernel: [ 9875.123122] [<f88b5ed0>] md_thread+0x0/0xe0 [md_mod]
>Apr 6 01:06:17 hostname kernel: [ 9875.123356] [kthread+0x42/0x70] kthread+0x42/0x70
>Apr 6 01:06:17 hostname kernel: [ 9875.123497] [kthread+0x0/0x70] kthread+0x0/0x70
>Apr 6 01:06:17 hostname kernel: [ 9875.123673] [kernel_thread_helper+0x7/0x10] kernel_thread_helper+0x7/0x10
>Apr 6 01:06:17 hostname kernel: [ 9875.123937] =======================
>
>... Then these soft lockups repeat roughly every 13-20 seconds. Their stacks are not identical, but they share a common spot in the 'raid1d' function/thread. The most frequent offset is raid1d+0x770; sometimes it is +0x17b or a nearby value. Here is a sample distribution:
>
> 1 raid1d+0x174/0xff0
> 5 raid1d+0x17b/0xff0
> 1 raid1d+0x18d/0xff0
> 1 raid1d+0x669/0xff0
> 1 raid1d+0x75f/0xff0
> 86 raid1d+0x770/0xff0
> 1 raid1d+0x772/0xff0
>
>During the check of these two RAID1 arrays, 100 MB and 320 GB in total, the lockups happened more than 90 times. In wall-clock time, the check took 1 hour 37 minutes.
>
>I don't know whether I should ignore these lockups or ask you, dear maintainers, to look into the problem. I just decided to inform you. I'll provide any further details on request.
>I just listed the common traits of my three systems (with identical sets of Hardy packages) where these lockups are reproducible.
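For reference, if you want to kick off a check by hand rather than wait for the monthly cron job, this is roughly what checkarray boils down to (a sketch; md2 below is just the array name from your logs):

    # start a data-check on every md array (what checkarray effectively does)
    for f in /sys/block/*/md/sync_action; do
        echo check > "$f"
    done

    # watch progress
    watch -n 5 cat /proc/mdstat

    # abort a running check on one array if it misbehaves
    echo idle > /sys/block/md2/md/sync_action

md itself serializes checks of arrays that share physical disks, which is the "delaying data-check of md2 until md0 has finished" message in your log.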