Dear all,

We're running into a problem with a RAID1 configuration on our system. I don't know whether the fault lies with the disks or with the RAID/DMA code on the i810. We'd like to know: can RAID1 really fail to protect against a single disk failure and end up corrupting our partition, or is something wrong with our configuration? We're still deciding whether to run Linux software RAID1 for better protection against failures.

Please help us! Thanks!

OS & software tools:
-------------------
The system runs Debian 2.2 (kernel 2.2.17 with RAID patch 2.2.17-A0).
The RAID tools are Debian's package: raidtools2 v0.90.990824-5

The harddisks on this system are:
--------------------------------
hda1: Quantum Sirocco 2.5G ATA4 (
hdb1, hdb2, hdb3: IBM Deskstar DTLA-307045 UDMA (30G+14G+700M)
hdc1, hdc2, hdc3: IBM Deskstar DTLA-307045 UDMA (30G+14G+700M)

md0 = hdb1 + hdc1
md1 = hdb2 + hdc2
md2 = hdb3 + hdc3

[1st story]

First, we built the RAID set on a K6-2 350MHz, ASUS P5A system and put our users' 6G of data on it. However, performance was very slow, because 2.2.17 treated the ALi controller as an unknown controller and turned off DMA.

We then switched to a Celeron 433MHz, ASUS i810 motherboard system. There, the kernel found the PIIX3 controller on this i810.
Everything ran fine until these errors were reported:

<start kern.log>
Mar 19 02:55:08 tux1 kernel: hdc: timeout waiting for DMA
Mar 19 02:55:08 tux1 kernel: hdc: irq timeout: status=0x50 { DriveReady SeekComplete }
Mar 19 04:14:19 tux1 kernel: hdb: timeout waiting for DMA
Mar 19 04:14:19 tux1 kernel: hdb: irq timeout: status=0xd0 { Busy }
Mar 19 04:14:19 tux1 kernel: hda: DMA disabled
Mar 19 04:14:19 tux1 kernel: hdb: DMA disabled
Mar 19 04:14:19 tux1 kernel: ide0: unexpected interrupt, status=0xd0, count=1
Mar 19 04:14:19 tux1 kernel: ide0: reset: success
Mar 19 04:14:30 tux1 kernel: hdb: lost interrupt
Mar 19 04:14:30 tux1 kernel: hdb: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar 19 04:14:30 tux1 kernel: hdb: drive not ready for command
Mar 19 04:14:40 tux1 kernel: hdb: lost interrupt
Mar 19 04:14:40 tux1 kernel: hdb: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar 19 04:14:40 tux1 kernel: hdb: drive not ready for command
Mar 19 04:14:50 tux1 kernel: hdb: lost interrupt
Mar 19 04:14:50 tux1 kernel: hdb: status error: status=0x58 { DriveReady SeekComplete DataRequest }
Mar 19 04:14:50 tux1 kernel: hdb: drive not ready for command
Mar 19 04:15:00 tux1 kernel: hdb: lost interrupt
Mar 19 04:15:00 tux1 kernel: hdb: status error: status=0x58 { DriveReady SeekComplete DataRequest }
<end kern.log>

We rebooted, and then the ext2fs on the md0 partition died! We lost the 6G of data we had put on the drive!

[2nd story]

Wary of the i810, we moved the disks back to the slow K6-2 350MHz, ASUS P5A system, remade the RAID partition, and put the 6G of data back.
It ran fine again for another 12 hours, until:

<start kern.log>
Mar 19 15:28:49 tux1 kernel: hdb: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
Mar 19 15:28:49 tux1 kernel: hdb: read_intr: error=0x40 { UncorrectableError }, LBAsect=58285004, sector=58284941
Mar 19 15:28:49 tux1 kernel: end_request: I/O error, dev 03:41 (hdb), sector 58284941
Mar 19 15:28:49 tux1 kernel: interrupting MD-thread pid 46
Mar 19 15:28:49 tux1 kernel: raid1: mirror resync was not fully finished, restarting next time.
Mar 19 15:28:49 tux1 kernel: raid1: Disk failure on hdb1, disabling device.
Mar 19 15:28:49 tux1 kernel: Operation continuing on 1 devices
Mar 19 15:28:49 tux1 kernel: raid1: md0: rescheduling block 7285617
Mar 19 15:28:49 tux1 kernel: md: recovery thread got woken up ...
Mar 19 15:28:49 tux1 kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
Mar 19 15:28:49 tux1 kernel: md: recovery thread finished ...
Mar 19 15:28:49 tux1 kernel: dirty sb detected, updating.
Mar 19 15:28:49 tux1 kernel: md: updating md0 RAID superblock on device
Mar 19 15:28:49 tux1 kernel: hdc1 [events: 00000014](write) hdc1's sb offset: 30009280
Mar 19 15:28:49 tux1 kernel: (skipping faulty hdb1 )
Mar 19 15:28:49 tux1 kernel: .
Mar 19 15:28:49 tux1 kernel: raid1: md0: redirecting sector 7285617 to another mirror
Mar 19 15:28:49 tux1 kernel: md_do_sync() got signal ... exiting
Mar 19 15:28:49 tux1 kernel: md: syncing RAID array md1
Mar 19 15:28:49 tux1 kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec.
Mar 19 15:28:49 tux1 kernel: md: using maximum available idle IO bandwith for reconstruction.
Mar 19 15:28:49 tux1 kernel: md: using 128k window.
Mar 19 15:28:49 tux1 kernel: md: serializing resync, md2 has overlapping physical units with md1!
Mar 19 18:39:10 tux1 kernel: md: md1: sync done.
Mar 19 18:39:10 tux1 kernel: md: syncing RAID array md2
Mar 19 18:39:10 tux1 kernel: md: minimum _guaranteed_ reconstruction speed: 100 KB/sec.
Mar 19 18:39:10 tux1 kernel: md: using maximum available idle IO bandwith for reconstruction.
Mar 19 18:39:10 tux1 kernel: md: using 128k window.
Mar 19 18:49:29 tux1 kernel: md: md2: sync done.
<end kern.log>

We know that hdb died here, but the mirror seemed to work and the system kept running on hdc1 alone. We then found further errors on hdc1:

<start kern.log>
Mar 20 10:25:33 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_blocks: bit already cleared for block 5964380
Mar 20 10:25:34 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_blocks: bit already cleared for block 5964383
Mar 20 10:25:34 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_inode: bit already cleared for inode 2981899
Mar 20 10:25:34 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_blocks: bit already cleared for block 6816380
Mar 20 10:25:34 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_inode: bit already cleared for inode 3407889
Mar 20 10:25:34 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_free_blocks: bit already cleared for block 6816351
Mar 20 10:25:34 tux1 last message repeated 11 times
Mar 20 16:56:21 tux1 kernel: EXT2-fs warning (device md(9,0)): empty_dir: bad directory (dir #3162259) - no `.' or `..'
Mar 20 16:56:21 tux1 kernel: EXT2-fs warning (device md(9,0)): ext2_rmdir: empty directory has nlink!=2 (4)
Mar 20 17:52:09 tux1 kernel: EXT2-fs error (device md(9,0)): ext2_readdir: bad entry in directory #2752659: rec_len %% 4 != 0 - offset=0, inode=2884604103, rec_len=28186, name_len=27
Mar 20 17:52:09 tux1 kernel: Remounting filesystem read-only
Mar 20 17:52:09 tux1 kernel: EXT2-fs error (device md(9,0)): ext2_readdir: bad entry in directory #2752659: rec_len %% 4 != 0 - offset=0, inode=2884604103, rec_len=28186, name_len=27
Mar 20 17:52:09 tux1 kernel: Remounting filesystem read-only
Mar 20 17:52:09 tux1 kernel: EXT2-fs error (device md(9,0)): ext2_readdir: bad entry in directory #2752659: rec_len %% 4 != 0 - offset=0, inode=2884604103, rec_len=28186, name_len=27
<end kern.log>

The story ends here. This happened before our daily backup window (00:00), so we made an emergency backup from the read-only partition. We restored the data to hdc1. However, some of our users report that they lost files from that day, and that some files they had deleted have reappeared!

When hdb1 died, hdc1 was able to keep running under this RAID1 config, yet it still led to a corrupted partition?!

We have stopped using RAID until we understand what happened. :(

/etc/raidtab:
------------
# Sample raid-1 configuration
raiddev /dev/md0
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
chunk-size 32
device /dev/hdb1
raid-disk 0
device /dev/hdc1
raid-disk 1

# raid drive 2
raiddev /dev/md1
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
chunk-size 32
device /dev/hdb2
raid-disk 0
device /dev/hdc2
raid-disk 1

# raid drive 3
raiddev /dev/md2
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
chunk-size 32
device /dev/hdb3
raid-disk 0
device /dev/hdc3
raid-disk 1
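
For anyone comparing notes on logs like the ones quoted above, here is a rough sketch of how the disk-related kernel messages could be tallied per drive. It assumes syslog-style lines with a "kernel: " prefix. The function name `summarize_disk_events`, the category labels, and the choice of which messages to match are illustrative assumptions, not part of any RAID tool; the sample lines are copied from the logs in this mail.

```python
import re
from collections import Counter

# Substrings taken from the kernel messages quoted in this mail.
# The category names on the left are illustrative labels, not kernel terms.
PATTERNS = {
    "dma_timeout": "timeout waiting for DMA",
    "lost_interrupt": "lost interrupt",
    "uncorrectable": "UncorrectableError",
    "raid_disk_failure": "Disk failure on",
}

# Matches an IDE drive or partition name such as hdb or hdb1,
# capturing only the drive part (hdb).
DEVICE_RE = re.compile(r"\b(hd[a-z])\d*\b")


def summarize_disk_events(lines):
    """Count (drive, category) pairs found in syslog-style kernel lines."""
    counts = Counter()
    for line in lines:
        # Drop the "Mon DD HH:MM:SS host kernel: " prefix if present.
        _, _, message = line.partition("kernel: ")
        message = message or line
        device = DEVICE_RE.search(message)
        if device is None:
            continue  # no drive named in this message
        for category, needle in PATTERNS.items():
            if needle in message:
                counts[(device.group(1), category)] += 1
    return counts


# A few lines copied from the kern.log excerpts in this mail.
SAMPLE = [
    "Mar 19 02:55:08 tux1 kernel: hdc: timeout waiting for DMA",
    "Mar 19 04:14:19 tux1 kernel: hdb: timeout waiting for DMA",
    "Mar 19 04:14:30 tux1 kernel: hdb: lost interrupt",
    "Mar 19 04:14:40 tux1 kernel: hdb: lost interrupt",
    "Mar 19 15:28:49 tux1 kernel: hdb: read_intr: error=0x40 "
    "{ UncorrectableError }, LBAsect=58285004, sector=58284941",
    "Mar 19 15:28:49 tux1 kernel: raid1: Disk failure on hdb1, disabling device.",
]

if __name__ == "__main__":
    for (drive, category), n in sorted(summarize_disk_events(SAMPLE).items()):
        print(f"{drive}: {category} x{n}")
```

A tally like this only shows which drive the kernel complained about and how often; it does not by itself say whether the drive, the cable, or the controller is at fault.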