Hi, I've recently installed a new server running Linux 2.6.8 with software RAID5 on 3 SATA HDDs.
The relevant specs:

  Motherboard: Asus P4P800-E Deluxe
  CPU: P4 3GHz
  SATA controllers:
    0000:00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage Controller (rev 02)
    0000:02:04.0 RAID bus controller: Promise Technology, Inc. PDC20378 (FastTrak 78/SATA 378) (rev 02)

The RAID5 contains 3 identical Maxtor 6Y160M0 HDDs, each partitioned like this:

  sdx1  raid1  /    (1G)
  sdx2  raid5  lvm  (rest)
  sdx3  swap        (1G)

The root partition is running RAID1 with 2 active disks and one spare. The rest is one big LVM volume, with /usr, /home and /var on logical LVM volumes. There is also a CDROM on hdc (normal ATA controller). The server is running Debian sarge (testing) / linux 2.6.8-686-smp.

The server ran fine for about 3 weeks, but has been locking up once a week or so. Since it is colocated I was not able to see the exact error message, but a reboot brought it back up. After the 3rd or 4th crash I was able to get part of the error message from the colocation support and figured that one of the disks, sdb, was probably what was causing the problem. So we dropped this disk from the arrays and kept running with a degraded array for about 10 days. During this period the server ran without any crashes.

Yesterday I went to replace the bad disk with a new one. While syncing the disks I got this error message:

  ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
  scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 00 10 05 0f 00 0 08 00
  current sdb: sense key Medium Error
  Additional sense: Write error - auto reallocation failed
  end_request: I/O error, dev sdb, sector 1049871
  ATA: abnormal status 0xD0 on port 0XEFA7

Of course, it also locked up the machine. I tried to reboot a few times and got the same error (more or less) every time. I tried taking out the new disk, but still couldn't bring the array back up: somehow mdadm got confused about the array and marked two disks as dirty. At this stage we brought a backup server online and I took this server back with me.
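As a side note, some of the numbers in the error message above can be decoded by hand. Here is a minimal sketch in Python, assuming the standard ATA status register bit layout and the standard Write (10) CDB layout, and assuming the kernel prints the CDB bytes that follow the opcode. The function names are my own, purely for illustration.

```python
# Standard ATA status register bits (bit 4 is DSC, "seek complete").
ATA_STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}


def decode_ata_status(stat):
    """Return the names of the bits set in an ATA status byte, high bit first."""
    return [name for mask, name in ATA_STATUS_BITS.items() if stat & mask]


def write10_lba(cdb_tail):
    """Extract the LBA from the bytes printed after the Write (10) opcode.

    Assumed layout: one flags byte, then four big-endian LBA bytes.
    """
    return int.from_bytes(bytes(cdb_tail[1:5]), "big")


# Values copied from the error message in this post.
print(decode_ata_status(0xD0))                       # ['BSY', 'DRDY', 'DSC']
print(write10_lba([0x00, 0x00, 0x10, 0x05, 0x0F]))   # 1049871
```

The decoded LBA matches the sector reported in the end_request line, so both lines refer to the same failed write. The status 0xD0 has BSY set, which as far as I understand means the drive was still busy when the timeout fired, and the other status bits are not reliable while BSY is set.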
Today I managed to repair the original array (using Knoppix and mdadm assemble) and bring the server back up. Then I managed to add the "broken" disk back to the array (to recap: this is the new disk that replaced the other "broken" disk) and it synced cleanly. I tried marking the other disks as failed (one at a time) and resyncing each, which also went cleanly.

Then I tried to run a few benchmarks using bonnie++. It ran fine for a few hours, but then crashed again with the same error message (well, host_stat was 0x21 instead of 0x20, and the sector was different). Again I rebooted a few times, each time getting the same error message. I'm not sure what happened, but eventually I managed to bring it back up and started resyncing. As I'm writing this, it has already been running for 30 minutes and has resynced 60% of the array.

I'm not sure what I should do next. It doesn't seem to be a problem with the disks themselves, since I've tried a couple of different ones. I'm pretty sure all the errors so far were on sdb, the second SATA port of the ICH5 controller.

Has anyone else had experience with a similar setup or similar problems? Any hints?

Sagi