Hi,

I've recently installed a new server running Linux 2.6.8 with software
RAID5 on 3 SATA HDDs.

The relevant specs:
Motherboard: Asus P4P800-E Deluxe
CPU: P4 3 GHz
SATA controllers:
0000:00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150
Storage Controller (rev 02)
0000:02:04.0 RAID bus controller: Promise Technology, Inc. PDC20378
(FastTrak 78/SATA 378) (rev 02)

The RAID5 contains 3 identical Maxtor 6Y160M0 HDDs, each partitioned like this:
sdx1 raid1 / (1G)
sdx2 raid5 lvm (rest)
sdx3 swap (1G)

The root partition is running RAID1 with 2 active disks and one spare.
The rest is one big LVM volume group with /usr, /home and /var running
on LVM logical volumes.

There is also a CD-ROM drive on hdc (regular ATA controller).

The server is running Debian sarge (testing) / Linux 2.6.8-686-smp.

The server ran fine for about 3 weeks, but since then it has locked up
about once a week. Since it is colocated, I was not able to see the
exact error message, but a reboot brought it back up each time.

After the 3rd or 4th crash I was able to get part of the error
message from the colocation support and figured that one of the disks,
sdb, was probably what was causing the problem.

So we dropped this disk from the arrays and kept running with degraded
arrays for about 10 days. During this period the server ran without any
crashes.

Yesterday I went to replace the bad disk with a new one. While syncing
the disks I got this error message:
ata2: command 0x35 timeout, stat 0xd0 host_stat 0x20
scsi1: ERROR on channel 0, id 0, lun 0, CDB: Write (10) 00 00 10 05 0f
00 0 08 00
current sdb: sense key Medium Error
Additional sense: Write error - auto reallocation failed
end_request: I/O error, dev sdb, sector 1049871
ATA: abnormal status 0xD0 on port 0XEFA7

Of course, it also locked up the machine. I tried to reboot a few times
and got the same error (more or less) every time. I tried taking out
the new disk, but still couldn't bring the array back up: somehow mdadm
got confused and marked two disks as dirty.
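
For reference, the error log above can be partly decoded by hand. The
following is only a minimal sketch: the helper names are hypothetical,
the bit names are the classic ATA status register names, and it assumes
the CDB bytes printed after "Write (10)" start at the flags byte (i.e.
the opcode itself is not printed). Command 0x35 is WRITE DMA EXT.

```python
# Classic ATA status register bit names (high bit first).
ATA_STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data (obsolete in later specs)
    0x02: "IDX",   # index (obsolete in later specs)
    0x01: "ERR",   # error
}


def decode_ata_status(stat):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in ATA_STATUS_BITS.items() if stat & bit]


def write10_lba(cdb_after_opcode):
    """Extract the big-endian LBA from a WRITE(10) CDB.

    Assumption: the list starts at the flags byte, so the four LBA
    bytes are at positions 1..4.
    """
    return int.from_bytes(bytes(cdb_after_opcode[1:5]), "big")


# "stat 0xd0" from the ata2 timeout line.
print(decode_ata_status(0xD0))

# First five bytes printed after "Write (10)" in the scsi1 line.
print(write10_lba([0x00, 0x00, 0x10, 0x05, 0x0F]))
```

Status 0xd0 decodes to BSY, DRDY and DSC, so the drive was still
reporting busy when the command timed out, and the LBA in the CDB works
out to 1049871, the same sector reported by end_request.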

At this stage we brought a backup server online and I took this server
back with me. Today I managed to repair the original array (using
Knoppix and mdadm assemble) and bring the server up. Then I managed
to add the "broken" disk back to the array (as a reminder, this is a
new disk that replaced the original "broken" disk) and it synced
cleanly. I tried marking the other disks as failed (one at a time) and
resyncing each, which also went cleanly.

Then I tried to run a few benchmarks using bonnie++. It ran fine for a
few hours, but then crashed again with the same error message (well,
host_stat was 0x21 instead of 0x20, and the sector was different).
Again, I rebooted a few times, each time getting the same error message.

I'm not sure what happened, but eventually I managed to bring it back
up and started resyncing. As I write this, it has been running for
30 minutes and has resynced 60% of the array.

I'm not sure what I should do next. It doesn't seem like a problem
with the disks, since I have tried a couple of different ones. I'm
pretty sure all the errors so far were on sdb, the second SATA port of
the ICH5 controller.

Has anyone else experienced similar problems with a similar setup? Any hints?

Sagi
