Software levels: Red Hat 6.0, kernel 2.2.5-15, raidtools-0.90

I reported the following kernel errors to the list late last week; I got them on one of my production email POP servers (each of my five servers has ~7000 user POP mailboxes). Late yesterday, I got the exact same RAID-1 failure on a different server. I have yet to fix the first server, as I have to wait until this Saturday to schedule a four-hour outage to attempt the repair.

Does anybody have a clue about what is causing this, how to fix the RAID (do I have to tar it up, rebuild the RAID from scratch, and restore the data?), and what can be done to prevent this kind of failure?

Btw, even though the RAID-1 has failed, in both cases it appears to be reading and writing correctly to the first disk in the array. This is not a hardware problem.

Kernel errors follow:

Nov 8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov 8 14:03:27 pop-3 kernel: 08:51: rw=0, want=892027448, limit=8956206
Nov 8 14:03:27 pop-3 kernel: raid1: Disk failure on sdf1, disabling device.
Nov 8 14:03:27 pop-3 kernel: Operation continuing on 1 devices
Nov 8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov 8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov 8 14:03:27 pop-3 kernel: 08:41: rw=0, want=1295592304, limit=8956206
Nov 8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov 8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 1295592303
Nov 8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov 8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov 8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov 8 14:03:27 pop-3 kernel: dirty sb detected, updating.
Nov 8 14:03:27 pop-3 kernel: md: updating md1 RAID superblock on device
Nov 8 14:03:27 pop-3 kernel: (skipping faulty sdf1 )
Nov 8 14:03:27 pop-3 kernel: (skipping faulty sde1 )
Nov 8 14:03:27 pop-3 kernel: .
Nov 8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 1295592303
Nov 8 14:03:27 pop-3 kernel: raid1: md1: redirecting sector 892027447 to another mirror
Nov 8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov 8 14:03:27 pop-3 kernel: 08:41: rw=0, want=892027448, limit=8956206
Nov 8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov 8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov 8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 892027447
Nov 8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov 8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov 8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov 8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov 8 14:03:27 pop-3 kernel: 08:41: rw=0, want=892027448, limit=8956206
Nov 8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov 8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov 8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov 8 14:03:27 pop-3 kernel: 08:41: rw=0, want=1295592304, limit=8956206
Nov 8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov 8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 1295592303
Nov 8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov 8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov 8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov 8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 1295592303
Nov 8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 892027447

The disk drives involved are for our POP servers, so most likely the software error was due to a corrupt map file.
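One thing worth noticing in the log, which supports the "corrupt request, not bad disk" reading: the block numbers the kernel wants are not just slightly past the end of the device, they are roughly 100x and 144x its size. A quick sanity check on the numbers taken from the log above (plain shell arithmetic):

```shell
#!/bin/sh
# Numbers copied from the kernel log: "limit" is the device size in
# 1K blocks, "want" are the requested block numbers that triggered
# "attempt to access beyond end of device".
limit=8956206
for want in 892027448 1295592304; do
    echo "want=$want is roughly $((want / limit))x the device size"
done
```

Requests that far out of range look like corrupted block numbers handed down from above (e.g. bad filesystem metadata), not like the partition being a little too small.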
My /proc/mdstat currently looks like:

Personalities : [raid1]
read_ahead 1024 sectors
md1 : active raid1 sdf1[1](F) sde1[0](F) 8956096 blocks [2/1] [U_]
unused devices: <none>

The first thing I tried was:

    raidhotremove /dev/md1 /dev/sde1

but it said the device was busy. From looking at /proc/scsi/aic7xxx/0, I have determined that reads and writes to /dev/md1 are going to device 4, which is /dev/sde1.

Now the question is, what do I do to recover from this situation? My instinct is to tar up /dev/sde1 as a backup before I mess with the RAID, then reboot into single-user mode after removing /dev/md1 from /etc/fstab so that the RAID does not attempt to start up, and then run fsck on /dev/sde1.

Now if I restart the RAID at that point, what assures me that it will resync in the right direction -- that is, use /dev/sde1 as the master and not /dev/sdf1? Does it use the RAID superblocks to determine which way to sync up? Comments???

My /etc/raidtab looks like:

# raid-1 configuration
raiddev /dev/md1
    raid-level              1
    nr-raid-disks           2
    nr-spare-disks          0
    chunk-size              4
    persistent-superblock   1
    device                  /dev/sde1
    raid-disk               0
    device                  /dev/sdf1
    raid-disk               1

Thanks,
Kent Ziebell
__________________________________________________________________
Kent A. Ziebell                [EMAIL PROTECTED]
249 Durham Center              Iowa State University Computation Center
voice: (515) 294-9607          Ames, Iowa 50011
fax: (515) 294-1717
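P.S. For concreteness, here is a dry-run sketch of the sequence I have in mind, using the raidtools-0.90 commands. Every command only echoes itself (drop the wrapper to run for real); the backup path and mount point are placeholders, and whether raidstart will accept the array with both members flagged faulty is exactly my open question -- treat this as a sketch, not a procedure:

```shell
#!/bin/sh
# Dry-run sketch of the planned recovery. The run() wrapper only prints
# each command so nothing destructive happens; remove it once the steps
# are agreed on. /backup/md1.tar.gz and /mnt/md1 are placeholders.
run() { echo "WOULD RUN: $*"; }

run tar czf /backup/md1.tar.gz -C /mnt/md1 .   # backup while md1 still reads
run raidstop /dev/md1                          # stop the degraded array
run fsck -f /dev/sde1                          # check the surviving half
run raidstart /dev/md1                         # bring the array back up
run raidhotadd /dev/md1 /dev/sdf1              # re-add sdf1 as a fresh member
```

On the resync-direction worry: as I understand it, raidhotadd puts sdf1 in as a newly added member, and reconstruction copies onto the newly added disk, so the sync should go sde1 -> sdf1 -- but I would like someone to confirm that before I try it.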