Software levels: Red Hat 6.0, kernel 2.2.5-15, raidtools-0.90

I reported the following kernel errors to the list late last week; they occurred
on one of my production email POP servers (each of my five servers has ~7000
user POP mailboxes).  Late yesterday, I got the exact same failure of the
RAID-1 on a different server.  I have yet to fix the first server, as I have
to wait until this Saturday to schedule a four-hour outage in which to attempt
the repair.  Does anybody have a clue about what is causing this, how to fix
the raid (do I have to tar it up, rebuild the raid from scratch, and then
restore the data?), and what can be done to prevent this kind of failure?

Btw, even though the Raid-1 has failed, it appears to be reading and writing
correctly to the first disk in the raid in both cases.  This is not a
hardware problem. 


Kernel errors follow:
Nov  8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov  8 14:03:27 pop-3 kernel: 08:51: rw=0, want=892027448, limit=8956206
Nov  8 14:03:27 pop-3 kernel: raid1: Disk failure on sdf1, disabling device.
Nov  8 14:03:27 pop-3 kernel:        Operation continuing on 1 devices
Nov  8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov  8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov  8 14:03:27 pop-3 kernel: 08:41: rw=0, want=1295592304, limit=8956206
Nov  8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov  8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 1295592303
Nov  8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov  8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov  8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov  8 14:03:27 pop-3 kernel: dirty sb detected, updating.
Nov  8 14:03:27 pop-3 kernel: md: updating md1 RAID superblock on device
Nov  8 14:03:27 pop-3 kernel: (skipping faulty sdf1 )
Nov  8 14:03:27 pop-3 kernel: (skipping faulty sde1 )
Nov  8 14:03:27 pop-3 kernel: .
Nov  8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 1295592303
Nov  8 14:03:27 pop-3 kernel: raid1: md1: redirecting sector 892027447 to another mirror
Nov  8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov  8 14:03:27 pop-3 kernel: 08:41: rw=0, want=892027448, limit=8956206
Nov  8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov  8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov  8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 892027447
Nov  8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov  8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov  8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov  8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov  8 14:03:27 pop-3 kernel: 08:41: rw=0, want=892027448, limit=8956206
Nov  8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov  8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 892027447
Nov  8 14:03:27 pop-3 kernel: attempt to access beyond end of device
Nov  8 14:03:27 pop-3 kernel: 08:41: rw=0, want=1295592304, limit=8956206
Nov  8 14:03:27 pop-3 kernel: raid1: only one disk left and IO error.
Nov  8 14:03:27 pop-3 kernel: raid1: md1: rescheduling block 1295592303
Nov  8 14:03:27 pop-3 kernel: md: recovery thread got woken up ...
Nov  8 14:03:27 pop-3 kernel: md1: no spare disk to reconstruct array! -- continuing in degraded mode
Nov  8 14:03:27 pop-3 kernel: md: recovery thread finished ...
Nov  8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 1295592303
Nov  8 14:03:27 pop-3 kernel: raid1: md1: unrecoverable I/O read error for block 892027447

The disk drives involved are for our pop servers, so most likely the software
error was due to a corrupt map file.  

My /proc/mdstat currently looks like:

Personalities : [raid1] 
read_ahead 1024 sectors
md1 : active raid1 sdf1[1](F) sde1[0](F) 8956096 blocks [2/1] [U_]
unused devices: <none>
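
Since I need to keep an eye on the other four servers for the same failure, the
[2/1] and [U_] fields can be checked mechanically.  A minimal sketch (the sample
line is copied verbatim from the /proc/mdstat output above, and the parsing
assumes the 2.2-era format shown there; on a live box the here-line would be
replaced with `grep '^md1' /proc/mdstat`):

```shell
# Parse the disk counts out of a 2.2-era /proc/mdstat md line.
# [2/1] means "2 disks configured, 1 still active"; [U_] shows which slot is up.
line='md1 : active raid1 sdf1[1](F) sde1[0](F) 8956096 blocks [2/1] [U_]'

# Pull the "total/active" pair out of the square brackets.
counts=$(echo "$line" | sed -n 's/.*\[\([0-9]*\/[0-9]*\)\].*/\1/p')
total=${counts%/*}
active=${counts#*/}

echo "configured=$total active=$active"
if [ "$active" -lt "$total" ]; then
    echo "md1 is degraded"
fi
```

Running this on the line above prints `configured=2 active=1` and flags the
array as degraded.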

The first thing I tried was:
raidhotremove /dev/md1 /dev/sde1
but it said the device was busy.

From looking at /proc/scsi/aic7xxx/0, I have determined that reads and writes
to /dev/md1 are going to device 4, which is /dev/sde1.

Now the question is, what do I do to recover from this situation?  My
instinct is to tar up /dev/sde1 as a backup before I mess with the raid,
remove /dev/md1 from /etc/fstab, and reboot into single-user mode so that
the raid does not attempt to start up.  Then run an fsck on /dev/sde1.  If I
restart the raid at that point, what ensures that it resyncs in the right
direction, i.e. uses /dev/sde1 as the master and not /dev/sdf1?  Does it use
the raid superblocks to determine which way to sync up?
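
For reference, here is the plan above written out as a dry-run script.  It only
echoes each step rather than running it; the backup path, mount point, and the
raidstart/raidhotadd sequence at the end are my assumptions about how a
raidtools-0.90 rebuild would go, not a tested procedure:

```shell
#!/bin/sh
# Dry-run outline of the proposed recovery.  Every command is printed,
# not executed; drop the "run" wrapper only after verifying each step.
run() { echo "WOULD RUN: $*"; }

# 1. Back up the surviving mirror before touching the raid.
#    (Backup path and mount point are placeholders.)
run tar czf /backup/md1-sde1.tar.gz -C /mnt/md1 .

# 2. Comment /dev/md1 out of /etc/fstab by hand, then reboot to
#    single-user mode so the array is not started automatically.

# 3. Check the surviving partition directly.
run fsck /dev/sde1

# 4. Restart the array degraded, then add the suspect disk back;
#    the intent is that the rebuild copies from the active disk
#    onto sdf1 -- verify the sync direction before relying on this.
run raidstart /dev/md1
run raidhotadd /dev/md1 /dev/sdf1
```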

Comments???

My /etc/raidtab looks like:

# raid-1 configuration
raiddev                 /dev/md1
raid-level              1
nr-raid-disks           2
nr-spare-disks          0
chunk-size              4
persistent-superblock   1

device                  /dev/sde1
raid-disk               0

device                  /dev/sdf1
raid-disk               1


Thanks,
Kent Ziebell
__________________________________________________________________
Kent A. Ziebell                              [EMAIL PROTECTED]
249 Durham Center                            
Iowa State University Computation Center     voice: (515) 294-9607
Ames, Iowa 50011                             fax:   (515) 294-1717
