Raid bug, or is it my hardware?

Joakim Ahlen Thu, 30 Sep 1999 07:27:05 -0700
Hi!

I run a linux box as a server with 2.2.5-15 kernel, and some raid0145
patch (don't remember which). During the past 3 months of (almost
flawless) operation, the server has kicked out one of the raid-arrays a
couple of times. This is the log for the event:

(more description of my error after the logfile)

-----------------------------------------------------------------
Sep 30 14:17:18 opi kernel: attempt to access beyond end of device
Sep 30 14:17:18 opi kernel: 08:61: rw=0, want=92319788, limit=8956206
Sep 30 14:17:18 opi kernel: raid5: Disk failure on sdg1, disabling device.
Operation continuing on 3 devices
Sep 30 14:17:18 opi kernel: raid5: restarting stripe 184639568
Sep 30 14:17:18 opi kernel: attempt to access beyond end of device
Sep 30 14:17:18 opi kernel: 08:51: rw=0, want=92319788, limit=8956206
Sep 30 14:17:18 opi kernel: raid5: Disk failure on sdf1, disabling device.
Operation continuing on 2 devices
Sep 30 14:17:18 opi kernel: attempt to access beyond end of device
Sep 30 14:17:18 opi kernel: 08:71: rw=0, want=92319788, limit=8956206
Sep 30 14:17:18 opi kernel: raid5: Disk failure on sdh1, disabling device.
Operation continuing on 1 devices
Sep 30 14:17:18 opi kernel: attempt to access beyond end of device
Sep 30 14:17:18 opi kernel: 08:81: rw=0, want=92319788, limit=8956206
Sep 30 14:17:18 opi kernel: raid5: Disk failure on sdi1, disabling device.
Operation continuing on 0 devices
Sep 30 14:17:18 opi kernel: raid5: restarting stripe 184639568
Sep 30 14:17:18 opi kernel: raid5: md1: unrecoverable I/O error for block
2216723498
Sep 30 14:17:18 opi kernel: raid5: md1: unrecoverable I/O error for block
3507241
Sep 30 14:17:18 opi kernel: md: recovery thread got woken up ...
Sep 30 14:17:18 opi kernel: md1: no spare disk to reconstruct array! --
continuing in degraded mode
Sep 30 14:17:18 opi kernel: md: recovery thread finished ...
Sep 30 14:17:18 opi kernel: md: updating md1 RAID superblock on device
Sep 30 14:17:18 opi kernel: (skipping faulty sdi1 )
Sep 30 14:17:18 opi kernel: (skipping faulty sdh1 )
Sep 30 14:17:18 opi kernel: (skipping faulty sdg1 )
Sep 30 14:17:18 opi kernel: (skipping faulty sdf1 )
Sep 30 14:17:18 opi kernel: .
Sep 30 14:17:18 opi kernel: raid5: md1: unrecoverable I/O error for block
3507241
Sep 30 14:17:18 opi kernel: raid5: md1: unrecoverable I/O error for block
2216723498
---------------------------------------------------------------------

It seems to me, that "attempt to access beyond end of device" couldn't be
a hardware error, but must depend on either the partitioning/physical SCSI
disk data, OR the raid5 code. It seems just a bit too coincidental that
all my devices get this "attempt to access..." error at the same time. The
two last lines repeats themself with other blocknumbers forever, which is
not very strange, since all the devices in the raid array have been 
disabled. 

I've considered this being a SCSI error, since I sometimes (not very
often) get SCSI timeouts for all the disks in the array. These disks are
alone on one single SCSI adapter. However, this particular error is not
preceded by such a timeout.

Has onyone encountered this error before? Too bad i don't know which
version of raid0145 i have (I didn't compile it myself), but could it be
that this error has been fixed in a later version of the raid0145 patch?

Now for the fun part... the second error, which occurred when
reconstructing the array after the error above. I'm pretty sure this error
is related to the SCSI timeout that sometimes occur, and i think (hope)
that this problem will go away when (if) i fix this SCSI error (I have a
pretty good idea of what it could be). However, i felt obliged to share
some logfile data, becase of a certain line in it that read "Sep 30
15:59:18 opi kernel: md: bug in file raid5.c, line 659" =)

Anyways, you can clearly see the timeout in the beginning of my logfile,
followed by this "bug" warning. After that, I get a printout of the
"COMPLETE RAID STATE", which could be useful if someone in here knows what
the heck is going on. Anyway, here it comes....

---------------------------------------------------------------------
Sep 30 15:59:13 opi kernel: scsi : aborting command due to timeout : pid
635186, scsi1, channel 0, id 9, lun 0 Write (10) 00 00 cc 47 bf 00 00 40
00
Sep 30 15:59:13 opi kernel: scsi : aborting command due to timeout : pid
635187, scsi1, channel 0, id 10, lun 0 Write (10) 00 00 cc 47 7f 00 00 40
00
Sep 30 15:59:13 opi kernel: scsi : aborting command due to timeout : pid
635188, scsi1, channel 0, id 11, lun 0 Write (10) 00 00 cc 47 3f 00 00 40
00
Sep 30 15:59:14 opi kernel: scsi : aborting command due to timeout : pid
635185, scsi1, channel 0, id 8, lun 0 Request Sense 00 00 00 10 00
Sep 30 15:59:15 opi kernel: SCSI host 1 abort (pid 635185) timed out -
resetting
Sep 30 15:59:15 opi kernel: SCSI bus is being reset for host 1 channel 0.
Sep 30 15:59:15 opi kernel: (scsi1:0:8:0) Performing Domain validation.
Sep 30 15:59:18 opi kernel: (scsi1:0:8:0) Successfully completed Domain
validation.
Sep 30 15:59:18 opi kernel: (scsi1:0:8:0) Synchronous at 40.0 Mbyte/sec,
offset 31.
Sep 30 15:59:18 opi kernel: (scsi1:0:8:0) Performing Domain validation.
Sep 30 15:59:18 opi kernel: (scsi1:0:9:0) Synchronous at 40.0 Mbyte/sec,
offset 31.
Sep 30 15:59:18 opi kernel: (scsi1:0:10:0) Synchronous at 40.0 Mbyte/sec,
offset 31.
Sep 30 15:59:18 opi kernel: (scsi1:0:11:0) Synchronous at 40.0 Mbyte/sec,
offset 31.
Sep 30 15:59:18 opi kernel: (scsi1:0:8:0) Successfully completed Domain
validation.
Sep 30 15:59:18 opi kernel: SCSI disk error : host 1 channel 0 id 8 lun 0
return code = 28000002
Sep 30 15:59:18 opi kernel: extra data not valid Current error sd08:51:
sense key Not Ready
Sep 30 15:59:18 opi kernel: Additional sense indicates Logical unit is in
process of becoming ready
Sep 30 15:59:18 opi kernel: scsidisk I/O error: dev 08:51, sector 13387712
Sep 30 15:59:18 opi kernel: interrupting MD-thread pid 60
Sep 30 15:59:18 opi kernel: raid5: parity resync was not fully finished,
restarting next time.
Sep 30 15:59:18 opi kernel: raid5: Disk failure on sdf1, disabling device.
Operation continuing on 3 devices
Sep 30 15:59:18 opi kernel: md: recovery thread got woken up ...
Sep 30 15:59:18 opi kernel: md1: no spare disk to reconstruct array! --
continuing in degraded mode
Sep 30 15:59:18 opi kernel: md: recovery thread finished ...
Sep 30 15:59:18 opi kernel: md: updating md1 RAID superblock on device
Sep 30 15:59:18 opi kernel: sdi1 [events: 0000002e](write) sdi1's sb
offset: 8956096
Sep 30 15:59:18 opi kernel: sdh1 [events: 0000002e](write) sdh1's sb
offset: 8956096
Sep 30 15:59:18 opi kernel: SCSI disk error : host 1 channel 0 id 8 lun 0
return code = 28000002
Sep 30 15:59:18 opi kernel: extra data not valid Current error sd08:51:
sense key Not Ready
Sep 30 15:59:18 opi kernel: Additional sense indicates Logical unit is in
process of becoming ready
Sep 30 15:59:18 opi kernel: scsidisk I/O error: dev 08:51, sector 13387714
Sep 30 15:59:18 opi kernel: md: bug in file raid5.c, line 659
Sep 30 15:59:18 opi kernel:
Sep 30 15:59:18 opi kernel:        **********************************
Sep 30 15:59:18 opi kernel:        * <COMPLETE RAID STATE PRINTOUT> *
Sep 30 15:59:18 opi kernel:        **********************************
Sep 30 15:59:18 opi kernel: md1: <sdi1><sdh1><sdg1><sdf1> array
superblock:
Sep 30 15:59:18 opi kernel:   SB: (V:0.90.0)
ID:<09376505.0a815ae7.1f23af66.c8081184> CT:379c34f8
Sep 30 15:59:18 opi kernel:      L5 S08956096 ND:4 RD:4 md1 LO:2 CS:32768
Sep 30 15:59:18 opi kernel:      UT:37f36cb6 ST:0 AD:3 WD:3 FD:1 SD:0
CSUM:1428b103 E:0000002e
Sep 30 15:59:18 opi kernel:      D  0:  DISK<N:0,sdf1(8,81),R:0,S:1>
Sep 30 15:59:18 opi kernel:      D  1:  DISK<N:1,sdg1(8,97),R:1,S:6>
Sep 30 15:59:18 opi kernel:      D  2:  DISK<N:2,sdh1(8,113),R:2,S:6>
Sep 30 15:59:18 opi kernel:      D  3:  DISK<N:3,sdi1(8,129),R:3,S:6>
Sep 30 15:59:18 opi kernel:      D  4:  DISK<N:4,[dev 00:00](0,0),R:4,S:9>
Sep 30 15:59:18 opi kernel:      D  5:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D  6:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D  7:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D  8:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D  9:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D 10:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      D 11:  DISK<N:0,[dev 00:00](0,0),R:0,S:4>
Sep 30 15:59:18 opi kernel:      THIS:  DISK<N:3,sdi1(8,129),R:3,S:6>

...the logfile goes on, printing one of these "printouts" for each
device...

------------------------------------------------------------------

I'm a bit worried by this "scsidisk I/O error: dev 08:51, sector
13387714". Could be some hardware error in this particular disk, that
somehow produces the timeout, causing all the raid trouble. My other idea
about the SCSI problems, is that this is a termination issue. These disks
are LVD scsi disks running on an Adaptec U2W scsi-adapter. However, these
disks require an external terminator, and the only one we could get our
hands on was a regular UW-SCSI terminator, which is not right. So, the
disks are running in standard Ultra SE-mode at 40Mb/s. When i get a hold
of the right terminator, i hope these timeouts will go away.

Thanks, hope anyone can sort out what the error is here!

//Jocke

Joakim Ahlén
Mediaprint, Uddevalla AB
Sweden

-------------------------------------------------------------------

Some data:

Machine:
PIII-450, 384Mb RAM, MS-6163 motherboard
one Adaptec U2W-scsi adapter running _only_ with the four IBM DNES-309170W
9Gb LVD-disks, in Ultra SE mode. (not LVD)
one Adaptec UW-scsi adapter with boot-disks and a bunch of other
scsidevices

Linux:
Linux Redhat 6.0, upgraded with kernel 2.2.5-15 with raid0145 patch.
Raidutils: raidtools-0.90-3, the one shipped with RH6.

well... that's about it.
Raid bug, or is it my hardware?

Reply via email to