Re: Please help - when is a bad disk a bad disk?

Mike Bilow Wed, 12 Apr 2000 15:53:23 -0700
On Tue, 11 Apr 2000, Darren Nickerson wrote:

> >>>>> On Tue, 11 Apr 2000, "Lance" == D. Lance Robinson wrote:
> 
>   Lance> So, if the md driver doesn't fail a drive that is because the lower
>   Lance> levels have taken care of all the nitty details and have supposedly
>   Lance> performed the requested data transfer correctly. As long as the actual
>   Lance> device drivers do the requests, the md driver won't know about any
>   Lance> problems.
> 
> Thank-you for this explanation. The state of the array leads me to suspect
> that the IDE layer was NOT "doing the right thing" and was never getting those
> writes properly flushed, but I really cannot say this with any confidence. 
> Clearly the "write beyond end of device" followed by all my disks being kicked 
> from the array is indicating something badly wrong, but I'm on the fence in 
> regards labelling this as a hardware or a software problem.

What I think is going on here is that you have a filesystem error.  That
is, you have an inode that effectively says there is data located
somewhere in the general area of infinity.  Naturally, when the filesystem
code calls for accessing a sector at an effectively infinite address, this
request gets passed down to the hardware driver and eventually the actual
drive itself.  Of course, the drive dutifully reports "no such sector" or
"attempt to access beyond end of device" or some appropriate error, but --
and this is the critical problem -- the error filters up as a disk error
(which it technically is) and causes your RAID layer to remove the device. 
The RAID layer then reissues the request for a different device, which
also responds identically, and this device gets removed from the array. 
In quick succession, RAID sees all of the devices reporting this error and
removes them all from the array: pop-pop-pop.
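
To make the cascade concrete, here is a toy model in Python. It is only a sketch of the behavior described above, not the md driver's actual code: the Disk class, the naive_mirror_read function, the device names and the sector counts are all made up for illustration.

```python
class Disk:
    """Hypothetical stand-in for one array member with a fixed sector count."""

    def __init__(self, name, sectors):
        self.name = name
        self.sectors = sectors

    def read(self, sector):
        # The drive itself is healthy; it simply has no such sector.
        if sector >= self.sectors:
            raise IOError(f"{self.name}: attempt to access beyond end of device")
        return f"data@{sector}"


def naive_mirror_read(members, sector):
    """Try each member in turn, removing every member that reports an error."""
    kicked = []
    for disk in list(members):
        try:
            return disk.read(sector), kicked
        except IOError:
            # Any error is treated as a failed disk (assumed policy).
            members.remove(disk)
            kicked.append(disk.name)
    return None, kicked


# Hypothetical three-way mirror of 1000-sector devices.
members = [Disk("hda1", 1000), Disk("hdc1", 1000), Disk("hde1", 1000)]

# A corrupt inode points at an absurd block number.
data, kicked = naive_mirror_read(members, 2**40)
print(data, kicked, len(members))
```

Every member rejects the same bogus sector for the same reason, so a single bad request empties the array even though no drive has actually failed.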

I am not sure what the proper behavior is here.  I do know from personal
experience that some really failed drives do report all access attempts as
"beyond end of device."  I had an old Quantum 512 MB SCSI drive do this to
me once, and it ended up having to go back to Quantum.  (Quantum stores
drive configuration information on the media in a customer-inaccessible
place, so if this information goes corrupt the drive can do very odd
things like claim to be a zero-length device.)  So configuring RAID to
disregard "attempt to access beyond end of device" may not be a good idea,
but this may be the only proper option.

At best, I think some sort of compromise would be appropriate.  That is,
it may be reasonable to remove all but the last device when this occurs.
I certainly cannot see any merit to taking down the whole system because
of a corrupt inode.  Worse, I'm not sure if there is any way to fix the
corrupt inode automatically, since fsck has to run above the RAID layer.
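
A rough sketch of the "remove all but the last device" compromise, again as a toy Python model rather than anything taken from the md driver. The Disk class, read_keep_last, the device names and the sector counts are hypothetical.

```python
class Disk:
    """Hypothetical stand-in for one array member with a fixed sector count."""

    def __init__(self, name, sectors):
        self.name = name
        self.sectors = sectors

    def read(self, sector):
        if sector >= self.sectors:
            raise IOError(f"{self.name}: attempt to access beyond end of device")
        return f"data@{sector}"


def read_keep_last(members, sector):
    """Remove failing members, but never remove the final remaining one.

    Returns (data, kicked_names, error_message).
    """
    kicked = []
    for disk in list(members):
        try:
            return disk.read(sector), kicked, None
        except IOError as err:
            if len(members) == 1:
                # Last member: pass the error upward and keep the array alive.
                return None, kicked, str(err)
            members.remove(disk)
            kicked.append(disk.name)
    return None, kicked, "no members"


# Hypothetical three-way mirror hit by a request far beyond the device end.
members = [Disk("hda1", 1000), Disk("hdc1", 1000), Disk("hde1", 1000)]
data, kicked, error = read_keep_last(members, 2**40)
print(data, kicked, error, [d.name for d in members])
```

The bad request still costs two members, but the array stays up in degraded form and the filesystem sees an ordinary I/O error for that one block.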

-- Mike