On 07 Apr 07, at 15:25, Davor Ocelic wrote:

> On Thu, 05 Apr 2007 19:01:12 -0700
> Adam Megacz <[EMAIL PROTECTED]> wrote:
>
>>
>> Davor Ocelic <[EMAIL PROTECTED]> writes:
>>> Someone help me, from the output below, does "hdb1[1] hda1[0]"
>>> and "[_U]" indicate errors on hda or hdb ?
>>
>>>> md1 : active raid1 hdb1[1] hda1[0](F)
>>>>       1003904 blocks [2/1] [_U]
>>
>> The drives with "(F)" next to them have failed, so that would be
>> hda.
>
> Hey, folks from the -sysadmin list, do you have any insight into this
> type of problem?
>
> Did some of you already have a RAID 1 disk fail as shown?


Yep.  Many times, not only with Linux software RAID but also with
various other RAID solutions.  The symptoms are similar: the system
runs fine for extended periods, and then the Linux RAID monitoring
system suddenly reports a disk failure with the same error display as
shown above.
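For the record, the "(F)" marker is what flags the failed member in
that display.  Here's a small sketch that pulls failed-member names
out of mdstat-style output, using the sample lines quoted above
(against a live system you'd read /proc/mdstat itself):

```shell
# Sample lines from the thread; on a real system: cat /proc/mdstat
sample='md1 : active raid1 hdb1[1] hda1[0](F)
      1003904 blocks [2/1] [_U]'

# Members tagged (F) have been kicked out as failed; strip the
# [index](F) suffix to get the bare device name (here: hda1).
printf '%s\n' "$sample" | grep -o '[a-z0-9]*\[[0-9]*](F)' | sed 's/\[.*//'
```

The "[2/1] [_U]" part says the same thing from the other side: two
members expected, one active, and the underscore marks the missing
first slot, i.e. hda1.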


> What were
> the ways of making sure it really was a faulty disk and not some other
> problem?


Adam's suggestion to use the SMART monitoring tools is a good one - I  
second that.
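For the archives, the usual smartmontools invocations look like the
following.  The device name is taken from this thread, and the sample
attribute values are made up for illustration; check smartctl(8)
before relying on any of this:

```shell
# These need root and a real disk, so they're shown commented out:
#   smartctl -H /dev/hda        # overall health self-assessment
#   smartctl -A /dev/hda        # vendor attribute table
#   smartctl -t long /dev/hda   # start an extended self-test

# A sample attribute line in 'smartctl -A' layout (values illustrative);
# a non-zero raw Reallocated_Sector_Ct is a classic sign of a dying disk:
line='  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       117'
printf '%s\n' "$line" | awk '$2 == "Reallocated_Sector_Ct" { print $NF }'
```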

However, the only time I've had anything that could be considered a  
false-positive in this context was when the cable to the drive in  
question was bad.  So, the drive really was "bad" from the  
perspective of the system, but the hardware was actually fine once we  
replaced the cable.  The difference is that this was immediately  
apparent upon boot, whereas in your situation it's apparently been  
running fine for months and has only recently been identified as  
failed.  So, I think it's very unlikely that you're seeing a
false-positive.

One way to be sure is to run the manufacturer's disk diagnostic  
utility on the drive in question.  Such utilities usually run from a  
manufacturer-packaged DOS boot disk or CD, which means you can't run
them while the Linux system is up.  Assuming you pay for someone's
time to do this, and given the low probability of a false positive,  
it's almost certainly cheapest and best to simply conclude that the  
disk in question is bad and replace it.  You'll still need to  
schedule downtime, but you needn't worry about spending time  
verifying that the previous disk was bad.


> Finally, did you do anything before replacing the disk


Run the manufacturer's diagnostic tool on the new disk, and/or a  
Linux-based disk verification process.  (This is to make sure you're  
not starting out with a dud.)
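The crudest Linux-based check is simply reading the disk end-to-end
and seeing whether any I/O errors turn up (badblocks -sv from
e2fsprogs does a more thorough job).  A sketch, demonstrated on a
scratch file rather than a real disk:

```shell
# Stand-in "disk" so the sketch is safe to run anywhere; on a real
# (new, still-empty) drive you'd use something like if=/dev/hda.
dd if=/dev/zero of=scratch.img bs=1k count=64 2>/dev/null

# Read every block; dd exits non-zero on an I/O error.
if dd if=scratch.img of=/dev/null bs=1k 2>/dev/null; then
    echo "read test passed"
else
    echo "read test FAILED"
fi
rm -f scratch.img
```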


> , and how did
> you get a new disk to become a member of the array and get synced?


Rather than reciting possibly-incorrect commands from memory, I'd  
simply refer you to the documentation for the Linux RAID tools (mdadm  
and kin).  My understanding is that you'd add the new disk to the  
RAID-1 (mirrored) array the same way you would if you'd just built  
the array and were adding a member.  Just make sure you don't  
accidentally overwrite the existing drive with data from the blank  
new drive.  :)
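That said, the sequence usually boils down to something like the
following.  Device names are the ones from this thread, and since I'm
writing from memory, the wrapper only prints each command so you can
check the sequence against man mdadm before running anything for real:

```shell
run() { echo "would run: $*"; }   # dry-run wrapper; drop the echo to execute

run mdadm /dev/md1 --fail /dev/hda1     # mark the member failed (if the kernel hasn't already)
run mdadm /dev/md1 --remove /dev/hda1   # detach it from the array
# ...power down, swap the physical disk, partition it to match hdb...
run mdadm /dev/md1 --add /dev/hda1      # add the replacement; resync starts on its own
run cat /proc/mdstat                    # watch the resync progress
```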


> Any comments are helpful.


What is the state of your backups?  It's safe to conclude that the disk
in question really has failed, which means you're one disk failure  
away from downtime (at minimum) or catastrophic downtime (if you  
don't have backups).  If a heat or vibration/impact problem caused or  
contributed to the first disk failure, the same problem is likely to  
affect the remaining disk.

Also, were both disks purchased at about the same time, from the same  
vendor?  If so, the other may be nearing the end of its life as well,  
considering that they've likely seen identical usage patterns.  This  
would add urgency to the situation.

best,

Graham


_______________________________________________
HCoop-SysAdmin mailing list
[email protected]
http://hcoop.net/cgi-bin/mailman/listinfo/hcoop-sysadmin
