On 07 Apr 07, at 15:25, Davor Ocelic wrote:

> On Thu, 05 Apr 2007 19:01:12 -0700
> Adam Megacz <[EMAIL PROTECTED]> wrote:
>
>> Davor Ocelic <[EMAIL PROTECTED]> writes:
>>> Someone help me, from the output below, does "hdb1[1] hda1[0]"
>>> and "[_U]" indicate errors on hda or hdb?
>>
>>>> md1 : active raid1 hdb1[1] hda1[0](F)
>>>>       1003904 blocks [2/1] [_U]
>>
>> The drives with "(F)" next to them have failed, so that would be
>> hda.
>
> Hey, folks from the -sysadmin list, do you have any insight into
> this type of problem?
>
> Did some of you already have a RAID 1 disk fail as shown?
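For the record, the answer can be read straight off /proc/mdstat: a member tagged "(F)" is the faulty one, and each "_" in the "[_U]" map marks a failed or missing slot. A small shell sketch of that reading (the sample line is the one quoted above; the parsing one-liner is my own illustration, not part of the md tools):

```shell
# Pick out failed members of an md array: anything tagged "(F)".
# The sample line is the mdstat output quoted in the thread.
mdstat='md1 : active raid1 hdb1[1] hda1[0](F)'

# Grab tokens like "hda1[0](F)", then strip the bracketed suffix.
failed=$(echo "$mdstat" | grep -o '[a-z0-9]*\[[0-9]*\](F)' | sed 's/\[.*//')

echo "failed member: $failed"   # prints: failed member: hda1
```

The same extraction works against the live file, e.g. `grep -o '...' /proc/mdstat`, if you want a monitoring script rather than eyeballing the output.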
Yep. Many times, not only with Linux software RAID but also with
various other RAID solutions. Similar symptoms: the system runs fine
for extended periods, and then suddenly the Linux RAID monitoring
system reports a disk failure with the same error display as shown
above.

> What were the ways of making sure it really was a faulty disk and
> not some other problem?

Adam's suggestion to use the SMART monitoring tools is a good one - I
second that.

However, the only time I've had anything that could be considered a
false positive in this context was when the cable to the drive in
question was bad. So the drive really was "bad" from the perspective
of the system, but the hardware itself was fine once we replaced the
cable. The difference is that that failure was immediately apparent
at boot, whereas in your situation the system has apparently been
running fine for months and has only recently been flagged as failed.
So I think it's very unlikely that you're seeing a false positive.

One way to be sure is to run the manufacturer's disk diagnostic
utility on the drive in question. Such utilities usually run from a
manufacturer-packaged DOS boot disk or CD, which means you can't run
them while the Linux system is up.

Assuming you pay for someone's time to do this, and given the low
probability of a false positive, it's almost certainly cheapest and
best to simply conclude that the disk in question is bad and replace
it. You'll still need to schedule downtime, but you needn't spend
time verifying that the old disk was bad.

> Finally, did you do anything before replacing the disk

Run the manufacturer's diagnostic tool on the new disk, and/or a
Linux-based disk verification process. (This is to make sure you're
not starting out with a dud.)

> , and how did you get a new disk to become a member of the array
> and get synced?
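To act on the SMART suggestion, a minimal check might look like the following. This is a sketch only: /dev/hda is the suspect drive from the thread, and the commands assume the smartmontools package is installed; adjust to your setup.

```shell
DEV=/dev/hda  # the drive mdstat reported as failed (from the thread)

if command -v smartctl >/dev/null 2>&1; then
    smartctl -H "$DEV"        # overall health self-assessment (PASSED/FAILED)
    smartctl -a "$DEV"        # full attribute table and drive error log
    smartctl -t short "$DEV"  # queue a short offline self-test
else
    echo "smartmontools not installed; install it to run these checks"
fi
```

A FAILED health status or a growing reallocated-sector count in the attribute table is strong confirmation; even a PASSED result doesn't fully clear a drive the kernel has already kicked out of the array.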
Rather than reciting possibly-incorrect commands from memory, I'd
simply refer you to the documentation for the Linux RAID tools (mdadm
and kin). My understanding is that you'd add the new disk to the
RAID-1 (mirrored) array the same way you would if you'd just built
the array and were adding a member. Just make sure you don't
accidentally overwrite the surviving drive with data from the blank
new drive. :)

> Any comments are helpful.

What is the state of your backups? It's safe to conclude that the
disk in question really has failed, which means you're one disk
failure away from downtime (at minimum) or catastrophic data loss (if
you don't have backups).

If a heat or vibration/impact problem caused or contributed to the
first disk's failure, the same problem is likely to affect the
remaining disk. Also, were both disks purchased at about the same
time, from the same vendor? If so, the other may be nearing the end
of its life as well, since the two have likely seen identical usage
patterns. That would add urgency to the situation.

best,
Graham

_______________________________________________
HCoop-SysAdmin mailing list
[email protected]
http://hcoop.net/cgi-bin/mailman/listinfo/hcoop-sysadmin
