Jeffrey,

On Sun, 6 Aug 2000, Gregory Leblanc wrote:
> On Sun, 6 Aug 2000, Jeffrey Paul wrote:
> > hmmmm, the day I had hoped would never arrive has...

It's _always_ waiting :(

> > Aug  2 07:38:27 chrome kernel: raid1: Disk failure on hdg1, 
> > disabling device.

OK, so it thinks hdg1 is faulty...

> > Aug  2 07:38:27 chrome kernel: raid1: md0: rescheduling block 8434238
> > Aug  2 07:38:27 chrome kernel: md0: no spare disk to reconstruct 
> > array! -- continuing in degraded mode
> > Aug  2 07:38:27 chrome kernel: raid1: md0: redirecting sector 8434238 
> > to another mirror
> > 
> > my setup is a two-disk (40gb each) raid1 configuration... hde1 and 
> > hdg1.   I didn't have measures in place to notify me of such an 
> > event, so I didn't notice it until I looked at the console today and
> > noticed it there...

In 'degraded' mode it is basically just a normal disk without
redundancy. Nothing bad is going to happen just because you're still
running it in degraded mode.
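
For what it's worth, you can see what md currently thinks of the array
with:

 cat /proc/mdstat

On a two-disk raid1, a degraded array typically shows up as [U_] (or
[_U]) instead of [UU], with the failed mirror marked (F).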

> I think I ran for about 2 weeks on a dead drive.  Thankfully it wasn't a
> production system, but notification isn't quite as "out of the box" as it
> needs to be just yet.

A simple cron script is probably the way to go.
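
Something along these lines, run from cron, would do. This is only a
rough sketch - the script name, the path, mailing root and the grep
pattern are all just examples, so adjust them for your own setup and
your kernel's /proc/mdstat format:

 #!/bin/sh
 # raidcheck.sh - mail root if an md array looks degraded
 # a degraded two-disk mirror shows [U_] or [_U], a failed disk (F)
 if grep -q '_U\|U_\|(F)' /proc/mdstat; then
     cat /proc/mdstat | mail -s "RAID degraded on `hostname`" root
 fi

and then something like this in /etc/crontab (hourly, as an example):

 0 * * * *  root  /usr/local/sbin/raidcheck.sh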

> > I ran raidhotremove /dev/md0 /dev/hdg1 and then raidhotadd /dev/md0 
> > /dev/hdg1 and it seemed to begin reconstruction:

I don't understand why you did this... it thinks the disk is failed, and
yet you are using raidhotadd to reinsert it into the array. The idea is
that you replace the disk, and _then_ raidhotadd the new disk.

Having said that, no harm done - if the disk really is bad, it will
presumably just get kicked out of the array again at some later date.

> > but I got scared and decided to stop it...  so now it's sitting idle 
> > unmounted spun down (both disks) awaiting professional advice (rather 
> > than me stumbling around in the dark before I hose my data).   Both 

I think what you need to do is to test the disk to see if it's really
faulty. If it's a Maxtor DiamondMax Plus 40 (the only 40G disk I'm aware
of), then try:

 badblocks -s -v /dev/hdg1 40017915

(or substitute the correct number of blocks for your partition)
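
If you're not sure of the figure, something like this will tell you -
the Blocks column is in 1024-byte units, which is the default block
size badblocks uses:

 fdisk -l /dev/hdg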

If this succeeds, you may want to try it with the '-w' option (enable
writes). This takes a *VERY* *LONG* *TIME* though. I believe it could be
several *DAYS* on a disk this size, since it repeatedly writes to the disk
and then reads the data back to check for errors.
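
For reference, the destructive version would look something like:

 badblocks -w -s -v /dev/hdg1 40017915

Note that -w overwrites every block on the partition, so only run it on
a disk that is already out of the array and whose contents you can
afford to lose.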

> > disks are less than two weeks old, although I have heard of people 
> > having similar problems (disks failing in less than a month new from 
> > the factory) with this brand and model....   I would like to get the 

In my experience 95% of drive failures occur in the first couple of
weeks. If a drive makes it past that initial period, I find it usually
lasts for a long time. I don't think this is a failing of this
particular brand and/or model.

> > drives back to the way they were before the system decided that the 
> > disk had failed (what causes it to think that, anyways?) and see if 
> > it continues to work, as I find it hard to believe that the drive 
> > would have died so quickly.   What is the proper course of action?

It is entirely possible that it has failed (but luckily you'll get a
replacement really quickly when it fails so early). You can continue to
run the system in degraded mode for the moment, as long as you're aware
that there's no redundancy. If you confirm it's faulty, then I'd return
it, get the new disk, and then raidhotadd it back into the array.
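
Once the replacement is in place (partitioned the same as hde1), the
sequence is roughly what you already did - a sketch, assuming hdg1 is
still marked failed in the array:

 raidhotremove /dev/md0 /dev/hdg1   # drop the failed mirror, if still listed
 raidhotadd /dev/md0 /dev/hdg1      # add the new disk; resync starts by itself
 cat /proc/mdstat                   # watch the reconstruction progress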

> First, do you have ANY log messages from anything other than RAID indicating
> a failed disk?  Since these are IDE drives, I'd expect some messages from
> the IDE subsystem if the drive really had died (my SCSI messages went pretty
> wild when I had a disk fail).

I'd agree with Gregory here - I'd definitely expect something else in
the logs (IDE bus resets, perhaps). The disk may well be fine, and just
got ejected from the array by gremlins...
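
A quick way to check is to grep the kernel log for the drive itself
(assuming your syslog puts kernel messages in /var/log/messages):

 grep hdg /var/log/messages
 dmesg | grep -i hdg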

> To check and see if the drive is actually in good condition, grab the
> diagnostic utility from the support site of your drive manufacturer,
> boot from a DOS floppy, and run diagnostics on the drive.

I have to confess I've never heard of manufacturers offering diagnostic
utilities for disks... Gregory, can you point me at any examples? Am I
just being a complete dumbass here?

> In order for them to replace my drives, I've had to do "write"
> testing, which destroys all data on the drive, so you may want to disconnect
> power from one of the drives before you play around with that.

If you don't trust yourself to run the write test on the right disk,
then disconnecting the other drive is a sensible precaution. However,
if you check *EXACTLY* what you are doing before running a write test,
I don't see any reason to go so far as to unplug the disks. YMMV.
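
One way to be sure you've got the right drive is to compare the model
and serial numbers before you start - hdparm -i reads the drive's
identification data, which you can check against the label on the disk
you intend to write-test:

 hdparm -i /dev/hdg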

Regards,

Corin

/------------------------+-------------------------------------\
| Corin Hartland-Swann   | Direct: +44 (0) 20 7544 4676        |
| Commerce Internet Ltd  | Mobile: +44 (0) 79 5854 0027        |
| 22 Cavendish Buildings |    Tel: +44 (0) 20 7491 2000        |
| Gilbert Street         |    Fax: +44 (0) 20 7491 2010        |
| Mayfair                |    Web: http://www.commerce.uk.net/ |
| London W1K 5HJ         | E-Mail: [EMAIL PROTECTED]        |
\------------------------+-------------------------------------/
