RE: owie, disk failure

2000-08-07 Thread Corin Hartland-Swann


Jeffrey,

On Sun, 6 Aug 2000, Gregory Leblanc wrote:
 On Sun, 6 Aug 2000, Jeffrey Paul wrote:
  h, the day i had hoped would never arrive has...

It's _always_ waiting :(

  Aug  2 07:38:27 chrome kernel: raid1: Disk failure on hdg1, 
  disabling device.

OK, so it thinks hdg1 is faulty...

  Aug  2 07:38:27 chrome kernel: raid1: md0: rescheduling block 8434238
  Aug  2 07:38:27 chrome kernel: md0: no spare disk to reconstruct 
  array! -- continuing in degraded mode
  Aug  2 07:38:27 chrome kernel: raid1: md0: redirecting sector 8434238 
  to another mirror
  
  my setup is a two-disk (40gb each) raid1 configuration... hde1 and 
  hdg1.   I didn't have measures in place to notify me of such an 
  event, so I didn't notice it until I looked at the console today and 
  noticed it there...

In 'degraded' mode the array is basically just a normal disk without
redundancy. Nothing bad is going to happen just because you're still
running it in degraded mode.

 I think I ran for about 2 weeks on a dead drive.  Thankfully it wasn't a
 production system, but notification isn't quite as "out of the box" as it
 needs to be just yet.

A simple cron script is probably the way to go.
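Something along these lines would do - a rough and untested sketch, mind
you; the mdstat pattern, the script path and the mail command are all
assumptions you'd want to adjust for your own setup:

 #!/bin/sh
 # check-raid: mail root if /proc/mdstat shows a degraded array.
 # Assumes a missing mirror shows up as '_' inside the [UU] status
 # field (e.g. [U_]) - check this against your kernel's mdstat output.
 if egrep -q '\[[U_]*_[U_]*\]' /proc/mdstat; then
     mail -s "RAID degraded on `hostname`" root < /proc/mdstat
 fi

and a crontab entry to run it every fifteen minutes or so (the */15 step
syntax needs Vixie cron):

 */15 * * * * /usr/local/sbin/check-raid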

  I ran raidhotremove /dev/md0 /dev/hdg1 and then raidhotadd /dev/md0 
  /dev/hdg1 and it seemed to begin reconstruction:

I don't understand why you did this... it thinks the disk has failed, and
yet you are using raidhotadd to reinsert it into the array. The idea is
that you replace the disk, and _then_ raidhotadd the new disk.

Having said that, there's nothing wrong with what you did - it will
presumably just fail again at some later date.
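
For what it's worth, you can keep an eye on the state of the array (and
on a rebuild, if you let one run) with:

 # the [2/2] [UU] fields show how many mirrors are configured vs. how
 # many are actually up ([2/1] [U_] while degraded); a progress/finish
 # line appears under the md0 entry while a resync is running
 cat /proc/mdstat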

  but I got scared and decided to stop it...  so now it's sitting idle 
  unmounted spun down (both disks) awaiting professional advice (rather 
  than me stumbling around in the dark before I hose my data).   Both 

I think what you need to do is to test the disk to see if it's really
faulty. If it's a Maxtor DiamondMax Plus 40 (the only 40G disk I'm aware
of), then try:

 badblocks -s -v /dev/hdg1 40017915

(or substitute the correct number of blocks for your partition)
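
If you're not sure of the block count, you can get it from the kernel
rather than working it out by hand - for example (assuming the stock
util-linux tools are installed):

 # size of the partition in 1K blocks, which is what badblocks counts
 # in by default
 fdisk -s /dev/hdg1
 # the same figure is in the '#blocks' column here, if your kernel
 # provides /proc/partitions
 grep hdg1 /proc/partitions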

If this succeeds, you may want to try it with the '-w' option (the
write-mode test, which overwrites everything on the partition). This
takes a *VERY* *LONG* *TIME* though. I believe it could be several
*DAYS* on a disk this size, since it repeatedly writes test patterns to
the disk and then reads them back to check for errors.
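
For reference, the write test is just the same command with '-w' added -
triple-check the device name first:

 # DESTRUCTIVE: writes test patterns over the whole of hdg1 and reads
 # them back; anything on the partition is gone afterwards
 badblocks -w -s -v /dev/hdg1 40017915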

  disks are less than two weeks old, although I have heard of people 
  having similar problems (disks failing in less than a month new from 
  the factory) with this brand and model   I would like to get the 

In my experience 95% of drive failures occur in the first couple of
weeks. If they get out of this timeframe, then I find they usually last
for a long time. I don't think this is a failing of this brand and/or
model.

  drives back to the way they were before the system decided that the 
  disk had failed (what causes it to think that, anyways?) and see if 
  it continues to work, as I find it hard to believe that the drive 
  would have died so quickly.   What is the proper course of action?

It is entirely possible that it has failed (but luckily you'll get a
replacement really quickly when it fails so early). You can continue to
run the system in degraded mode for the moment, as long as you're aware
that there's no redundancy. If you confirm it's faulty, then I'd return
it, get the new disk, and then raidhotadd it back into the array.
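
Once the replacement is in, the procedure is roughly: give it the same
partition layout as the surviving disk, then hot-add it and let the
resync run. A sketch from memory - double-check the device names before
running anything, and it assumes sfdisk is installed and the new drive
really does come up as hdg:

 # copy the partition table from the good disk to the new one (this
 # also carries over the 0xfd raid-autodetect partition type)
 sfdisk -d /dev/hde | sfdisk /dev/hdg
 # add the fresh partition back into the mirror and let it rebuild
 raidhotadd /dev/md0 /dev/hdg1
 # then keep an eye on /proc/mdstat until the resync finishes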

 First, do you have ANY log messages from anything other than RAID indicating
 a failed disk?  Since these are IDE drives, I'd expect some messages from
 the IDE subsystem if the drive really had died (my SCSI messages went pretty
 wild when I had a disk fail).

I'd agree with Gregory here - I'd definitely expect something else in the
logs (IDE bus resets, perhaps). The disk may well be fine, and just got
ejected from the array by gremlins...
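
A quick grep through the logs should settle it - something like this,
assuming syslog is going to the usual /var/log/messages:

 # anything the kernel said about hdg around the time of the failure
 grep hdg /var/log/messages | less
 # IDE trouble usually shows up with 'error' somewhere in the message
 grep -i 'hd[eg].*error' /var/log/messages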

 To check and see if the drive is actually in good condition, grab the
 diagnostic utility from the support site of your drive manufacturer,
 boot from a DOS floppy, and run diagnostics on the drive.

I have to confess I've never heard of manufacturers offering diagnostic
utilities for disks... Gregory, can you point me at any examples? Am I
just being a complete dumbass here?

 In order for them to replace my drives, I've had to do "write"
 testing, which destroys all data on the drive, so you may want to disconnect
 power from one of the drives before you play around with that.

If you don't trust yourself to get the right disk for a write test then
you need to do this. However, if you check *EXACTLY* what you are doing
before running a write-test, then I don't see any reason to go so far as
to unplug the disks. YMMV.
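
One sanity check that costs nothing: before running any write test,
confirm which physical drive is which from its reported model and serial
number, e.g. (assuming hdparm is installed):

 # prints the drive's identification data, including model and serial
 # number - compare it against the label on the suspect disk
 hdparm -i /dev/hdg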

Regards,

Corin

| Corin Hartland-Swann   | Direct: +44 (0) 20 7544 4676 |
| Commerce Internet Ltd  | Mobile: +44 (0) 79 5854 0027 |
| 22 Cavendish Buildings | Tel:    +44 (0) 20 7491 2000 |

RE: owie, disk failure

2000-08-07 Thread Danilo Godec

On Mon, 7 Aug 2000, Corin Hartland-Swann wrote:

 I have to confess I've never heard of manufacturers offering diagnostic
 utilities for disks... Gregory, can you point me at any examples? Am I
 just being a complete dumbass here?

At least Western Digital does, on their FTP site at
ftp://ftp.wdc.com/pub/drivers/hdutil - though I don't know what those
utils do any better than badblocks & friends.


   D.





RE: owie, disk failure

2000-08-07 Thread Gregory Leblanc

   disks are less than two weeks old, although I have heard of people
   having similar problems (disks failing in less than a month new from
   the factory) with this brand and model   I would like to get the

 In my experience 95% of drive failures occur in the first couple of
 weeks. If they get out of this timeframe, then I find they usually last
 for a long time. I don't think this is a failing of this brand and/or
 model.

Well, from the drives that I've had, they either fail after a few weeks, or
after several years (like 5+).  Almost never in between.  We keep a spare
drive of each size around anyway.  :-)

  To check and see if the drive is actually in good condition, grab the
  diagnostic utility from the support site of your drive manufacturer,
  boot from a DOS floppy, and run diagnostics on the drive.

 I have to confess I've never heard of manufacturers offering diagnostic
 utilities for disks... Gregory, can you point me at any examples? Am I
 just being a complete dumbass here?

Yes, you are.  :-)  From Maxtor's site (since I just RMA'd a drive last week)
(http://www.maxtor.com/) click on software download.  Right on that page is
info about the MaxDiag utility.  It does a little more than badblocks and
friends, at least for IDE drives.  It will return drive specific error
codes, and if you've run all of those tests by the time you call support,
you can just give them the error numbers, and they issue an RMA.  The other
nice feature is that it gives you the tech support number to call as soon as
it shows the error.  :-)

  In order for them to replace my drives, I've had to do "write"
  testing, which destroys all data on the drive, so you may want to
  disconnect power from one of the drives before you play around with
  that.

 If you don't trust yourself to get the right disk for a write test then
 you need to do this. However, if you check *EXACTLY* what you are doing
 before running a write-test, then I don't see any reason to go so far
 as to unplug the disks. YMMV.

Well, that's true, but if you don't trust yourself to get the right drive,
then you should unplug the one that still has the data intact.  Depending on
the value of the data, it may be worth unplugging it just for safety's sake,
although if it's that important, it should be backed up.  Later,
Greg



RE: owie, disk failure

2000-08-06 Thread Gregory Leblanc

 -Original Message-
 From: Jeffrey Paul [mailto:[EMAIL PROTECTED]]
 Sent: Sunday, August 06, 2000 5:56 PM
 To: [EMAIL PROTECTED]
 Subject: owie, disk failure
 
 h, the day i had hoped would never arrive has...
 
 Aug  2 07:38:27 chrome kernel: raid1: Disk failure on hdg1, 
 disabling device.
 Aug  2 07:38:27 chrome kernel: raid1: md0: rescheduling block 8434238
 Aug  2 07:38:27 chrome kernel: md0: no spare disk to reconstruct 
 array! -- continuing in degraded mode
 Aug  2 07:38:27 chrome kernel: raid1: md0: redirecting sector 8434238 
 to another mirror
 
 my setup is a two-disk (40gb each) raid1 configuration... hde1 and 
 hdg1.   I didn't have measures in place to notify me of such an 
 event, so I didn't notice it until I looked at the console today and 
 noticed it there...

I think I ran for about 2 weeks on a dead drive.  Thankfully it wasn't a
production system, but notification isn't quite as "out of the box" as it
needs to be just yet.

 I ran raidhotremove /dev/md0 /dev/hdg1 and then raidhotadd /dev/md0 
 /dev/hdg1 and it seemed to begin reconstruction:
 
 but I got scared and decided to stop it...  so now it's sitting idle 
 unmounted spun down (both disks) awaiting professional advice (rather 
 than me stumbling around in the dark before I hose my data).   Both 
 disks are less than two weeks old, although I have heard of people 
 having similar problems (disks failing in less than a month new from 
 the factory) with this brand and model   I would like to get the 
 drives back to the way they were before the system decided that the 
 disk had failed (what causes it to think that, anyways?) and see if 
 it continues to work, as I find it hard to believe that the drive 
 would have died so quickly.   What is the proper course of action?

First, do you have ANY log messages from anything other than RAID indicating
a failed disk?  Since these are IDE drives, I'd expect some messages from
the IDE subsystem if the drive really had died (my SCSI messages went pretty
wild when I had a disk fail).  To check and see if the drive is actually in
good condition, grab the diagnostic utility from the support site of your
drive manufacturer, boot from a DOS floppy, and run diagnostics on the
drive.  In order for them to replace my drives, I've had to do "write"
testing, which destroys all data on the drive, so you may want to disconnect
power from one of the drives before you play around with that.  If the disk
is good, then you're all set.  If not, get it replaced.  I've seen drives
fail very quickly before, it's always been a manufacturing defect of some
kind.  HTH,
Greg