RE: owie, disk failure
Jeffrey,

On Sun, 6 Aug 2000, Gregory Leblanc wrote:
> On Sun, 6 Aug 2000, Jeffrey Paul wrote:
> > h, the day I had hoped would never arrive has...

It's _always_ waiting :(

> > Aug 2 07:38:27 chrome kernel: raid1: Disk failure on hdg1, disabling device.

OK, so it thinks hdg1 is faulty...

> > Aug 2 07:38:27 chrome kernel: raid1: md0: rescheduling block 8434238
> > Aug 2 07:38:27 chrome kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
> > Aug 2 07:38:27 chrome kernel: raid1: md0: redirecting sector 8434238 to another mirror
> >
> > my setup is a two-disk (40gb each) raid1 configuration... hde1 and
> > hdg1.  I didn't have measures in place to notify me of such an event,
> > so I didn't notice it until I looked at the console today and noticed
> > it there...

In 'degraded' mode it is basically just a normal disk without redundancy.
Nothing bad is going to happen just because you're still running it in
degraded mode.

> I think I ran for about 2 weeks on a dead drive.  Thankfully it wasn't
> a production system, but notification isn't quite as "out of the box"
> as it needs to be just yet.

A simple cron script is probably the way to go.

> > I ran raidhotremove /dev/md0 /dev/hdg1 and then raidhotadd /dev/md0
> > /dev/hdg1 and it seemed to begin reconstruction:

I don't understand why you did this... it thinks the disk has failed,
and yet you are using raidhotadd to reinsert it into the array.  The
idea is that you replace the disk, and _then_ raidhotadd the new disk.
Having said that, there's nothing wrong with what you did - it will
presumably just fail again at some later date.

> > but I got scared and decided to stop it... so now it's sitting idle,
> > unmounted, spun down (both disks), awaiting professional advice
> > (rather than me stumbling around in the dark before I hose my data).
> > Both

I think what you need to do is to test the disk to see if it's really
faulty.
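[The "simple cron script" suggested above could be sketched roughly like
this.  It is only an illustration: the grep pattern, the sample mdstat
text, and the use of mail(1) are this sketch's assumptions, not anything
from the thread.]

```shell
#!/bin/sh
# Sketch of a cron notification script: warn when an md array is
# degraded.  In /proc/mdstat a healthy two-disk RAID1 status reads
# "[UU]"; a failed member shows as an underscore, e.g. "[U_]".

# Print any status lines that contain a failed ("_") member.
check_degraded() {
    grep -E '\[[U_]*_[U_]*\]' "$1"
}

# Demonstrate on a sample of what /proc/mdstat might look like after
# hdg1 has been kicked out of md0 (in the cron job, pass /proc/mdstat):
sample=$(mktemp)
cat > "$sample" <<'EOF'
Personalities : [raid1]
md0 : active raid1 hde1[0] hdg1[1](F)
      39078016 blocks [2/1] [U_]
EOF

if degraded=$(check_degraded "$sample"); then
    # In the real cron job, mail this instead of printing it, e.g.:
    #   echo "$degraded" | mail -s "RAID degraded on $(hostname)" root
    echo "DEGRADED: $degraded"
fi
rm -f "$sample"
```

[Run from cron every few minutes against /proc/mdstat; grep's exit
status makes the "only mail when something is wrong" logic free.]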
If it's a Maxtor DiamondMax Plus 40 (the only 40G disk I'm aware of),
then try:

    badblocks -s -v /dev/hdg1 40017915

(or substitute the correct number of blocks for your partition)

If this succeeds, you may want to try it with the '-w' option (enable
writes).  This takes a *VERY* *LONG* *TIME* though.  I believe it could
be several *DAYS* on a disk this size, since it repeatedly writes to the
disk and then reads the data back to check for errors.

> > disks are less than two weeks old, although I have heard of people
> > having similar problems (disks failing in less than a month new from
> > the factory) with this brand and model.  I would like to get the

In my experience 95% of drive failures occur in the first couple of
weeks.  If they get out of this timeframe, then I find they usually last
for a long time.  I don't think this is a failing of this brand and/or
model.

> > drives back to the way they were before the system decided that the
> > disk had failed (what causes it to think that, anyways?) and see if
> > it continues to work, as I find it hard to believe that the drive
> > would have died so quickly.  What is the proper course of action?

It is entirely possible that it has failed (but luckily you'll get a
replacement really quickly when it fails so early).  You can continue to
run the system in degraded mode for the moment, as long as you're aware
that there's no redundancy.  If you confirm it's faulty, then I'd return
it, get the new disk, and then raidhotadd it back into the array.

> First, do you have ANY log messages from anything other than RAID
> indicating a failed disk?  Since these are IDE drives, I'd expect some
> messages from the IDE subsystem if the drive really had died (my SCSI
> messages went pretty wild when I had a disk fail).

I'd agree with Gregory here - I'd definitely expect something else in
the logs (IDE bus resets, perhaps).  The disk may well be fine, and just
got ejected from the array by gremlins...
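[The badblocks invocation above needs the partition's block count.
Rather than typing it by hand, it can be read from /proc/partitions.  A
sketch, assuming the kernel's "major minor #blocks name" column layout;
the sample table below is illustrative, not taken from the poster's
machine:]

```shell
#!/bin/sh
# Look up a partition's block count and build the suggested badblocks
# command line.  In real use, read /proc/partitions itself instead of
# this illustrative sample.
sample='major minor  #blocks  name

  33     0  40017915  hde
  33     1  40017915  hde1
  34     0  40017915  hdg
  34     1  40017915  hdg1'

PART=hdg1
blocks=$(printf '%s\n' "$sample" | awk -v p="$PART" '$4 == p { print $3 }')

# Read-only surface scan first; only add -w (which DESTROYS all data on
# the partition) once you are certain you have the right disk:
echo "badblocks -s -v /dev/$PART $blocks"
# badblocks -s -v -w /dev/$PART $blocks    # destructive write test
```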
> To check and see if the drive is actually in good condition, grab the
> diagnostic utility from the support site of your drive manufacturer,
> boot from a DOS floppy, and run diagnostics on the drive.

I have to confess I've never heard of manufacturers offering diagnostic
utilities for disks... Gregory, can you point me at any examples?  Am I
just being a complete dumbass here?

> In order for them to replace my drives, I've had to do "write" testing,
> which destroys all data on the drive, so you may want to disconnect
> power from one of the drives before you play around with that.

If you don't trust yourself to get the right disk for a write test then
you need to do this.  However, if you check *EXACTLY* what you are doing
before running a write-test, then I don't see any reason to go so far as
to unplug the disks.  YMMV.

Regards,

Corin

/+------------------------+------------------------------+\
 | Corin Hartland-Swann   | Direct: +44 (0) 20 7544 4676 |
 | Commerce Internet Ltd  | Mobile: +44 (0) 79 5854 0027 |
 | 22 Cavendish Buildings | Tel:    +44 (0) 20 7491 2000 |
RE: owie, disk failure
On Mon, 7 Aug 2000, Corin Hartland-Swann wrote:
> I have to confess I've never heard of manufacturers offering diagnostic
> utilities for disks... Gregory, can you point me at any examples?  Am I
> just being a complete dumbass here?

At least Western Digital does, on their ftp site at
ftp://ftp.wdc.com/pub/drivers/hdutil - however, I don't know what those
utils do, or how much better they are than badblocks and friends.

D.
RE: owie, disk failure
> > disks are less than two weeks old, although I have heard of people
> > having similar problems (disks failing in less than a month new from
> > the factory) with this brand and model.  I would like to get the
>
> In my experience 95% of drive failures occur in the first couple of
> weeks.  If they get out of this timeframe, then I find they usually
> last for a long time.  I don't think this is a failing of this brand
> and/or model.

Well, from the drives that I've had, they either fail after a few weeks,
or after several years (like 5+).  Almost never in between.  We keep a
spare drive of each size around anyway. :-)

> > To check and see if the drive is actually in good condition, grab the
> > diagnostic utility from the support site of your drive manufacturer,
> > boot from a DOS floppy, and run diagnostics on the drive.
>
> I have to confess I've never heard of manufacturers offering diagnostic
> utilities for disks... Gregory, can you point me at any examples?  Am I
> just being a complete dumbass here?

Yes, you are. :-)  From Maxtor's site (http://www.maxtor.com/) - since I
just RMA'd a drive last week - click on software download.  Right on
that page is info about the MaxDiag utility.  It does a little more than
badblocks and friends, at least for IDE drives.  It will return
drive-specific error codes, and if you've run all of those tests by the
time you call support, you can just give them the error numbers and they
issue an RMA.  The other nice feature is that it gives you the tech
support number to call as soon as it shows the error. :-)

> > In order for them to replace my drives, I've had to do "write"
> > testing, which destroys all data on the drive, so you may want to
> > disconnect power from one of the drives before you play around with
> > that.
>
> If you don't trust yourself to get the right disk for a write test then
> you need to do this.  However, if you check *EXACTLY* what you are
> doing before running a write-test, then I don't see any reason to go so
> far as to unplug the disks.  YMMV.
Well, that's true, but if you don't trust yourself to get the right
drive, then you should unplug the one that still has the data intact.
Depending on the value of the data, it may be worth unplugging it just
for safety's sake - although if it's that important, it should be backed
up.

Later,
    Greg
RE: owie, disk failure
> -----Original Message-----
> From: Jeffrey Paul [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, August 06, 2000 5:56 PM
> To: [EMAIL PROTECTED]
> Subject: owie, disk failure
>
> h, the day I had hoped would never arrive has...
>
> Aug 2 07:38:27 chrome kernel: raid1: Disk failure on hdg1, disabling device.
> Aug 2 07:38:27 chrome kernel: raid1: md0: rescheduling block 8434238
> Aug 2 07:38:27 chrome kernel: md0: no spare disk to reconstruct array! -- continuing in degraded mode
> Aug 2 07:38:27 chrome kernel: raid1: md0: redirecting sector 8434238 to another mirror
>
> my setup is a two-disk (40gb each) raid1 configuration... hde1 and
> hdg1.  I didn't have measures in place to notify me of such an event,
> so I didn't notice it until I looked at the console today and noticed
> it there...

I think I ran for about 2 weeks on a dead drive.  Thankfully it wasn't a
production system, but notification isn't quite as "out of the box" as
it needs to be just yet.

> I ran raidhotremove /dev/md0 /dev/hdg1 and then raidhotadd /dev/md0
> /dev/hdg1 and it seemed to begin reconstruction: but I got scared and
> decided to stop it... so now it's sitting idle, unmounted, spun down
> (both disks), awaiting professional advice (rather than me stumbling
> around in the dark before I hose my data).  Both disks are less than
> two weeks old, although I have heard of people having similar problems
> (disks failing in less than a month new from the factory) with this
> brand and model.  I would like to get the drives back to the way they
> were before the system decided that the disk had failed (what causes
> it to think that, anyways?) and see if it continues to work, as I find
> it hard to believe that the drive would have died so quickly.  What is
> the proper course of action?

First, do you have ANY log messages from anything other than RAID
indicating a failed disk?  Since these are IDE drives, I'd expect some
messages from the IDE subsystem if the drive really had died (my SCSI
messages went pretty wild when I had a disk fail).
To check and see if the drive is actually in good condition, grab the
diagnostic utility from the support site of your drive manufacturer,
boot from a DOS floppy, and run diagnostics on the drive.  In order for
them to replace my drives, I've had to do "write" testing, which
destroys all data on the drive, so you may want to disconnect power from
one of the drives before you play around with that.

If the disk is good, then you're all set.  If not, get it replaced.
I've seen drives fail very quickly before; it's always been a
manufacturing defect of some kind.

HTH,
    Greg
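[The replacement sequence the thread converges on - remove the failed
member, swap the drive, then hot-add the new one - can be sketched as a
dry-run script using the raidtools commands named above.  The DRY_RUN
convention and the device names are this sketch's assumptions:]

```shell
#!/bin/sh
# Dry-run sketch of the replace-and-rebuild sequence with raidtools.
# By default it only echoes the commands; set DRY_RUN= (empty) to
# really run them, as root, after the bad drive has been swapped.
MD=/dev/md0
BAD=/dev/hdg1
DRY_RUN=${DRY_RUN:-echo}

$DRY_RUN raidhotremove "$MD" "$BAD"   # drop the failed member
# ... power down, replace the drive, recreate a partition identical
#     to the surviving mirror's (here hde1) ...
$DRY_RUN raidhotadd "$MD" "$BAD"      # start reconstruction onto it
$DRY_RUN cat /proc/mdstat             # watch reconstruction progress
```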