Re: Paranoia about DegradedArray

2008-10-30 Thread Hendrik Boom
On Wed, 29 Oct 2008 17:58:56 -0400, Hal Vaughan wrote:

 On Wednesday 29 October 2008, Hendrik Boom wrote:

 I'm a bit surprised that none of the messages identifies the other
 drive, /dev/hdc3.  Is this normal?  Is that information available
 somewhere besides the sysadmin's memory?
 
 Luckily it's been at least a couple months since I worked with a
 degraded array, but I *thought* it listed the failed devices as well. It
 looks like the device has not only failed but been removed -- is there a
 chance you removed it after the failure, before running this command?

No.  I did not explicitly fail it or remove it.  There must have been 
some automatic mechanism that did.


 So presumably the thing to do is
   mdadm --fail /dev/md0 /dev/hdc3
   mdadm --remove /dev/md0 /dev/hdc3
 and then
   mdadm --add /dev/md0 /dev/hdc3
 
 I think there's a --readd that you have to use or something like that,
 but I'd try --add first and see if that works.  You might find that hdc3
 has already failed and, from the output above, it looks like it's already
 been removed.

In the docs, re-add is specified as something to use if a drive has been 
removed *recently*, and then it writes all the blocks that were to have 
been written while it was out -- a way of doing an update instead of a 
full copy.  It doesn't seem relevant in this case.

 
 Is the --fail really needed in my case?  The --detail option seems to
 have given /dev/hdc3 the status of removed (although it failed to
 mention it was /dev/hdc3).
 
 I've had trouble with removing drives if I didn't manually fail them.
 Someone who knows the inner workings of mdadm might be able to provide
 more information on that.

I wonder if /dev/hdc3 still needs to be manually failed.  I wonder if it 
is even possible to fail a removed drive...
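
I suppose that before touching anything I could ask the drive itself what 
it thinks its status is, with something along the lines of

   mdadm --examine /dev/hdc3

which should dump whatever md superblock is still sitting on that 
partition.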


 
 Yes, paranoia is a good thing in system administration.  It's kept me
 from severe problems previously!

And paranoia will make sure I have two complete backups before I actually 
do any of this fixup.
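
(Nothing fancier than an rsync of the mounted filesystem onto each backup 
disk, something like

   rsync -aHx /farhome/ /mnt/backup1/farhome/

or wherever md0 is actually mounted, before I touch the array.)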

- hendrik

 
 
 Hal






Re: Paranoia about DegradedArray

2008-10-30 Thread Hal Vaughan
On Thursday 30 October 2008, Hendrik Boom wrote:
...
  I've had trouble with removing drives if I didn't manually fail
  them. Someone who knows the inner workings of mdadm might be able
  to provide more information on that.

 I wonder if /dev/hdc3 still needs to be manually failed.  I wonder if
 it is even possible to fail a removed drive...


Try adding it.  If it works, then you're okay -- assuming the drive is 
okay.  If it doesn't work, you'll get an error message and it won't add 
it.
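
Something like this (substituting the real partition name if it isn't 
hdc3) should kick off the rebuild, and then you can watch the resync 
progress:

   mdadm --add /dev/md0 /dev/hdc3
   watch cat /proc/mdstat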

  Yes, paranoia is a good thing in system administration.  It's kept
  me from severe problems previously!

 And paranoia will make sure I have two complete backups before I
 actually do any of this fixup.

I've learned, among other things, not to trust RAID5 with mdadm.  I've 
also learned to keep a full backup elsewhere even with RAID1.  I stick 
with RAID1 so that if it blows, as long as one drive is still okay, I can 
always mount the surviving drive as a regular drive.
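
(With the old 0.90 superblock format your --detail output shows, the md 
metadata sits at the end of the partition, so in a pinch the surviving 
half of a mirror mounts directly, e.g.

   mount -o ro /dev/hda3 /mnt/rescue

with /mnt/rescue being whatever empty directory you care to point it at.)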


Hal





Re: Paranoia about DegradedArray

2008-10-30 Thread Hendrik Boom
On Thu, 30 Oct 2008 13:43:52 -0400, Hal Vaughan wrote:

 On Thursday 30 October 2008, Hendrik Boom wrote: ...
  I've had trouble with removing drives if I didn't manually fail them.
  Someone who knows the inner workings of mdadm might be able to
  provide more information on that.

 I wonder if /dev/hdc3 still needs to be manually failed.  I wonder if
 it is even possible to fail a removed drive...
 
 
 Try adding it.  If it works, then you're okay -- assuming the drive is
 okay.  If it doesn't work, you'll get an error message and it won't add
 it.

There have been occasional reboots; presumably the add failed on reboot.  
I should perhaps check the system log.
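
Assuming the relevant messages are still around, something like

   grep -i 'md0\|hdc3' /var/log/syslog
   dmesg | grep -i 'md:'

ought to show whether the kernel kicked hdc3 out at one of those boots.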

-- hendrik





Re: Paranoia about DegradedArray

2008-10-29 Thread Hal Vaughan
On Wednesday 29 October 2008, Hendrik Boom wrote:
 I got the message (via email)

 This is an automatically generated mail message from mdadm
 running on april

 A DegradedArray event had been detected on md device /dev/md0.

 Faithfully yours, etc.

 P.S. The /proc/mdstat file currently contains the following:

 Personalities : [raid1]
 md0 : active raid1 hda3[0]
   242219968 blocks [2/1] [U_]

 unused devices: <none>


You don't mention that you've checked the array with 
mdadm --detail /dev/md0.  Try that and it will give you some good 
information.

I've never used /proc/mdstat because the --detail option gives me more 
data in one shot.  From what I remember, this is a raid1, right?  It 
looks like it has 2 devices and one is still working, but I might be 
wrong. Again --detail will spell out a lot of this explicitly.

 Now I gather from what I've googled that somehow I've got to get the
 RAID to reestablish the failed drive by copying from the nonfailed
 drive. I do believe the hardware is basically OK, and that what I've
 got is probably a problem due to a power failure  (We've had a lot of
 these recently) or something transient.

 (a) How do I do this?

If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.  
If the drive has not failed, then you need to fail it first with --fail 
as an option/switch for mdadm.
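
In other words, something like this, with the real partition name in 
place of hdxx:

   mdadm --fail /dev/md0 /dev/hdxx
   mdadm --remove /dev/md0 /dev/hdxx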

 (b) is hda3 the failed drive, or is it the one that's still working?

That's one of the things mdadm --detail /dev/md0 will tell you.  It will 
list the active drives and the failed drives.

Hal





Re: Paranoia about DegradedArray

2008-10-29 Thread Hendrik Boom
On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:

 On Wednesday 29 October 2008, Hendrik Boom wrote:
 I got the message (via email)

 This is an automatically generated mail message from mdadm running on
 april

 A DegradedArray event had been detected on md device /dev/md0.

 Faithfully yours, etc.

 P.S. The /proc/mdstat file currently contains the following:

 Personalities : [raid1]
 md0 : active raid1 hda3[0]
   242219968 blocks [2/1] [U_]

 unused devices: <none>


 You don't mention that you've checked the array with mdadm --detail
 /dev/md0.  Try that and it will give you some good information.

april:/farhome/hendrik# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Feb 19 10:53:01 2006
     Raid Level : raid1
     Array Size : 242219968 (231.00 GiB 248.03 GB)
    Device Size : 242219968 (231.00 GiB 248.03 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct 29 13:23:15 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
         Events : 0.5130704

    Number   Major   Minor   RaidDevice State
       0       3        3        0      active sync   /dev/hda3
       1       0        0        1      removed
april:/farhome/hendrik# 



So from this do I conclude that /dev/hda3 is still working, but that it's 
the other drive (which isn't identified) that has trouble?

I'm a bit surprised that none of the messages identifies the other 
drive, /dev/hdc3.  Is this normal?  Is that information available 
somewhere besides the sysadmin's memory?
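
I suppose it might be recorded in /etc/mdadm/mdadm.conf, if that was ever 
filled in, or I could scan for md superblocks with something like

   mdadm --examine --scan

but I haven't checked whether either of those actually names hdc3 here.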

 
 I've never used /proc/mdstat because the --detail option gives me more
 data in one shot.  From what I remember, this is a raid1, right?  It
 looks like it has 2 devices and one is still working, but I might be
 wrong. Again --detail will spell out a lot of this explicitly.
 
 Now I gather from what I've googled that somehow I've got to get the
 RAID to reestablish the failed drive by copying from the nonfailed
 drive. I do believe the hardware is basically OK, and that what I've
 got is probably a problem due to a power failure  (We've had a lot of
 these recently) or something transient.

 (a) How do I do this?
 
 If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.
 If the drive has not failed, then you need to fail it first with --fail
 as an option/switch for mdadm.

So presumably the thing to do is 
   mdadm --fail /dev/md0 /dev/hdc3
   mdadm --remove /dev/md0 /dev/hdc3
and then
   mdadm --add /dev/md0 /dev/hdc3

Is the --fail really needed in my case?  The --detail option seems to 
have given /dev/hdc3 the status of removed (although it failed to 
mention it was /dev/hdc3).

 
 (b) is hda3 the failed drive, or is it the one that's still working?
 
 That's one of the things mdadm --detail /dev/md0 will tell you.  It will
 list the active drives and the failed drives.

Well.  I'm glad I was paranoid enough to ask.  It seems to be the drive 
that's working.  Glad I didn't try to remove and add *that* one.

Thanks,

-- hendrik





Re: Paranoia about DegradedArray

2008-10-29 Thread Hal Vaughan
On Wednesday 29 October 2008, Hendrik Boom wrote:
 On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:
  On Wednesday 29 October 2008, Hendrik Boom wrote:
  I got the message (via email)
 
  This is an automatically generated mail message from mdadm running
  on april
 
  A DegradedArray event had been detected on md device /dev/md0.
 
  Faithfully yours, etc.
 
  P.S. The /proc/mdstat file currently contains the following:
 
  Personalities : [raid1]
  md0 : active raid1 hda3[0]
242219968 blocks [2/1] [U_]
 
  unused devices: <none>
 
  You don't mention that you've checked the array with mdadm --detail
  /dev/md0.  Try that and it will give you some good information.

 april:/farhome/hendrik# mdadm --detail /dev/md0
 /dev/md0:
         Version : 00.90.03
   Creation Time : Sun Feb 19 10:53:01 2006
      Raid Level : raid1
      Array Size : 242219968 (231.00 GiB 248.03 GB)
     Device Size : 242219968 (231.00 GiB 248.03 GB)
    Raid Devices : 2
   Total Devices : 1
 Preferred Minor : 0
     Persistence : Superblock is persistent

     Update Time : Wed Oct 29 13:23:15 2008
           State : clean, degraded
  Active Devices : 1
 Working Devices : 1
  Failed Devices : 0
   Spare Devices : 0

            UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
          Events : 0.5130704

     Number   Major   Minor   RaidDevice State
        0       3        3        0      active sync   /dev/hda3
        1       0        0        1      removed
 april:/farhome/hendrik#



 So from this do I conclude that /dev/hda3 is still working, but that
 it's the other drive (which isn't identified) that has trouble?

 I'm a bit surprised that none of the messages identifies the other
 drive, /dev/hdc3.  Is this normal?  Is that information available
 somewhere besides the sysadmin's memory?

Luckily it's been at least a couple months since I worked with a 
degraded array, but I *thought* it listed the failed devices as well.  
It looks like the device has not only failed but been removed -- is 
there a chance you removed it after the failure, before running this 
command?


  I've never used /proc/mdstat because the --detail option gives me
  more data in one shot.  From what I remember, this is a raid1,
  right?  It looks like it has 2 devices and one is still working,
  but I might be wrong. Again --detail will spell out a lot of this
  explicitly.
 
  Now I gather from what I've googled that somehow I've got to get
  the RAID to reestablish the failed drive by copying from the
  nonfailed drive. I do believe the hardware is basically OK, and
  that what I've got is probably a problem due to a power failure 
  (We've had a lot of these recently) or something transient.
 
  (a) How do I do this?
 
  If a drive has actually failed, then mdadm --remove /dev/md0
  /dev/hdxx. If the drive has not failed, then you need to fail it
  first with --fail as an option/switch for mdadm.

 So presumably the thing to do is
    mdadm --fail /dev/md0 /dev/hdc3
    mdadm --remove /dev/md0 /dev/hdc3
 and then
    mdadm --add /dev/md0 /dev/hdc3

I think there's a --readd that you have to use or something like that, 
but I'd try --add first and see if that works.  You might find that 
hdc3 has already failed and, from the output above, it looks like it's 
already been removed.

 Is the --fail really needed in my case?  The --detail option seems to
 have given /dev/hdc3 the status of removed (although it failed to
 mention it was /dev/hdc3).

I've had trouble with removing drives if I didn't manually fail them.  
Someone who knows the inner workings of mdadm might be able to provide 
more information on that.

  (b) is hda3 the failed drive, or is it the one that's still
  working?
 
  That's one of the things mdadm --detail /dev/md0 will tell you.  It
  will list the active drives and the failed drives.

 Well.  I'm glad I was paranoid enough to ask.  It seems to be the
 drive that's working.  Glad I didn't try to remove and add *that*
 one.

Yes, paranoia is a good thing in system administration.  It's kept me 
from severe problems previously!


Hal

