Re: Paranoia about DegradedArray
On Wed, 29 Oct 2008 17:58:56 -0400, Hal Vaughan wrote:

> On Wednesday 29 October 2008, Hendrik Boom wrote:
>> I'm a bit surprised that none of the messages identifies the other
>> drive, /dev/hdc3. Is this normal? Is that information available
>> somewhere besides the sysadmin's memory?
>
> Luckily it's been at least a couple months since I worked with a
> degraded array, but I *thought* it listed the failed devices as well.
> It looks like the device has not only failed but been removed -- is
> there a chance you removed it after the failure, before running this
> command?

No. I did not explicitly fail it or remove it. There must have been some
automatic mechanism that did.

>> So presumably the thing to do is
>>   mdadm --fail /dev/md0 /dev/hdc3
>>   mdadm --remove /dev/md0 /dev/hdc3
>> and then
>>   mdadm --add /dev/md0 /dev/hdc3
>
> I think there's a --re-add that you have to use or something like that,
> but I'd try --add first and see if that works. You might find that hdc3
> has already failed and, from the output above, it looks like it's
> already been removed.

In the docs, --re-add is specified as something to use if a drive has
been removed *recently*; it then writes only the blocks that would have
been written while the drive was out -- a way of doing an update instead
of a full copy. It doesn't seem relevant in this case.

>> Is the --fail really needed in my case? The --detail option seems to
>> have given /dev/hdc3 the status of removed (although it failed to
>> mention it was /dev/hdc3).
>
> I've had trouble with removing drives if I didn't manually fail them.
> Someone who knows the inner workings of mdadm might be able to provide
> more information on that.

I wonder if /dev/hdc3 still needs to be manually failed. I wonder if it
is even possible to fail a removed drive...

> Yes, paranoia is a good thing in system administration. It's kept me
> from severe problems previously!

And paranoia will make sure I have two complete backups before I actually
do any of this fixup.

-- hendrik
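Before touching the array at all, it can also be worth confirming that the
disk behind the removed member is actually healthy. A minimal sketch,
assuming smartmontools is installed (not mentioned in the thread) and that
the missing member really is /dev/hdc3 on drive /dev/hdc:

  # Drive-side health check on the suspect disk (smartmontools assumed)
  smartctl -H /dev/hdc          # overall SMART health verdict
  smartctl -l error /dev/hdc    # recent drive-reported errors, if any

  # Inspect the md superblock still present on the removed partition
  mdadm --examine /dev/hdc3

If --examine still shows a sane superblock with the array's UUID and the
SMART checks pass, a transient cause (such as a power failure) looks more
plausible than a dead drive.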
Re: Paranoia about DegradedArray
On Thursday 30 October 2008, Hendrik Boom wrote:
> ...
>> I've had trouble with removing drives if I didn't manually fail them.
>> Someone who knows the inner workings of mdadm might be able to provide
>> more information on that.
>
> I wonder if /dev/hdc3 still needs to be manually failed. I wonder if it
> is even possible to fail a removed drive...

Try adding it. If it works, then you're okay -- assuming the drive is
okay. If it doesn't work, you'll get an error message and it won't add it.

>> Yes, paranoia is a good thing in system administration. It's kept me
>> from severe problems previously!
>
> And paranoia will make sure I have two complete backups before I
> actually do any of this fixup.

I've learned, among other things, not to trust RAID5 with mdadm. I've
also learned to keep a full backup elsewhere even with RAID1. I stick
with RAID1 so that if it blows, as long as one drive is still okay, I can
always remount it as a regular drive.

Hal
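Remounting a RAID1 member as a regular drive works here because the old
0.90 metadata (as shown by --detail elsewhere in this thread) keeps the md
superblock at the end of the partition, so the filesystem starts at the
usual offset. A rough sketch, assuming the surviving half is /dev/hda3 and
using a hypothetical /mnt/rescue mount point; only do this while the array
is not running, and read-only so the two halves don't diverge further:

  # Only if md0 will no longer assemble: mount the surviving member
  # directly (0.90 metadata, superblock at the end of the partition)
  mdadm --stop /dev/md0        # make sure md isn't holding the partition
  mkdir -p /mnt/rescue
  mount -o ro /dev/hda3 /mnt/rescue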
Re: Paranoia about DegradedArray
On Thu, 30 Oct 2008 13:43:52 -0400, Hal Vaughan wrote:

> On Thursday 30 October 2008, Hendrik Boom wrote:
>> ...
>>> I've had trouble with removing drives if I didn't manually fail them.
>>> Someone who knows the inner workings of mdadm might be able to
>>> provide more information on that.
>>
>> I wonder if /dev/hdc3 still needs to be manually failed. I wonder if
>> it is even possible to fail a removed drive...
>
> Try adding it. If it works, then you're okay -- assuming the drive is
> okay. If it doesn't work, you'll get an error message and it won't add
> it.

There have been occasional reboots; presumably the add failed on reboot.
I should perhaps check the system log.

-- hendrik
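The kernel's md driver does log events such as a member being kicked out
of an array, so the history should be recoverable. A minimal way to look,
assuming a standard Debian syslog setup (the exact log file names are an
assumption):

  # Look for md/raid messages around the reboots (log paths assumed)
  grep -i -E 'md0|raid1|hdc' /var/log/syslog /var/log/kern.log

  # Or, for events since the most recent boot only
  dmesg | grep -i -E 'md0|raid1|hdc'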
Re: Paranoia about DegradedArray
On Wednesday 29 October 2008, Hendrik Boom wrote:
> I got the message (via email)
>
>   This is an automatically generated mail message from mdadm running
>   on april
>
>   A DegradedArray event had been detected on md device /dev/md0.
>
>   Faithfully yours, etc.
>
>   P.S. The /proc/mdstat file currently contains the following:
>
>   Personalities : [raid1]
>   md0 : active raid1 hda3[0]
>         242219968 blocks [2/1] [U_]
>
>   unused devices: <none>

You don't mention that you've checked the array with mdadm --detail
/dev/md0. Try that and it will give you some good information. I've never
used /proc/mdstat because the --detail option gives me more data in one
shot.

From what I remember, this is a raid1, right? It looks like it has 2
devices and one is still working, but I might be wrong. Again, --detail
will spell out a lot of this explicitly.

> Now I gather from what I've googled that somehow I've got to get the
> RAID to reestablish the failed drive by copying from the nonfailed
> drive. I do believe the hardware is basically OK, and that what I've
> got is probably a problem due to a power failure (we've had a lot of
> these recently) or something transient.
>
> (a) How do I do this?

If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.
If the drive has not failed, then you need to fail it first with --fail
as an option/switch for mdadm.

> (b) is hda3 the failed drive, or is it the one that's still working?

That's one of the things mdadm --detail /dev/md0 will tell you. It will
list the active drives and the failed drives.

Hal
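In command form, that advice looks roughly like this. A sketch only, using
/dev/hdc3 as the second member (as it is identified later in the thread);
check the --detail output first, since these commands change array state:

  mdadm --detail /dev/md0              # which member is active, faulty, or removed?
  mdadm --fail   /dev/md0 /dev/hdc3    # only if it is not already marked faulty
  mdadm --remove /dev/md0 /dev/hdc3    # drop the faulty member from the array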
Re: Paranoia about DegradedArray
On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:

> On Wednesday 29 October 2008, Hendrik Boom wrote:
>> I got the message (via email)
>>
>>   This is an automatically generated mail message from mdadm running
>>   on april
>>
>>   A DegradedArray event had been detected on md device /dev/md0.
>>
>>   Faithfully yours, etc.
>>
>>   P.S. The /proc/mdstat file currently contains the following:
>>
>>   Personalities : [raid1]
>>   md0 : active raid1 hda3[0]
>>         242219968 blocks [2/1] [U_]
>>
>>   unused devices: <none>
>
> You don't mention that you've checked the array with mdadm --detail
> /dev/md0. Try that and it will give you some good information.

april:/farhome/hendrik# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Feb 19 10:53:01 2006
     Raid Level : raid1
     Array Size : 242219968 (231.00 GiB 248.03 GB)
    Device Size : 242219968 (231.00 GiB 248.03 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Oct 29 13:23:15 2008
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
         Events : 0.5130704

    Number   Major   Minor   RaidDevice State
       0       3        3        0      active sync   /dev/hda3
       1       0        0        1      removed
april:/farhome/hendrik#

So from this do I conclude that /dev/hda3 is still working, but that it's
the other drive (which isn't identified) that has trouble?

I'm a bit surprised that none of the messages identifies the other drive,
/dev/hdc3. Is this normal? Is that information available somewhere
besides the sysadmin's memory?

> I've never used /proc/mdstat because the --detail option gives me more
> data in one shot.
>
> From what I remember, this is a raid1, right? It looks like it has 2
> devices and one is still working, but I might be wrong. Again,
> --detail will spell out a lot of this explicitly.
>
>> Now I gather from what I've googled that somehow I've got to get the
>> RAID to reestablish the failed drive by copying from the nonfailed
>> drive. I do believe the hardware is basically OK, and that what I've
>> got is probably a problem due to a power failure (we've had a lot of
>> these recently) or something transient.
>>
>> (a) How do I do this?
>
> If a drive has actually failed, then mdadm --remove /dev/md0 /dev/hdxx.
> If the drive has not failed, then you need to fail it first with --fail
> as an option/switch for mdadm.

So presumably the thing to do is
  mdadm --fail /dev/md0 /dev/hdc3
  mdadm --remove /dev/md0 /dev/hdc3
and then
  mdadm --add /dev/md0 /dev/hdc3

Is the --fail really needed in my case? The --detail option seems to have
given /dev/hdc3 the status of removed (although it failed to mention it
was /dev/hdc3).

>> (b) is hda3 the failed drive, or is it the one that's still working?
>
> That's one of the things mdadm --detail /dev/md0 will tell you. It will
> list the active drives and the failed drives.

Well. I'm glad I was paranoid enough to ask. It seems to be the drive
that's working. Glad I didn't try to remove and add in *that* one.

Thanks,
-- hendrik
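Since /dev/hdc3 already shows as removed in the --detail output, the fix
the thread converges on is simply to add it back and let md rebuild it
from the surviving half. A sketch, assuming /dev/hdc3 really is the
missing member and the hardware is sound (--re-add, per the docs discussed
above, only helps for a very recently removed member; plain --add does a
full resync):

  # Add the removed member back; md copies everything from /dev/hda3
  mdadm --add /dev/md0 /dev/hdc3

  # Watch the rebuild until the array shows [UU] again
  cat /proc/mdstat             # run repeatedly, or under watch(1)
  mdadm --detail /dev/md0      # State / Rebuild Status lines show progress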
Re: Paranoia about DegradedArray
On Wednesday 29 October 2008, Hendrik Boom wrote:
> On Wed, 29 Oct 2008 13:00:25 -0400, Hal Vaughan wrote:
>> On Wednesday 29 October 2008, Hendrik Boom wrote:
>>> I got the message (via email)
>>>
>>>   This is an automatically generated mail message from mdadm running
>>>   on april
>>>
>>>   A DegradedArray event had been detected on md device /dev/md0.
>>>
>>>   Faithfully yours, etc.
>>>
>>>   P.S. The /proc/mdstat file currently contains the following:
>>>
>>>   Personalities : [raid1]
>>>   md0 : active raid1 hda3[0]
>>>         242219968 blocks [2/1] [U_]
>>>
>>>   unused devices: <none>
>>
>> You don't mention that you've checked the array with mdadm --detail
>> /dev/md0. Try that and it will give you some good information.
>
> april:/farhome/hendrik# mdadm --detail /dev/md0
> /dev/md0:
>         Version : 00.90.03
>   Creation Time : Sun Feb 19 10:53:01 2006
>      Raid Level : raid1
>      Array Size : 242219968 (231.00 GiB 248.03 GB)
>     Device Size : 242219968 (231.00 GiB 248.03 GB)
>    Raid Devices : 2
>   Total Devices : 1
> Preferred Minor : 0
>     Persistence : Superblock is persistent
>
>     Update Time : Wed Oct 29 13:23:15 2008
>           State : clean, degraded
>  Active Devices : 1
> Working Devices : 1
>  Failed Devices : 0
>   Spare Devices : 0
>
>            UUID : 4dc189ba:e7a12d38:e6262cdf:db1beda2
>          Events : 0.5130704
>
>     Number   Major   Minor   RaidDevice State
>        0       3        3        0      active sync   /dev/hda3
>        1       0        0        1      removed
> april:/farhome/hendrik#
>
> So from this do I conclude that /dev/hda3 is still working, but that
> it's the other drive (which isn't identified) that has trouble?
>
> I'm a bit surprised that none of the messages identifies the other
> drive, /dev/hdc3. Is this normal? Is that information available
> somewhere besides the sysadmin's memory?

Luckily it's been at least a couple months since I worked with a degraded
array, but I *thought* it listed the failed devices as well. It looks
like the device has not only failed but been removed -- is there a chance
you removed it after the failure, before running this command?

>> I've never used /proc/mdstat because the --detail option gives me more
>> data in one shot.
>>
>> From what I remember, this is a raid1, right? It looks like it has 2
>> devices and one is still working, but I might be wrong. Again,
>> --detail will spell out a lot of this explicitly.
>>
>>> Now I gather from what I've googled that somehow I've got to get the
>>> RAID to reestablish the failed drive by copying from the nonfailed
>>> drive. I do believe the hardware is basically OK, and that what I've
>>> got is probably a problem due to a power failure (we've had a lot of
>>> these recently) or something transient.
>>>
>>> (a) How do I do this?
>>
>> If a drive has actually failed, then mdadm --remove /dev/md0
>> /dev/hdxx. If the drive has not failed, then you need to fail it first
>> with --fail as an option/switch for mdadm.
>
> So presumably the thing to do is
>   mdadm --fail /dev/md0 /dev/hdc3
>   mdadm --remove /dev/md0 /dev/hdc3
> and then
>   mdadm --add /dev/md0 /dev/hdc3

I think there's a --re-add that you have to use or something like that,
but I'd try --add first and see if that works. You might find that hdc3
has already failed and, from the output above, it looks like it's already
been removed.

> Is the --fail really needed in my case? The --detail option seems to
> have given /dev/hdc3 the status of removed (although it failed to
> mention it was /dev/hdc3).

I've had trouble with removing drives if I didn't manually fail them.
Someone who knows the inner workings of mdadm might be able to provide
more information on that.

>>> (b) is hda3 the failed drive, or is it the one that's still working?
>>
>> That's one of the things mdadm --detail /dev/md0 will tell you. It
>> will list the active drives and the failed drives.
>
> Well. I'm glad I was paranoid enough to ask. It seems to be the drive
> that's working. Glad I didn't try to remove and add in *that* one.

Yes, paranoia is a good thing in system administration. It's kept me from
severe problems previously!

Hal