[CentOS] more software raid questions

2010-10-19 Thread fred smith
hi all!

back in Aug several of you assisted me in solving a problem where one
of my drives had dropped out of (or been kicked out of) the raid1 array.

something vaguely similar appears to have happened just a few mins ago,
upon rebooting after a small update. I received four emails like this,
one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
/dev/md126:

Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23

This is an automatically generated mail message from mdadm
running on fcshome.stoneham.ma.us

A DegradedArray event had been detected on md device /dev/md125.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1] 
md0 : active raid1 sda1[0]
  104320 blocks [2/1] [U_]
  
md126 : active raid1 sdb1[1]
  104320 blocks [2/1] [_U]
  
md125 : active raid1 sdb2[1]
  312464128 blocks [2/1] [_U]
  
md1 : active raid1 sda2[0]
  312464128 blocks [2/1] [U_]
  
unused devices: <none>

firstly, what the heck are md125 and md126? previously there were only md0 and md1.

secondly, I'm not sure what it's trying to tell me. it says there was a
"DegradedArray event" but at the bottom it says there are no unused devices.

there are also some messages in /var/log/messages from the time of the
boot earlier today, but they do NOT say anything about "kicking out"
any of the md member devices (as they did in the event back in August):

Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1
Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.

and here's /etc/mdadm.conf:

# cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12

which doesn't say anything about md125 or md126 ... might they be some kind of
detritus or fragments left over from whatever kind of failure caused the array
to become degraded?

do ya suppose a boot from power-off might somehow give it a whack upside the
head so it'll reassemble itself according to mdadm.conf?

I'm not sure which devices need to be failed and re-added to make it clean
again (which is all I had to do when I had the aforementioned earlier problem.)

Thanks in advance for any advice!

Fred

-- 
 Fred Smith -- fre...@fcshome.stoneham.ma.us -

Re: [CentOS] more software raid questions

2010-10-19 Thread Rob Kampen

fred smith wrote:

hi all!

back in Aug several of you assisted me in solving a problem where one
of my drives had dropped out of (or been kicked out of) the raid1 array.

something vaguely similar appears to have happened just a few mins ago,
upon rebooting after a small update. I received four emails like this,
one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
/dev/md126:

Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
X-Spambayes-Classification: unsure; 0.24
Status: RO
Content-Length: 564
Lines: 23

This is an automatically generated mail message from mdadm
running on fcshome.stoneham.ma.us

A DegradedArray event had been detected on md device /dev/md125.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]

md126 : active raid1 sdb1[1]
      104320 blocks [2/1] [_U]

md125 : active raid1 sdb2[1]
      312464128 blocks [2/1] [_U]

md1 : active raid1 sda2[0]
      312464128 blocks [2/1] [U_]

unused devices: <none>


firstly, what the heck are md125 and md126? previously there was
only md0 and md1 

secondly, I'm not sure what it's trying to tell me. it says there was a 
"degradedarray event" but at the bottom it says there are no unused devices.


there are also some messages in /var/log/messages from the time of the
boot earlier today, but they do NOT say anything about "kicking out"
any of the md member devices (as they did in the event back in August):

Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized v0.2594l
Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
Oct 19 18:29:41 fcshome kernel: md: autorun ...
Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different superblock to sdb2

This appears to be the cause

Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
Oct 19 18:29:41 fcshome kernel: md: created md125
  
this was auto-created - I've not experienced this myself, and I run half a
dozen of these on different machines.

Oct 19 18:29:41 fcshome kernel: md: bind<sdb2>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors

now it has assembled it separately

Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different superblock to sdb1

and now for the second one

Oct 19 18:29:41 fcshome kernel: md: created md126
Oct 19 18:29:41 fcshome kernel: md: bind<sdb1>
Oct 19 18:29:41 fcshome kernel: md: running: <sdb1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
Oct 19 18:29:41 fcshome kernel: md: created md1
Oct 19 18:29:41 fcshome kernel: md: bind<sda2>
Oct 19 18:29:41 fcshome kernel: md: running: <sda2>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
Oct 19 18:29:41 fcshome kernel: md: created md0
Oct 19 18:29:41 fcshome kernel: md: bind<sda1>
Oct 19 18:29:41 fcshome kernel: md: running: <sda1>
Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out of 2 mirrors
Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.

and here's /etc/mdadm.conf:

# cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR fredex
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=4eb13e45:b5228982:f03cd503:f935bd69
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12

which doesn't say anything about md125 or md126 ... might they be some kind of
detritus or fragments left over from whatever kind of failure caused the array
to become degraded?

  
now you need to decide (by looking at each device - you may need to mount it
first) which is the correct master.
remove the other one and add it back to the original array - it will
then rebuild.
If these are SATA drives, also check the cables - I have one machine where
they work loose.
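
For what it's worth, a rough way to see which half of each mirror holds the
newer data (just a sketch, assuming the sda*/sdb* members shown in the
/proc/mdstat above) is to compare the update times and event counts recorded
in the md superblocks:

    mdadm --examine /dev/sda2 | egrep 'Update Time|Events'
    mdadm --examine /dev/sdb2 | egrep 'Update Time|Events'
    # repeat for sda1/sdb1; the member with the later Update Time and the
    # higher Events count is normally the one carrying the current data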

Re: [CentOS] more software raid questions

2010-10-19 Thread Tom H
On Tue, Oct 19, 2010 at 7:59 PM, fred smith
 wrote:
>
> back in Aug several of you assisted me in solving a problem where one
> of my drives had dropped out of (or been kicked out of) the raid1 array.
>
> something vaguely similar appears to have happened just a few mins ago,
> upon rebooting after a small update. I received four emails like this,
> one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
> /dev/md126:
>
>        Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
>        X-Spambayes-Classification: unsure; 0.24
>        Status: RO
>        Content-Length: 564
>        Lines: 23
>
>        This is an automatically generated mail message from mdadm
>        running on fcshome.stoneham.ma.us
>
>        A DegradedArray event had been detected on md device /dev/md125.
>
>        Faithfully yours, etc.
>
>        P.S. The /proc/mdstat file currently contains the following:
>
>        Personalities : [raid1]
>        md0 : active raid1 sda1[0]
>              104320 blocks [2/1] [U_]
>
>        md126 : active raid1 sdb1[1]
>              104320 blocks [2/1] [_U]
>
>        md125 : active raid1 sdb2[1]
>              312464128 blocks [2/1] [_U]
>
>        md1 : active raid1 sda2[0]
>              312464128 blocks [2/1] [U_]
>
>        unused devices: 
>
> firstly, what the heck are md125 and md126? previously there was
> only md0 and md1 
>
> secondly, I'm not sure what it's trying to tell me. it says there was a
> "degradedarray event" but at the bottom it says there are no unused devices.
>
> there are also some messages in /var/log/messages from the time of the
> boot earlier today, but they do NOT say anything about "kicking out"
> any of the md member devices (as they did in the event back in August):
>
>        Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized 
> v0.2594l
>        Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
>        Oct 19 18:29:41 fcshome kernel: md: autorun ...
>        Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
>        Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
>        Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
>        Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different 
> superblock
>        to sdb2
>        Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
>        Oct 19 18:29:41 fcshome kernel: md: created md125
>        Oct 19 18:29:41 fcshome kernel: md: bind
>        Oct 19 18:29:41 fcshome kernel: md: running: 
>        Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
>        Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
>        Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
>        Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
>        Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different 
> superblock
>        to sdb1
>        Oct 19 18:29:41 fcshome kernel: md: created md126
>        Oct 19 18:29:41 fcshome kernel: md: bind
>        Oct 19 18:29:41 fcshome kernel: md: running: 
>        Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 
> out of 2 mirrors
>        Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
>        Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
>        Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
>        Oct 19 18:29:41 fcshome kernel: md: created md1
>        Oct 19 18:29:41 fcshome kernel: md: bind
>        Oct 19 18:29:41 fcshome kernel: md: running: 
>        Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out 
> of 2 mirrors
>        Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
>        Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
>        Oct 19 18:29:41 fcshome kernel: md: created md0
>        Oct 19 18:29:41 fcshome kernel: md: bind
>        Oct 19 18:29:41 fcshome kernel: md: running: 
>        Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out 
> of 2 mirrors
>        Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
>
> and here's /etc/mdadm.conf:
>
>        # cat /etc/mdadm.conf
>
>        # mdadm.conf written out by anaconda
>        DEVICE partitions
>        MAILADDR fredex
>        ARRAY /dev/md0 level=raid1 num-devices=2 
> uuid=4eb13e45:b5228982:f03cd503:f935bd69
>        ARRAY /dev/md1 level=raid1 num-devices=2 
> uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
>
> which doesn't say anything about md125 or md126,... might they be some kind 
> of detritus
> or fragments left over from whatever kind of failure caused the array to 
> become degraded?

The superblocks in sdb1 and sdb2 are different from the superblocks in
sda1 and sda2, so mdadm assembled sdb1 and sdb2 into different arrays.
I'd have expected them to be md126 and md127 rather than md125 and md126,
but that's normal.

Your problem is that all four arrays are degraded.

Which ones are mounted? Assuming
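
To check which of the degraded arrays actually hold mounted filesystems or
swap, something like this should do (a quick sketch, assuming filesystems and
swap sit directly on the md devices):

    cat /proc/mdstat
    grep md /proc/mounts
    grep md /proc/swaps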

Re: [CentOS] more software raid questions

2010-10-19 Thread Nataraj
fred smith wrote:
> hi all!
>
> back in Aug several of you assisted me in solving a problem where one
> of my drives had dropped out of (or been kicked out of) the raid1 array.
>
> something vaguely similar appears to have happened just a few mins ago,
> upon rebooting after a small update. I received four emails like this,
> one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
> /dev/md126:
>
>   Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
>   X-Spambayes-Classification: unsure; 0.24
>   Status: RO
>   Content-Length: 564
>   Lines: 23
>
>   This is an automatically generated mail message from mdadm
>   running on fcshome.stoneham.ma.us
>
>   A DegradedArray event had been detected on md device /dev/md125.
>
>   Faithfully yours, etc.
>
>   P.S. The /proc/mdstat file currently contains the following:
>
>   Personalities : [raid1] 
>   md0 : active raid1 sda1[0]
> 104320 blocks [2/1] [U_]
> 
>   md126 : active raid1 sdb1[1]
> 104320 blocks [2/1] [_U]
> 
>   md125 : active raid1 sdb2[1]
> 312464128 blocks [2/1] [_U]
> 
>   md1 : active raid1 sda2[0]
> 312464128 blocks [2/1] [U_]
> 
>   unused devices: 
>
> firstly, what the heck are md125 and md126? previously there was
> only md0 and md1 
>
> secondly, I'm not sure what it's trying to tell me. it says there was a 
> "degradedarray event" but at the bottom it says there are no unused devices.
>
> there are also some messages in /var/log/messages from the time of the
> boot earlier today, but they do NOT say anything about "kicking out"
> any of the md member devices (as they did in the event back in August):
>
>   Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized 
> v0.2594l
>   Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
>   Oct 19 18:29:41 fcshome kernel: md: autorun ...
>   Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
>   Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
>   Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
>   Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different 
> superblock 
>   to sdb2
>   Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
>   Oct 19 18:29:41 fcshome kernel: md: created md125
>   Oct 19 18:29:41 fcshome kernel: md: bind
>   Oct 19 18:29:41 fcshome kernel: md: running: 
>   Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
>   Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
>   Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
>   Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
>   Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different 
> superblock 
>   to sdb1
>   Oct 19 18:29:41 fcshome kernel: md: created md126
>   Oct 19 18:29:41 fcshome kernel: md: bind
>   Oct 19 18:29:41 fcshome kernel: md: running: 
>   Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out 
> of 2 mirrors
>   Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
>   Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
>   Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
>   Oct 19 18:29:41 fcshome kernel: md: created md1
>   Oct 19 18:29:41 fcshome kernel: md: bind
>   Oct 19 18:29:41 fcshome kernel: md: running: 
>   Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out 
> of 2 mirrors
>   Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
>   Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
>   Oct 19 18:29:41 fcshome kernel: md: created md0
>   Oct 19 18:29:41 fcshome kernel: md: bind
>   Oct 19 18:29:41 fcshome kernel: md: running: 
>   Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out 
> of 2 mirrors
>   Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
>
> and here's /etc/mdadm.conf:
>
>   # cat /etc/mdadm.conf
>
>   # mdadm.conf written out by anaconda
>   DEVICE partitions
>   MAILADDR fredex
>   ARRAY /dev/md0 level=raid1 num-devices=2 
> uuid=4eb13e45:b5228982:f03cd503:f935bd69
>   ARRAY /dev/md1 level=raid1 num-devices=2 
> uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
>
> which doesn't say anything about md125 or md126,... might they be some kind 
> of detritus
> or fragments left over from whatever kind of failure caused the array to 
> become degraded?
>
> do ya suppose a boot from power-off might somehow give it a whack upside the 
> head so
> it'll reassemble itself according to mdadm.conf?
>
> I'm not sure which devices need to be failed and re-added to make it clean 
> again (which
> is all I had to do when I had the aforementioned earlier problem.)

Re: [CentOS] more software raid questions

2010-10-19 Thread Nataraj
Nataraj wrote:
>
> I've seen this kind of thing happen when the autodetection stuff 
> misbehaves. I'm not sure why it does this or how to prevent it. Anyway, 
> to recover, I would use something like:
>
> mdadm --stop /dev/md125
> mdadm --stop /dev/md126
>
> If for some reason the above commands fail, check and make sure it has 
> not automounted the file systems from md125 and md126. Hopefully this 
> won't happen.
>
> Then use:
> mdadm /dev/md0 -a /dev/sdXX
> To add back the drive which belongs in md0, and similar for md1. In 
> general, it won't let you add the wrong drive, but if you want to check use:
> mdadm --examine /dev/sda1 | grep UUID
> and so forth for all your drives and find the ones with the same UUID.
>
> When I create my Raid arrays, I always use the option --bitmap=internal. 
> With this option set, a bitmap is used to keep track of which pages on 
> the drive are out of date and then you only resync pages which need 
> updating instead of recopying the whole drive when this happens. In the 
> past I once added a bitmap to an existing raid1 array using something 
> like this. This may not be the exact command, but I know it can be done:
> mdadm /dev/mdN --bitmap=internal
>
> Adding the bitmap is very worthwhile and saves time and risk of data 
> loss by not having to recopy the whole partition.
>
> Nataraj
> ___
> CentOS mailing list
> CentOS@centos.org
> http://lists.centos.org/mailman/listinfo/centos
>   
mdadm /dev/mdN --assemble --force
could also be useful, though I would be careful here. 
To use this, you would have to stop all of the arrays and then 
reassemble.  You could also specify the specific drives.
If you don't have a backup, you might want to back up the single drives
that are properly mounted from md0 and md1.  Data loss is always a
possibility with these types of manipulations, though I have successfully
recovered from things like this without losing any data.  In fact I pull
drives out of a raid array and add new drives in daily to sync them and
send the second drive off site as a backup.
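
A minimal sketch of that stop-and-reassemble path, assuming the same layout
as in the original /proc/mdstat (sda2 and sdb2 belong together in md1); note
that nothing on the affected arrays can be mounted, so for the root/boot
arrays this generally means working from rescue media:

    mdadm --stop /dev/md125
    mdadm --stop /dev/md1
    # --force tells mdadm to assemble even though the event counts differ
    mdadm --assemble --force /dev/md1 /dev/sda2 /dev/sdb2
    cat /proc/mdstat   # check whether a resync started, or whether the
                       # stale half still has to be re-added by hand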

Nataraj

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-20 Thread fred smith
On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
> fred smith wrote:
> > hi all!
> >
> > back in Aug several of you assisted me in solving a problem where one
> > of my drives had dropped out of (or been kicked out of) the raid1 array.
> >
> > something vaguely similar appears to have happened just a few mins ago,
> > upon rebooting after a small update. I received four emails like this,
> > one for /dev/md0, one for /dev/md1, one for /dev/md125 and one for
> > /dev/md126:
> >
> > Subject: DegradedArray event on /dev/md125:fcshome.stoneham.ma.us
> > X-Spambayes-Classification: unsure; 0.24
> > Status: RO
> > Content-Length: 564
> > Lines: 23
> >
> > This is an automatically generated mail message from mdadm
> > running on fcshome.stoneham.ma.us
> >
> > A DegradedArray event had been detected on md device /dev/md125.
> >
> > Faithfully yours, etc.
> >
> > P.S. The /proc/mdstat file currently contains the following:
> >
> > Personalities : [raid1] 
> > md0 : active raid1 sda1[0]
> >   104320 blocks [2/1] [U_]
> >   
> > md126 : active raid1 sdb1[1]
> >   104320 blocks [2/1] [_U]
> >   
> > md125 : active raid1 sdb2[1]
> >   312464128 blocks [2/1] [_U]
> >   
> > md1 : active raid1 sda2[0]
> >   312464128 blocks [2/1] [U_]
> >   
> > unused devices: 
> >
> > firstly, what the heck are md125 and md126? previously there was
> > only md0 and md1 
> >
> > secondly, I'm not sure what it's trying to tell me. it says there was a 
> > "degradedarray event" but at the bottom it says there are no unused devices.
> >
> > there are also some messages in /var/log/messages from the time of the
> > boot earlier today, but they do NOT say anything about "kicking out"
> > any of the md member devices (as they did in the event back in August):
> >
> > Oct 19 18:29:41 fcshome kernel: device-mapper: dm-raid45: initialized 
> > v0.2594l
> > Oct 19 18:29:41 fcshome kernel: md: Autodetecting RAID arrays.
> > Oct 19 18:29:41 fcshome kernel: md: autorun ...
> > Oct 19 18:29:41 fcshome kernel: md: considering sdb2 ...
> > Oct 19 18:29:41 fcshome kernel: md:  adding sdb2 ...
> > Oct 19 18:29:41 fcshome kernel: md: sdb1 has different UUID to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: sda2 has same UUID but different 
> > superblock 
> > to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sdb2
> > Oct 19 18:29:41 fcshome kernel: md: created md125
> > Oct 19 18:29:41 fcshome kernel: md: bind
> > Oct 19 18:29:41 fcshome kernel: md: running: 
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md125 active with 1 out of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sdb1 ...
> > Oct 19 18:29:41 fcshome kernel: md:  adding sdb1 ...
> > Oct 19 18:29:41 fcshome kernel: md: sda2 has different UUID to sdb1
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has same UUID but different 
> > superblock 
> > to sdb1
> > Oct 19 18:29:41 fcshome kernel: md: created md126
> > Oct 19 18:29:41 fcshome kernel: md: bind
> > Oct 19 18:29:41 fcshome kernel: md: running: 
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md126 active with 1 out 
> > of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sda2 ...
> > Oct 19 18:29:41 fcshome kernel: md:  adding sda2 ...
> > Oct 19 18:29:41 fcshome kernel: md: sda1 has different UUID to sda2
> > Oct 19 18:29:41 fcshome kernel: md: created md1
> > Oct 19 18:29:41 fcshome kernel: md: bind
> > Oct 19 18:29:41 fcshome kernel: md: running: 
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md1 active with 1 out 
> > of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: considering sda1 ...
> > Oct 19 18:29:41 fcshome kernel: md:  adding sda1 ...
> > Oct 19 18:29:41 fcshome kernel: md: created md0
> > Oct 19 18:29:41 fcshome kernel: md: bind
> > Oct 19 18:29:41 fcshome kernel: md: running: 
> > Oct 19 18:29:41 fcshome kernel: raid1: raid set md0 active with 1 out 
> > of 2 mirrors
> > Oct 19 18:29:41 fcshome kernel: md: ... autorun DONE.
> >
> > and here's /etc/mdadm.conf:
> >
> > # cat /etc/mdadm.conf
> >
> > # mdadm.conf written out by anaconda
> > DEVICE partitions
> > MAILADDR fredex
> > ARRAY /dev/md0 level=raid1 num-devices=2 
> > uuid=4eb13e45:b5228982:f03cd503:f935bd69
> > ARRAY /dev/md1 level=raid1 num-devices=2 
> > uuid=5c79b138:e36d4286:df9cf6f6:62ae1f12
> >
> > which doesn't say anything about md125 or md126,... might they be some kind 
> > of detritus
> > or fragments left over from whatever kind of failure caused the array to 
> > become degraded?
> >
> > do ya suppose a boot from power-off might somehow give it a whack upside 
> > the head so
> > it'll reassemble itself according to mdadm.conf?
> >
> > I

Re: [CentOS] more software raid questions

2010-10-20 Thread Rob Kampen




fred smith wrote:

  On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:

fred smith wrote:

Well, I've already tried to use --fail and --remove on md125 and md126
but I'm told the members are still active.

mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2

	mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
	mdadm: set /dev/sdb1 faulty in /dev/md126
	mdadm: hot remove failed for /dev/sdb1: Device or resource busy

with the intention of then re-adding them to md0 and md1.

so I tried:

mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
and got a similar message. 

at which point I knew I was in over my head.

  

it appears that the devices are mounted - need to umount first??
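
a quick way to see what, if anything, is actually holding md125/md126 busy
(just a sketch; the pvs check only matters if LVM happens to sit on top):

    grep -E 'md125|md126' /proc/mounts
    grep -E 'md125|md126' /proc/swaps
    pvs 2>/dev/null | grep -E 'md125|md126'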

  
  
When I create my Raid arrays, I always use the option --bitmap=internal. 
With this option set, a bitmap is used to keep track of which pages on 
the drive are out of date and then you only resync pages which need 
updating instead of recopying the whole drive when this happens. In the 
past I once added a bitmap to an existing raid1 array using something 
like this. This may not be the exact command, but I know it can be done:
mdadm /dev/mdN --bitmap=internal

Adding the bitmap is very worthwhile and saves time and risk of data 
loss by not having to recopy the whole partition.

Nataraj
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos

  
  
  



___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-20 Thread Rudi Ahlers
On Wed, Oct 20, 2010 at 4:34 AM, Nataraj  wrote:
> When I create my Raid arrays, I always use the option --bitmap=internal.
> With this option set, a bitmap is used to keep track of which pages on
> the drive are out of date and then you only resync pages which need
> updating instead of recopying the whole drive when this happens. In the
> past I once added a bitmap to an existing raid1 array using something
> like this. This may not be the exact command, but I know it can be done:
> mdadm /dev/mdN --bitmap=internal
>
> ___
> CentOS mailing list
> CentOS@centos.org
> http://lists.centos.org/mailman/listinfo/centos
>



How do you add  --bitmap=internal to an existing, running RAID set? I
have tried with the command above but got the following error:


[r...@intranet ~]# mdadm /dev/md2 --bitmap=internal
mdadm: -b cannot have any extra immediately after it, sorry.
[r...@intranet ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0]
  104320 blocks [2/1] [U_]

md2 : active raid1 sda1[0] sdb1[1]
  244195904 blocks [2/2] [UU]

md1 : active raid1 hda2[0]
  244091520 blocks [2/1] [U_]




-- 
Kind Regards
Rudi Ahlers
SoftDux

Website: http://www.SoftDux.com
Technical Blog: http://Blog.SoftDux.com
Office: 087 805 9573
Cell: 082 554 7532
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-20 Thread Rob Kampen

Rudi Ahlers wrote:

On Wed, Oct 20, 2010 at 4:34 AM, Nataraj  wrote:
  

When I create my Raid arrays, I always use the option --bitmap=internal.
With this option set, a bitmap is used to keep track of which pages on
the drive are out of date and then you only resync pages which need
updating instead of recopying the whole drive when this happens. In the
past I once added a bitmap to an existing raid1 array using something
like this. This may not be the exact command, but I know it can be done:
mdadm /dev/mdN --bitmap=internal

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos






How do you add  --bitmap=internal to an existing, running RAID set? I
have tried with the command above but got the following error:

  

try mdadm /dev/md2 -Gb internal
also it pays to have everything clean first and to check that you have a
persistent superblock, i.e. mdadm -D /dev/md2
HTH
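
in long form that's roughly (a sketch; the array should be clean, i.e. not
resyncing, when the bitmap is added):

    mdadm --detail /dev/md2                  # confirm "Superblock is persistent"
    mdadm --grow --bitmap=internal /dev/md2
    cat /proc/mdstat                         # a "bitmap: ..." line should now
                                             # show up under md2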

[r...@intranet ~]# mdadm /dev/md2 --bitmap=internal
mdadm: -b cannot have any extra immediately after it, sorry.
[r...@intranet ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hda1[0]
  104320 blocks [2/1] [U_]

md2 : active raid1 sda1[0] sdb1[1]
  244195904 blocks [2/2] [UU]

md1 : active raid1 hda2[0]
  244091520 blocks [2/1] [U_]




  


___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-21 Thread Nataraj
fred smith wrote:
> On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
>   
>>
>> I've seen this kind of thing happen when the autodetection stuff 
>> misbehaves. I'm not sure why it does this or how to prevent it. Anyway, 
>> to recover, I would use something like:
>>
>> mdadm --stop /dev/md125
>> mdadm --stop /dev/md126
>>
>> If for some reason the above commands fail, check and make sure it has 
>> not automounted the file systems from md125 and md126. Hopefully this 
>> won't happen.
>>
>> Then use:
>> mdadm /dev/md0 -a /dev/sdXX
>> To add back the drive which belongs in md0, and similar for md1. In 
>> general, it won't let you add the wrong drive, but if you want to check use:
>> mdadm --examine /dev/sda1 | grep UUID
>> and so forth for all your drives and find the ones with the same UUID.
>> 
>
> Well, I've already tried to use --fail and --remove on md125 and md126
> but I'm told the members are still active.
>
> mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
> mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
>   
You want to use --stop for the md125 and md126. Those are the raid 
devices that are not correct. Once they are stopped, you can take the 
drives from them and return them to md0 and md1 where they belong.

You will need to add the correct drive that was originally paired in
each raid set, but as I mentioned, it won't let you add the wrong
drives, so just try adding sdb1 to md0, and if it doesn't work, add it
to md1 instead. You can't fail out drives from arrays that only have one drive.
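
Concretely, something like this (only a sketch, assuming the pairing shown in
the earlier /proc/mdstat - sdb1 goes with md0 and sdb2 with md1 - and assuming
nothing has md125/md126 mounted or otherwise busy):

    mdadm --stop /dev/md126
    mdadm --stop /dev/md125
    mdadm /dev/md0 -a /dev/sdb1
    mdadm /dev/md1 -a /dev/sdb2
    cat /proc/mdstat   # both arrays should then show a recovery in progress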

Nataraj
>   mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
>   mdadm: set /dev/sdb1 faulty in /dev/md126
>
>
>   mdadm: hot remove failed for /dev/sdb1: Device or resource busy
>
> with the intention of then re-adding them to md0 and md1.
>
> so I tried:
>
> mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
> and got a similar message. 
>
> at which point I knew I was in over my head.
>
>   
>> When I create my Raid arrays, I always use the option --bitmap=internal. 
>> With this option set, a bitmap is used to keep track of which pages on 
>> the drive are out of date and then you only resync pages which need 
>> updating instead of recopying the whole drive when this happens. In the 
>> past I once added a bitmap to an existing raid1 array using something 
>> like this. This may not be the exact command, but I know it can be done:
>> mdadm /dev/mdN --bitmap=internal
>>
>> Adding the bitmap is very worthwhile and saves time and risk of data 
>> loss by not having to recopy the whole partition.
>>
>> Nataraj
>> ___
>> CentOS mailing list
>> CentOS@centos.org
>> http://lists.centos.org/mailman/listinfo/centos
>> 
>
>   

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-21 Thread fred smith
On Thu, Oct 21, 2010 at 08:59:13AM -0700, Nataraj wrote:
> fred smith wrote:
> > On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
> >   
> >>
> >> I've seen this kind of thing happen when the autodetection stuff 
> >> misbehaves. I'm not sure why it does this or how to prevent it. Anyway, 
> >> to recover, I would use something like:
> >>
> >> mdadm --stop /dev/md125
> >> mdadm --stop /dev/md126
> >>
> >> If for some reason the above commands fail, check and make sure it has 
> >> not automounted the file systems from md125 and md126. Hopefully this 
> >> won't happen.
> >>
> >> Then use:
> >> mdadm /dev/md0 -a /dev/sdXX
> >> To add back the drive which belongs in md0, and similar for md1. In 
> >> general, it won't let you add the wrong drive, but if you want to check 
> >> use:
> >> mdadm --examine /dev/sda1 | grep UUID
> >> and so forth for all your drives and find the ones with the same UUID.
> >> 
> >
> > Well, I've already tried to use --fail and --remove on md125 and md126
> > but I'm told the members are still active.
> >
> > mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
> > mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
> >   
> You want to use --stop for the md125 and md126. Those are the raid 
> devices that are not correct. Once they are stopped, you can take the 
> drives from them and return them to md0 and md1 where they belong.

> 
> You will need to add the correct drive that was originally paired in 
> each raid set, but as I mentioned, it won't let you add the wrong 
> drives, so just try adding sdb1 to md0, then if it doesn't work, add it 
> to md1 instead. You can't fail out drives from arrays that only have one drive.

Thanks for the additional information.

I'll try backing up everything this weekend then will take a stab at it.

someone said earlier that the differing raid superblocks were probably
the cause of the misassignment in the first place. But I have no clue
how the superblocks could have become messed up; can any of you comment
on that? Will I need to hack at that issue, too, before I can succeed?

thanks again!

> 
> Nataraj
> > mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
> > mdadm: set /dev/sdb1 faulty in /dev/md126
> >
> >
> > mdadm: hot remove failed for /dev/sdb1: Device or resource busy
> >
> > with the intention of then re-adding them to md0 and md1.
> >
> > so I tried:
> >
> > mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
> > and got a similar message. 
> >
> > at which point I knew I was in over my head.
> >
> >   
> >> When I create my Raid arrays, I always use the option --bitmap=internal. 
> >> With this option set, a bitmap is used to keep track of which pages on 
> >> the drive are out of date and then you only resync pages which need 
> >> updating instead of recopying the whole drive when this happens. In the 
> >> past I once added a bitmap to an existing raid1 array using something 
> >> like this. This may not be the exact command, but I know it can be done:
> >> mdadm /dev/mdN --bitmap=internal
> >>
> >> Adding the bitmap is very worthwhile and saves time and risk of data 
> >> loss by not having to recopy the whole partition.
> >>
> >> Nataraj

-- 
 Fred Smith -- fre...@fcshome.stoneham.ma.us -
The Lord detests the way of the wicked 
  but he loves those who pursue righteousness.
- Proverbs 15:9 (niv) -
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-21 Thread Nataraj
fred smith wrote:
>
> Thanks for the additional information.
>
> I'll try backing up everything this weekend then will take a stab at it.
>
> someone said earlier that the differing raid superblocks were probably
> the cause of the misassignment in the first place. but I have no clue
> how the superblocks could have become messed up, can any of you comment
> on that? willl I need to hack at that issue, too, before I can succeed?
>
> thanks again!
>
>   
>> Nataraj
>> 
I would first try adding the drives back in with:

mdadm /dev/mdN -a /dev/sdXn

Again, this is after having stopped the bogus md arrays.

If that doesn't work, I would try assemble with a --force option, which 
might be a little more dangerous than the hot add, but probably not 
much. I can say that when I have a drive fall out of an array I am 
always able to add it back with the first command (-a). As I mentioned, 
I do have bitmaps on all my arrays, but you can't change that until you 
rebuild the raidset.

I believe these commands will take care of everything. You shouldn't have 
to do any diddling of the superblocks at a low level, and if the problem 
is that bad, you might be best to backup and recreate the whole array or 
engage the services of someone who knows how to muck with the data 
structures on the disk. I've never had to use anything other than mdadm 
to manage my raid arrays and I've never lost data with linux software 
raid in the 10 or more years that I've been using it. I've found it to 
be quite robust. Backing up is just a precaution that is a good idea for 
anyone to take if they care about their data.

If these problems recur on a regular basis, you could have a bad
drive, a power supply problem or a cabling problem. Assuming your drives
are attached to a SATA, SCSI or SAS controller, you can use smartctl to
check the drives and see if they are getting errors or other faults.
smartctl will not work with USB or FireWire attached drives.
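
For example (assuming directly attached disks named sda and sdb, as in the
/proc/mdstat earlier in the thread):

    smartctl -H /dev/sda             # overall health verdict
    smartctl -a /dev/sda             # full attribute and error-log dump
    smartctl -t short /dev/sda       # kick off a short self-test ...
    smartctl -l selftest /dev/sda    # ... and read the result a few minutes later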

Nataraj
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-21 Thread Nataraj
Nataraj wrote:
> fred smith wrote:
>   
>> Thanks for the additional information.
>>
>> I'll try backing up everything this weekend then will take a stab at it.
>>
>> someone said earlier that the differing raid superblocks were probably
>> the cause of the misassignment in the first place. but I have no clue
>> how the superblocks could have become messed up, can any of you comment
>> on that? willl I need to hack at that issue, too, before I can succeed?
>>
>> thanks again!
>>
>>   
>> 
>>> Nataraj
>>> 
>>>   
> I would first try adding the drives back in with:
>
> mdadm /dev/mdN -a /dev/sdXn
>
> Again, this is after having stopped the bogus md arrays.
>
> If that doesn't work, I would try assemble with a --force option, which 
> might be a little more dangerous than the hot add, but probably not 
> much. I can say that when I have a drive fall out of an array I am 
> always able to add it back with the first command (-a). As I mentioned, 
> I do have bitmaps on all my arrays, but you can't change that until you 
> rebuild the raidset.
>   
Note that if you need to use assemble --force, you must stop the array
first and know exactly which drives you want to assemble the array with.
> I believe these comands will take care of everything. You shouldn't have 
> to do any diddling of the superblocks at a low level, and if the problem 
> is that bad, you might be best to backup and recreate the whole array or 
> engage the services of someone who knows how to muck with the data 
> structures on the disk. I've never had to use anything other than mdadm 
> to manage my raid arrays and I've never lost data with linux software 
> raid in the 10 or more years that I've been using it. I've found it to 
> be quite robust. Backing up is just a precaution that is a good idea for 
> anyone to take if they care about their data.
>
> If these problems reoccur on a regular basis, you could have a bad 
> drive, a power supply problem or a cabling problem. Assuming your drives 
> are attached to SATA, SCSI or SAS controller, you can use smartctl to 
> check the drives and see if they are getting errors or other faults. 
> smartctl will not work with USB or firefire attached drives.
>
> Nataraj
> ___
> CentOS mailing list
> CentOS@centos.org
> http://lists.centos.org/mailman/listinfo/centos
>   

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-21 Thread Nataraj
Nataraj wrote:
> Nataraj wrote:
>   
>> fred smith wrote:
>>   
>> 
>>> Thanks for the additional information.
>>>
>>> I'll try backing up everything this weekend then will take a stab at it.
>>>
>>> someone said earlier that the differing raid superblocks were probably
>>> the cause of the misassignment in the first place. but I have no clue
>>> how the superblocks could have become messed up, can any of you comment
>>> on that? willl I need to hack at that issue, too, before I can succeed?
>>>
>>> thanks again!
>>>
>>>   
>>> 
>>>   
 Nataraj
 
   
 
>> I would first try adding the drives back in with:
>>
>> mdadm /dev/mdN -a /dev/sdXn
>>
>> Again, this is after having stopped the bogus md arrays.
>>
>> If that doesn't work, I would try assemble with a --force option, which 
>> might be a little more dangerous than the hot add, but probably not 
>> much. I can say that when I have a drive fall out of an array I am 
>> always able to add it back with the first command (-a). As I mentioned, 
>> I do have bitmaps on all my arrays, but you can't change that until you 
>> rebuild the raidset.
>>   
>> 
> Note, that if you need to use assemble --force, you must stop the array 
> first and know exactly which drives you want to assemble the array with.
>   
It's possible that my drives go back so easily because of the bitmap.
You can probably also use --force with the -a option (hot add).
If you use --force, I would make sure that you are specifying the right
drives/partitions, since --force will probably cause whatever partition
you give it to be used in the array regardless of whether it was in the
same array before.  So if you use --force, I would check the UUIDs of
the partitions first and make sure they are the same, since --force
would allow you to insert one of your md1 partitions into your md0 array.
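
For example, comparing each partition's array UUID against the ARRAY lines in
/etc/mdadm.conf before forcing anything (a sketch, assuming the four
partitions from earlier in the thread):

    for p in /dev/sda1 /dev/sdb1 /dev/sda2 /dev/sdb2; do
        echo -n "$p: "; mdadm --examine $p | grep UUID
    done
    grep ARRAY /etc/mdadm.conf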

Nataraj
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] more software raid questions

2010-10-25 Thread Scott Silva
on 10-21-2010 9:13 AM fred smith spake the following:
> On Thu, Oct 21, 2010 at 08:59:13AM -0700, Nataraj wrote:
>> fred smith wrote:
>>> On Tue, Oct 19, 2010 at 07:34:19PM -0700, Nataraj wrote:
>>>   

 I've seen this kind of thing happen when the autodetection stuff 
 misbehaves. I'm not sure why it does this or how to prevent it. Anyway, 
 to recover, I would use something like:

 mdadm --stop /dev/md125
 mdadm --stop /dev/md126

 If for some reason the above commands fail, check and make sure it has 
 not automounted the file systems from md125 and md126. Hopefully this 
 won't happen.

 Then use:
 mdadm /dev/md0 -a /dev/sdXX
 To add back the drive which belongs in md0, and similar for md1. In 
 general, it won't let you add the wrong drive, but if you want to check 
 use:
 mdadm --examine /dev/sda1 | grep UUID
 and so forth for all your drives and find the ones with the same UUID.
 
>>>
>>> Well, I've already tried to use --fail and --remove on md125 and md126
>>> but I'm told the members are still active.
>>>
>>> mdadm /dev/md126 --fail /dev/sdb1 --remove /dev/sdb1
>>> mdadm /dev/md125 --fail /dev/sdb2 --remove /dev/sdb2
>>>   
>> You want to use --stop for the md125 and md126. Those are the raid 
>> devices that are not correct. Once they are stopped, you can take the 
>> drives from them and return them to md0 and md1 where they belong.!
> 
>>
>> You will need to add the correct drive that was originally paired in 
>> each raid set, but as I mentioned, it won't let you add the wrong 
>> drives, so just try adding sdb1 to md0, then if it doesn't work, add it 
>> to md1 instead. You can't fail out drives from arrays that only have one drive.
> 
> Thanks for the additional information.
> 
> I'll try backing up everything this weekend then will take a stab at it.
> 
> someone said earlier that the differing raid superblocks were probably
> the cause of the misassignment in the first place. but I have no clue
> how the superblocks could have become messed up, can any of you comment
> on that? willl I need to hack at that issue, too, before I can succeed?
> 
> thanks again!
> 
If the system lost power or otherwise went off before all superblock data was
flushed, that could have corrupted the data. I would assume that the oddball
devices were the corrupt ones, but unless you have something to compare to, it
is hard to be sure.

___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos