We have also had a few kernel panics at the same time as a failed disk. I 
don't know which came first, but anecdotally it seems we might be seeing an 
occasional kernel panic coinciding with a disk failure on swraid...  Though that is still 
just FUD, so don't put stock in it unless you see it yourself.

-----Original Message-----
From: Oleg Drokin [mailto:gr...@whamcloud.com] 
Sent: Monday, March 28, 2011 3:57 PM
To: Lundgren, Andrew
Cc: Brian O'Connor; lustre-discuss@lists.lustre.org
Subject: Re: [Lustre-discuss] software raid

Hello!

On Mar 28, 2011, at 4:43 PM, Lundgren, Andrew wrote:
> 
> When you reboot a machine that has a failed disk in the array (degraded), the 
> array will not start by default in a degraded state.  If you have LVMs on top 
> of your raid arrays, they will also not start.  You will need to log into the 
> machine, manually force start the array in a degraded state and then manually 
> start the LVM on top of the SW raid array.

I am with you on everything but this point.
In my experience Linux SW RAID does start when the array is degraded, unless 
you have --no-degraded set as a default mdadm option, of course.
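For reference, the manual recovery in the degraded case is usually just a forced 
assemble plus LVM activation, something along these lines (the device names and 
VG name here are only placeholders for your own setup):

    # force the array to start even though one member is missing
    mdadm --assemble --run /dev/md0 /dev/sdb1 /dev/sdc1
    # then activate the volume group sitting on top of it
    vgchange -ay myvg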
There is a subtle case where it does behave strangely, and I see it on just one of 
my nodes: all devices claim they were stopped cleanly, yet they disagree about 
the number of events processed.
In this case the array still starts in degraded mode, but the one disk with 
the outlying event counter is kicked from the array and is not rebuilt until 
you manually re-add it.
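You can spot the mismatch in the per-device event counters and re-add the kicked 
disk along these lines (device names are again just examples):

    # compare the "Events" field across the members
    mdadm --examine /dev/sd[bcd]1 | grep -i events
    # re-add the disk with the outlying counter so it rejoins and resyncs
    mdadm /dev/md0 --re-add /dev/sdd1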
I have seen it only with RAID5 so far, and the theory is that the disk controller 
(or the disks themselves?) in that particular node is bad and does not flush 
its cache when asked or on power-off.
Of course, if you miss this degraded state and don't re-add anything, there is a 
chance that on the next reboot the two remaining disks will get out of sync as well, 
and then the array will fail to start completely.
Surprisingly, what totally fixed this issue for me was enabling write-intent bitmaps 
(of course, if you want to avoid their negative performance impact, you need 
to set them up on a separate device).
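Adding a bitmap to an existing array is a one-liner; the external-file form below 
is what I mean by putting it on a separate device (the path is just an example, 
and the bitmap file has to live on a filesystem outside the array itself):

    # internal bitmap: simplest, but adds extra writes to the array members
    mdadm --grow /dev/md0 --bitmap=internal
    # or keep the bitmap on a different device to avoid that overhead
    mdadm --grow /dev/md0 --bitmap=/other-disk/md0.bitmap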

Bye,
    Oleg
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
