RE: new features time-line

2006-10-15 Thread Neil Brown
On Friday October 13, [EMAIL PROTECTED] wrote:
> Good to hear.  
> 
> I think when I first built my RAID (a few years ago) I did some research on
> this;
> http://www.google.com/search?hl=en&q=bad+block+replacement+capabilities+mdadm
> 
> And found stories where bit errors were an issue.
> http://www.ogre.com/tiki-read_article.php?articleId=7
> 
> After your email, I went out and researched it again.  Eleven months ago a
> patch to address this was submitted for RAID5; I would assume RAID6
> benefited from it too?

Yes.  All appropriate raid levels support auto-overwrite of read
errors.
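
(a minimal sketch of how to exercise this, assuming a 2.6 kernel with the
md sysfs interface and an array at /dev/md0 -- the device name is only an
example; writing to sync_action makes md read every block and rewrite any
read error it hits from the redundant data)

  echo check > /sys/block/md0/md/sync_action   # scrub: read everything, rewrite read errors
  cat /proc/mdstat                             # watch the scrub progress
  echo repair > /sys/block/md0/md/sync_action  # also rewrite parity/mirror mismatches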

NeilBrown


Re: ANNOUNCE: mdadm 2.5.4 - A tool for managing Soft RAID under Linux

2006-10-15 Thread Neil Brown
On Friday October 13, [EMAIL PROTECTED] wrote:
> On Fri, Oct 13, 2006 at 10:15:35AM +1000, Neil Brown wrote:
> >
> >I am pleased to announce the availability of
> >   mdadm version 2.5.4
> 
> It looks like you did not include the patches I posted against 2.5.3.
> 

No... sorry about that.

They are all in the git tree now.

Thanks,
NeilBrown


Re: possible deadlock through raid5/md

2006-10-15 Thread Peter T. Breuer
While travelling over the last few days, a theory occurred to me that
might explain this sort of thing ...


>  A user has sent me a ps ax output showing an enbd client daemon
>  blocked in get_active_stripe (I presume in raid5.c).
> 
> ps ax -of,uid,pid,ppid,pri,ni,vsz,rss,wchan:30,stat,tty,time,command
> 
> F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN STAT TT TIME COMMAND
> 5 0 26540 1  23   0  2140 1048 get_active_stripe Ds   ?  00:00:00 
> enbd-client iss04 1300 -i iss04-hdd -n 2  -e -m -b 4096 -p 30 /dev/ndl


Suppose that memory is full of dirty buffers and that the _transport_
for the medium on which one of the raid disks is running (in this case
tcp, under enbd and elsewhere) needs buffers.  It needs buffers both to
read and to write.  But none are available, so the call made by the user
process that wants to use the transport causes the kernel to try to free
pages.

That sends the user process into the kernel routines that try to flush
devices to disk, and through them into the various (request?) functions
of device drivers, and perhaps even into raid5's get_active_stripe.

However, if that stripe is on a remote disk available through tcp, then
tcp is blocked by lack of the very resources the kernel is trying to
free, so we are in deadlock?


Sound plausible?  The cure ought to be to keep some kernel memory
available for tcp that is not available to dirty buffers.
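
(a rough user-space approximation of such a reserve, offered only as a
sketch -- it is not the per-socket reserve suggested above, and the sysctl
assumes a 2.6 kernel: raising the VM's minimum free watermark leaves more
headroom for critical allocations, tcp's included)

  sysctl -w vm.min_free_kbytes=16384       # keep ~16 MB free; value is only an example
  echo 'vm.min_free_kbytes = 16384' >> /etc/sysctl.conf   # make it persistent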

Peter



Re: Multiple Disk Failure Recovery

2006-10-15 Thread Roger Gammans
On Sat, Oct 14, 2006 at 11:35:31PM -0700, David Rees wrote:
> On 10/14/06, Lane Brooks <[EMAIL PROTECTED]> wrote:
> >functioning.  Right now I cannot get a spare disk recovery to finish
> >because of these bad sectors.  Is there a way to force as much recovery as
> >possible so that I can replace this newly faulty drive?
> 
> One technique is to use ddrescue to create an image of the failing
> drive(s) (I would image all drives if possible) and use those images
> to try to retrieve your data.
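
(for reference, a minimal ddrescue invocation along those lines -- device
and file names are only examples, and older ddrescue releases may spell
the options slightly differently, so check your version's manual)

  ddrescue -n /dev/sdb /mnt/space/sdb.img /mnt/space/sdb.log   # first pass: grab what reads cleanly, keep a log
  ddrescue -r3 /dev/sdb /mnt/space/sdb.img /mnt/space/sdb.log  # retry the bad areas up to three times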


I'm currently writing an error-suppression md layer precisely for
situations like this, to remove the need for the dd step.

Obviously it doesn't recover the data in the bad sectors, or data masked
by errors, but the aim is to reduce the need for the multiple copies that
recovery procedures tend to involve.

TTFN
-- 
Roger.  Home| http://www.sandman.uklinux.net/
Master of Peng Shui.  (Ancient oriental art of Penguin Arranging)
Work|Independent Sys Consultant | http://www.computer-surgery.co.uk/
So what are the eigenvalues and eigenvectors of 'The Matrix'? --anon




FW: Multiple Disk Failure Recovery

2006-10-15 Thread Dan

One could remove the spare drive from the system.  Then do the mdadm
--assemble --force to get it started and keep it from trying to
resync/recover.

Once you get the array up and 'limping', carefully pick the most important
stuff and copy it off the array, and hope the bad sectors did not affect
that data.  As you mentioned, you have bad sectors.  Whether you try to copy
it all or just pick the important stuff (I know it is all important, that's
why it was on the RAID), you will eventually hit data that sits on bad
sectors; mdadm will fail the affected drive and deactivate the array.  At
that point accept that the data in that area is most likely gone.  Then do
the mdadm --assemble --force to get it started again and move on to the
next areas of most important data.  Could be a long cycle...
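
(a minimal sketch of that cycle, with made-up device names -- adjust the
array and member devices to your setup, and leave the spare out of the
list so it does not start rebuilding)

  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1  # spare omitted
  cp -a /mnt/raid/most-important-stuff /some/other/disk/
  # when a bad sector kicks a drive out again, stop and force-assemble again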

Aside from that, I am curious: was your spare disk shared by another array?
If it was not, then I would recommend you don't do a RAID5 with a hot spare
next time, and do a RAID6 instead.  But this is my personal feeling and you
can take it or leave it.

I too at one point did a RAID5 with a hot spare, and I was using eight
drives.  So yes, you can have two drives fail, as long as the delta between
failures is long enough to allow the raid to resync the spare in, and during
that process there are no unknown bad sectors on the remaining drives.  But
I got to thinking: if I am going to be spinning/powering that "hot spare", I
may as well do a RAID6.  As long as the hot spare is not shared with other
arrays, I see no downside, and it would protect you in the future from this
problem.
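
(for comparison, creating an eight-drive RAID6 instead of a seven-drive
RAID5 plus a hot spare -- device names are only examples)

  mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]1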

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Lane Brooks
Sent: Saturday, October 14, 2006 9:29 PM
To: linux-raid@vger.kernel.org
Subject: Multiple Disk Failure Recovery

I have a RAID 5 setup with one spare disk.  One of the disks failed, so 
I replaced it, and while it was recovering, it found some bad sectors on 
another drive (unreadable and uncorrectable SMART errors).  This 
generated a fail event and shut down the array.

When I restart the array with the force command, it starts the recovery 
process and dies when it gets to the bad sectors, so I am stuck in an 
infinite loop.

I am wondering if there is a way to cut my losses with these bad sectors 
and have it recover what it can so that I can get my raid array back to 
functioning.  Right now I cannot get a spare disk recovery to finish 
because of these bad sectors.  Is there a way to force as much recovery as 
possible so that I can replace this newly faulty drive?

Thanks
Lane