Re: drive failure during rebuild causes page fault

2005-05-22 Thread Joe Rhett
 You need to overwrite the metadata (see above), which is located in
 different places, again depending on the metadata format.
 
 So where is it located with the SiI 3114 controller?
 (same as the 3112, but with 4 ports...)
 
On Sun, May 22, 2005 at 12:45:05AM +0200, Søren Schmidt wrote:
 Depends on what BIOS you have on there; several exist for the SiI
 chips, and -current or mkIII would tell you which. Just null out the last
 63 sectors on the disks and you should be fine, since all possible
 formats are in that range...
 
I know how to do this using dd from the start of the disk.  How do I do
this at the end of the disk?

-- 
Joe Rhett
senior geek
meer.net


Re: drive failure during rebuild causes page fault

2005-05-22 Thread Søren Schmidt


On 22/05/2005, at 18:11, Joe Rhett wrote:

  You need to overwrite the metadata (see above), which is located in
  different places, again depending on the metadata format.

 So where is it located with the SiI 3114 controller?
 (same as the 3112, but with 4 ports...)

 On Sun, May 22, 2005 at 12:45:05AM +0200, Søren Schmidt wrote:

  Depends on what BIOS you have on there; several exist for the SiI
  chips, and -current or mkIII would tell you which. Just null out the last
  63 sectors on the disks and you should be fine, since all possible
  formats are in that range...

 I know how to do this using dd from the start of the disk.  How do I do
 this at the end of the disk?


man dd ? :)

you need to get the size of the disk in sectors (hint atacontrol)

then you do dd if=/dev/zero of=/dev/adN oseek=(size-63)
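
Something like this should do it (an untested sketch; ad4 and the sector
count below are placeholders -- read the real sector count for your disk,
e.g. from diskinfo(8)):

  SECTORS=156301488                     # example total sector count for ad4
  dd if=/dev/zero of=/dev/ad4 bs=512 oseek=$((SECTORS - 63)) count=63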

- Søren




Re: drive failure during rebuild causes page fault

2005-05-21 Thread Søren Schmidt


On 21/05/2005, at 1:10, Joe Rhett wrote:

 On Thu, May 19, 2005 at 08:21:13AM +0200, Søren Schmidt wrote:

  On 19/05/2005, at 2.20, Joe Rhett wrote:

   Soren, I've just retested all of this with 5.4-REL and most of the
   problems listed here are solved.  The only problems appear to be
   related to these ghost arrays that appear when it finds a drive that
   was taken offline earlier.  For example, pull a drive and then reboot
   the system.

  This depends heavily on the metadata format used; some of them simply
  don't have the info to avoid this, and some just ignore the problem.

 [...]

  You need to overwrite the metadata (see above), which is located in
  different places, again depending on the metadata format.

 So where is it located with the SiI 3114 controller?
 (same as the 3112, but with 4 ports...)

Depends on what BIOS you have on there; several exist for the SiI
chips, and -current or mkIII would tell you which. Just null out the last
63 sectors on the disks and you should be fine, since all possible
formats are in that range...

 Is there anything I can do with userland utilities?

  ATA mkIII is exactly about getting ata-raid rewritten from the old
  cruft that originally was written before even ATA-ng was done, so yes
  I'd expect it to behave better, but not necessarily solve all your
  problems, as some of them might be features of the metadata.

 So what do I need to know to determine the problem?

The metadata format, for one; that's the most important factor for
getting this to work, but some of them have no generation count or
anything, so it's hard if not impossible to avoid this problem.


- Søren




Re: drive failure during rebuild causes page fault

2005-05-20 Thread Joe Rhett

On Thu, May 19, 2005 at 08:21:13AM +0200, Søren Schmidt wrote:
 On 19/05/2005, at 2.20, Joe Rhett wrote:
 
  Soren, I've just retested all of this with 5.4-REL and most of the
  problems listed here are solved.  The only problems appear to be
  related to these ghost arrays that appear when it finds a drive that
  was taken offline earlier.  For example, pull a drive and then reboot
  the system.
 
 This depends heavily on the metadata format used; some of them simply
 don't have the info to avoid this, and some just ignore the problem.
[...]
 You need to overwrite the metadata (see above), which is located in
 different places, again depending on the metadata format.
 
So where is it located with the SiI 3114 controller?
(same as the 3112, but with 4 ports...)

Is there anything I can do with userland utilities?

 ATA mkIII is exactly about getting ata-raid rewritten from the old
 cruft that originally was written before even ATA-ng was done, so yes
 I'd expect it to behave better, but not necessarily solve all your
 problems, as some of them might be features of the metadata.
 
So what do I need to know to determine the problem?

-- 
Joe Rhett
senior geek
meer.net


Re: drive failure during rebuild causes page fault

2005-05-19 Thread Søren Schmidt
On 19/05/2005, at 2.20, Joe Rhett wrote:

 Soren, I've just retested all of this with 5.4-REL and most of the
 problems listed here are solved.  The only problems appear to be
 related to these ghost arrays that appear when it finds a drive that
 was taken offline earlier.  For example, pull a drive and then reboot
 the system.

This depends heavily on the metadata format used; some of them simply
don't have the info to avoid this, and some just ignore the problem.

 1. If you reboot the system you can delete the array cleanly, but it
 returns next time.  I can't figure out how to make this information go
 away, and I've tried low-level formatting the disks :-(

You need to overwrite the metadata (see above), which is located in
different places, again depending on the metadata format.

 2. Removing the array using atacontrol delete after an atacontrol
 reinit channel will always produce a page fault.  For example, if you
 have only a single array in a system and you lose a drive, and then it
 returns later...

 # atacontrol status 1
 atacontrol: ioctl(ATARAIDSTATUS): Device not configured
 # atacontrol reinit 5
 ...finds disk
 # atacontrol status 1
 ar1: ATA RAID1 subdisks: DOWN DOWN status: DEGRADED
 # atacontrol delete 1
 *Page Fault*

 We can't run -current, so I'm hoping to find options to work with this
 as is.  If you know for a fact that this has changed in the mkIII
 patches then I'd be willing to investigate, but I will need to be
 certain.

ATA mkIII is exactly about getting ata-raid rewritten from the old
cruft that originally was written before even ATA-ng was done, so yes
I'd expect it to behave better, but not necessarily solve all your
problems, as some of them might be features of the metadata.

 I know that you have no desire to work on this older code, but could
 you at least clue me in on how to get atacontrol to drop these ghost
 arrays?

See above.
- Søren


Re: drive failure during rebuild causes page fault

2005-05-18 Thread Joe Rhett
Soren, I've just retested all of this with 5.4-REL and most of the problems
listed here are solved.  The only problems appear to be related to these
ghost arrays that appear when it finds a drive that was taken offline
earlier.  For example, pull a drive and then reboot the system.

1. If you reboot the system you can delete the array cleanly, but it returns
next time.  I can't figure out how to make this information go away, and
I've tried low-level formatting the disks :-(

2. Removing the array using atacontrol delete after an atacontrol reinit
channel will always produce a page fault.  For example, if you have only a
single array in a system and you lose a drive, and then it returns later..

# atacontrol status 1
atacontrol: ioctl(ATARAIDSTATUS): Device not configured
# atacontrol reinit 5
...finds disk
# atacontrol status 1
ar1: ATA RAID1 subdisks: DOWN DOWN status: DEGRADED
# atacontrol delete 1  
*Page Fault*

We can't run -current, so I'm hoping to find options to work with this as
is.  If you know for a fact that this has changed in the mkIII patches then
I'd be willing to investigate, but I will need to be certain.

I know that you have no desire to work on this older code, but could you at
least clue me in on how to get atacontrol to drop these ghost arrays?

On Tue, Dec 14, 2004 at 04:53:59PM -0800, Joe Rhett wrote:
 Soren, do you have any thoughts on what I could do to alleviate or better
 debug this page fault?  I've found three ways to cause this:
 (in all cases, pull means either a physical pull or an atacontrol detach of the channel)
 
 1. Pull a drive and rebuild onto hot spare. Pull hot spare *boom*
 
 2. Pull a drive and rebuild onto hot spare. Pull good disk *boom*
 ...should cause filesystem failure, but not page fault when it's not /
 
 3. Pull a drive and then put it back.  The system suddenly has a new array
 with just that drive in it. atacontrol delete new-array *boom*
 
 In particular, what's the story with the new array appearing when you
 insert a drive with array meta-data on it?  That array appears to be
 half-there (no devices, etc) which is probably what causes #2...
 
 On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
  Actually I'm in the process of rewriting the ATA RAID code, so things 
  are rolling, albeit slowly; time is a precious resource. I believe that 
  it can be made pretty robust, but the rest of the kernel still has 
  issues with disappearing devices etc. that are out of ATA's realm.
  
  Anyhow, I can only test with the HW I have here in the lab, which by no 
  means covers all possible permutations, so testing etc. by the community 
  is very much needed here to get things sorted out...
 
 -- 
 Joe Rhett
 Senior Geek
 Meer.net

-- 
Joe Rhett
senior geek
meer.net


Re: drive failure during rebuild causes page fault

2004-12-16 Thread Peter Jeremy
On Wed, 2004-Dec-15 19:16:59 -0500, asym wrote:
[audio jukebox]
  what would be your recommendations for this particular (and very limited)
  application?
 
 Honestly I'd probably go for a RAID1+0 setup.  It wastes half the space in
 total for mirroring, but it has none of the performance penalties of
 RAID-5,

If you're just talking about audio, then RAID-5 would seem a better
choice.  You get much higher effective space utilisation (75-90% rather
than 50%: an n-disk RAID-5 stores n-1 disks' worth of data, so 4 disks
give 75% usable space and 10 give 90%, while mirroring is always 50%)
and even the degraded bandwidth is plenty for serving a couple of audio
streams.

 and up to half the drives in the array can fail without anything but
 speed being degraded.

Normally, you replace a drive soon after it fails.  The risks of a
second drive failing should be fairly low.  Note that you should try
to get drives from different batches - all vendors have the occasional
bad batch and you don't want all your drives to die at once.

 RAID5 sacrifices write speed and redundancy for the sake of space.  Since
 you're using IDE and the drives are pretty cheap, I don't see the need for
 such a sacrifice.

For Gianluca's application, write speed wouldn't seem to be an issue.
Redundancy may or may not be an issue - it depends how quickly a
failed drive can be replaced and whether the risk of one of the
other drives failing during this period is acceptable.

The main advantage of RAID-5 is increased space - and this would seem
to be an important issue.

-- 
Peter Jeremy


Re: drive failure during rebuild causes page fault

2004-12-15 Thread asym
At 18:16 12/15/2004, Gianluca wrote:
 barracudas and at this point I wonder if it's best to go w/ a small hw
 raid controller like the 3ware 7506-4LP or use sw raid. I don't really
 care about speed (I know RAID5 is not the best for that) nor hot
 swapping, my main concern is data integrity. I tried to look online
 but I couldn't find anything w/ practical suggestions except for
 tutorials on how to configure vinum.
If you don't care about hot-swapping, then you don't really care about (or
need) RAID-5.  It doesn't offer any additional data integrity, but then
neither does any RAID level.  What RAID does for you is allow you to survive
an outright drive failure without losing any data.  No RAID level can save
you from buggy software writing garbage to the disk, transient disk errors,
or the myriad other events that are far more common than a single drive just
dying on you.

Using RAID-5 as an example: during normal operations, a chunk is written to
the disk and the controller (or software) calculates the bitwise XOR of
all the blocks involved and writes that value into the parity
stripe.  During read operations, this parity data is not read or verified
-- doing so would be pointless because there is no way to tell whether it's
the parity stripe or a data stripe that's lying if the two don't agree.

So, during normal operations (all drives up and functioning) RAID-5 
functions readwise as a RAID-0 with one less disk than you really have, and 
as a somewhat slower array during writes.

If a drive completely fails, then the parity stripe is always read up, and 
the missing data stripe is reconstructed from the parity data -- unless the 
parity stripe happens to fall on the missing drive for the stripe set 
you're currently accessing, in which case it is ignored and for that single 
access the array functions just as it would if a drive had not failed.
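
To make the parity arithmetic concrete, here is a toy sh sketch (the values
are made up; each variable stands in for one block of a stripe set):

  D0=165 D1=60 D2=15                # three data blocks in one stripe set
  P=$(( D0 ^ D1 ^ D2 ))             # the parity block written alongside them
  echo "parity = $P"
  # pretend the drive holding D1 died; XOR the survivors with the parity
  echo "rebuilt D1 = $(( D0 ^ D2 ^ P )) (original was $D1)"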

If you're thinking of using RAID instead of good timely backups, you need 
to go back to the drawing board, because that is not what RAID is intended 
to replace -- and is something it cannot replace.



Re: drive failure during rebuild causes page fault

2004-12-15 Thread asym
At 18:57 12/15/2004, Gianluca wrote:
 actually all the data I plan to keep on that server is gonna be backed up,
 either to cdr/dvdr or in the original audio cds that I still have. what I
 meant by integrity is trying to avoid having to go back to the backups to
 restore 120G (or more in this case) that were on a dead drive. I've done
 that before, and even if it's no mission-critical data, it remains a huge
 PITA :)
That's true.  Restoring is always a pain in the ass, no matter the media 
you use.


 thanks for the detailed explanation of how RAID5 works, somehow I didn't
 really catch the distinction between the normal and degraded operations on
 the array.
 
 what would be your recommendations for this particular (and very limited)
 application?
Honestly I'd probably go for a RAID1+0 setup.  It wastes half the space in 
total for mirroring, but it has none of the performance penalties of 
RAID-5, and up to half the drives in the array can fail without anything but
speed being degraded.  You can sort of think of this as having a second
dedicated array for 'backups' if you want, with the normal caveats --
namely that data destroyed on the array, such as things deleted on purpose,
cannot be recovered.

RAID5 sacrifices write speed and redundancy for the sake of space.  Since 
you're using IDE and the drives are pretty cheap, I don't see the need for 
such a sacrifice.

Just make sure the controller can do real 1+0.  Several vendors are
confused about what the differences are between 1+0, 0+1, and 10 -- they
mistakenly call their RAID 0+1 support RAID-10.

The difference is pretty important though.  If you have say 8 drives, in 
RAID 1+0 (aka 10) you would first create 4 RAID-1 mirrors with 2 disks 
each, and then use these 4 virtual disks in a RAID-0 stripe setup.  This 
would be optimal, as any 4 drives could fail provided they all came from 
different RAID-1 pairs.

In 0+1, you first create two 4-disk RAID-0 arrays and then use one as a 
mirror of the other to create one large RAID-1 disk.  In this setup, which 
has *no* benefits over 1+0, if any drive fails the entire 4-disk RAID-0 
stripe set that the disk is in goes offline and you are left with no 
redundancy -- the entire array is degraded, running off the remaining 4-disk
RAID-0 array, and if any of the drives in that array fails, you're smoked.
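
For the 1+0 layout just described, a rough FreeBSD software sketch (untested,
using GEOM's gmirror/gstripe rather than the controller's BIOS RAID, and with
ad4/ad6/ad8/ad10 purely as example device names) would be:

  # kldload geom_mirror geom_stripe   (if the modules are not already loaded)
  gmirror label m0 /dev/ad4 /dev/ad6        # first mirror the disks in pairs
  gmirror label m1 /dev/ad8 /dev/ad10
  gstripe label st0 /dev/mirror/m0 /dev/mirror/m1   # then stripe the mirrors
  newfs /dev/stripe/st0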

If you want redundancy to avoid having to possibly restore data, and you
can afford more disks, go 1+0.  If you can't afford more disks, then one of
the striped+parity solutions (-3, -4, -5) is all you can do, but be ready
to see write performance anywhere from OK on a $1500 controller, to
annoying on a sub-$500 controller, to painfully slow on anything down in
the cheap end, including most IDE controllers.  Look up the controller,
find out what I/O chip it's using (most are Intel based, either StrongARM
or i960) and see if the chip supports hardware XOR.  If it doesn't, you'll
really wish it did.



Re: drive failure during rebuild causes page fault

2004-12-15 Thread Gianluca
Hello,
I've been following this thread w/ apprehension since I'm in the
process of putting together my first RAID server. maybe this problem
has nothing to do w/ what I have in mind but I figure I'd ask the
experts first.
I want to make a fileserver for home use, mostly as a music jukebox
and since I've had my share of failed drives already I decided I
wanted to do RAID5 and use a real OS. I'm already running 5.3 on my
desktop so I figured I'd use it on the server as well. I've got 4 400G
barracudas and at this point I wonder if it's best to go w/ a small hw
raid controller like the 3ware 7506-4LP or use sw raid. I don't really
care about speed (I know RAID5 is not the best for that) nor hot
swapping, my main concern is data integrity. I tried to look online
but I couldn't find anything w/ practical suggestions except for
tutorials on how to configure vinum.

thanks for any help/pointer. 

g.


Re: drive failure during rebuild causes page fault

2004-12-15 Thread Gianluca
 If you're thinking of using RAID instead of good timely backups, you need
 to go back to the drawing board, because that is not what RAID is intended
 to replace -- and is something it cannot replace.

actually all the data I plan to keep on that server is gonna be backed
up, either to cdr/dvdr or in the original audio cds that I still have.
what I meant by integrity is trying to avoid having to go back to the
backups to restore 120G (or more in this case) that were on a dead
drive. I've done that before, and even if it's no mission-critical
data, it remains a huge PITA :)

thanks for the detailed explanation of how RAID5 works, somehow I
didn't really catch the distinction between the normal and degraded
operations on the array.

what would be your recommendations for this particular (and very
limited) application?

thanks a lot for your help,

g.


Re: drive failure during rebuild causes page fault

2004-12-14 Thread Joe Rhett
Soren, do you have any thoughts on what I could do to alleviate or better
debug this page fault?  I've found three ways to cause this:
(in all cases, pull means either a physical pull or an atacontrol detach of the channel)

1. Pull a drive and rebuild onto hot spare. Pull hot spare *boom*

2. Pull a drive and rebuild onto hot spare. Pull good disk *boom*
...should cause filesystem failure, but not page fault when it's not /

3. Pull a drive and then put it back.  The system suddenly has a new array
with just that drive in it. atacontrol delete new-array *boom*

In particular, what's the story with the new array appearing when you
insert a drive with array meta-data on it?  That array appears to be
half-there (no devices, etc) which is probably what causes #2...

On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
 Actually I'm in the process of rewriting the ATA RAID code, so things 
 are rolling, albeit slowly; time is a precious resource. I believe that 
 it can be made pretty robust, but the rest of the kernel still has 
 issues with disappearing devices etc. that are out of ATA's realm.
 
 Anyhow, I can only test with the HW I have here in the lab, which by no 
 means covers all possible permutations, so testing etc. by the community 
 is very much needed here to get things sorted out...

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-14 Thread Joe Rhett
On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
 Anyhow, I can only test with the HW I have here in the lab, which by no 
 means covers all possible permutations, so testing etc. by the community 
 is very much needed here to get things sorted out...
 
So this system is just my sandbox in the lab, and we'd be happy to let you
play with it (can't ship it to you, but ...)

What can I give you to help you out?

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Doug White
On Sun, 12 Dec 2004, Joe Rhett wrote:

 On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
  That's a nice shotgun you have there.

 Yessir.  And that's what testing is designed to uncover.  The question is
 why this works, and how do we prevent it?

I'm sure Soren appreciates you donating your feet to the cause :)

Why it works: the system assumes the administrator is competent enough to
not yank a disk that is being rebuilt to.

 Is there a proper way to handle these sort of events?  If so, where is it
 documented?

 And fyi just pulling the drives causes the same failure so that means that
 RAID1 buys you nothing because your system will also crash.

This is why I don't trust ATA RAID for fault tolerance -- it'll save your
data, but the system will tank.  Since the disk state is maintained by
the OS and not abstracted by a separate processor, if a disk dies in a
particularly bad way the system may not be able to cope.

-- 
Doug White|  FreeBSD: The Power to Serve
[EMAIL PROTECTED]  |  www.FreeBSD.org


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Joe Rhett
  On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
   That's a nice shotgun you have there.

 On Sun, 12 Dec 2004, Joe Rhett wrote:
  Yessir.  And that's what testing is designed to uncover.  The question is
  why this works, and how do we prevent it?
 
On Mon, Dec 13, 2004 at 10:28:53AM -0800, Doug White wrote:
 I'm sure Soren appreciates you donating your feet to the cause :)
 
That's what sandbox feet are for ;-)

 Why it works: the system assumes the administrator is competent enough to
 not yank a disk that is being rebuilt to.
 
Yes, I and most others are.  But that's a bad assumption.  The issue is
fairly simple: what happens if the disk goes offline due to a hardware
failure?  For example, the SATA interface starts having problems.  We
replace the drive, assuming it is the drive.  The rebuild starts, and the
interface dies again.  Bam! There goes the system.  Not good.

Or, perhaps it's a DOA drive and it fails during the rebuild?

  Is there a proper way to handle these sort of events?  If so, where is it
  documented?
 
  And fyi just pulling the drives causes the same failure so that means that
  RAID1 buys you nothing because your system will also crash.
 
 This is why I don't trust ATA RAID for fault tolerance -- it'll save your
 data, but the system will tank.  Since the disk state is maintained by
 the OS and not abstracted by a separate processor, if a disk dies in a
 particularly bad way the system may not be able to cope.
 
Yes, but SATA isn't limited by this problem.  It does have a processor per
disk. (this is all SATA, if I didn't make that clear)

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Paul Mather
On Mon, 2004-12-13 at 10:28 -0800, Doug White wrote:
 On Sun, 12 Dec 2004, Joe Rhett wrote:
 
  On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
   That's a nice shotgun you have there.
 
  Yessir.  And that's what testing is designed to uncover.  The question is
  why this works, and how do we prevent it?
 
 I'm sure Soren appreciates you donating your feet to the cause :)
 
 Why it works: the system assumes the administrator is competent enough to
 not yank a disk that is being rebuilt to.

That's not quite fair.  He was obviously testing to see how resilient
ATA RAID is to drive failures during rebuilding, as part of a series of
tests.  (Obviously, it is not.)  If you look at his original message, he
did not even yank the disk.  He detached it in a somewhat orderly
fashion using atacontrol detach.  (One can argue that physically
yanking it might have been a more accurate, if more severe, failure
test.)  This makes the ensuing panic even sadder.  (Would the same
panic result if the disk being rebuilt fell victim to one of those
TIMEOUT - WRITE_DMA errors that are in vogue nowadays and was detached
by the system?  I get those errors occasionally [never used to under 5.1
on the exact same hardware] but my geom_mirror has coped with it so far,
thankfully.)

It's reasonable to conduct simulated failure testing of ATA RAID (or
others such as geom_mirror and geom_vinum) prior to adopting it on your
system.  I know I did in the case of ATA RAID and abandoned it precisely
because it turned out for me to be too flaky when it came to error
recovery.

Cheers,

Paul.
-- 
e-mail: [EMAIL PROTECTED]

Without music to decorate it, time is just a bunch of boring production
 deadlines or dates by which bills must be paid.
--- Frank Vincent Zappa


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Joe Rhett
On Mon, Dec 13, 2004 at 04:03:06PM -0500, Paul Mather wrote:
 That's not quite fair.  He was obviously testing to see how resilient
 ATA RAID is to drive failures during rebuilding, as part of a series of
 tests.  (Obviously, it is not.)  If you look at his original message, he
 did not even yank the disk.  He detached it in a somewhat orderly
 fashion using atacontrol detach.

Actually, I did both and both caused the same page fault :-(

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Doug White
On Mon, 13 Dec 2004, Joe Rhett wrote:

  This is why I don't trust ATA RAID for fault tolerance -- it'll save your
  data, but the system will tank.  Since the disk state is maintained by
  the OS and not abstracted by a separate processor, if a disk dies in a
  particularly bad way the system may not be able to cope.

 Yes, but SATA isn't limited by this problem.  It does have a processor per
 disk. (this is all SATA, if I didn't make that clear)

Actually on SATA it's worse -- the disk just stops responding to everything
and hangs.  If you don't detect this condition then you go into an
infinite wait.

In any case, yes the ATA RAID code could use a massive robustness pass. So
could the core ATA code.  Patches accepted :)

-- 
Doug White|  FreeBSD: The Power to Serve
[EMAIL PROTECTED]  |  www.FreeBSD.org


Re: drive failure during rebuild causes page fault

2004-12-13 Thread Søren Schmidt
Doug White wrote:
 On Mon, 13 Dec 2004, Joe Rhett wrote:

   This is why I don't trust ATA RAID for fault tolerance -- it'll save your
   data, but the system will tank.  Since the disk state is maintained by
   the OS and not abstracted by a separate processor, if a disk dies in a
   particularly bad way the system may not be able to cope.

  Yes, but SATA isn't limited by this problem.  It does have a processor per
  disk. (this is all SATA, if I didn't make that clear)

 Actually on SATA it's worse -- the disk just stops responding to everything
 and hangs.  If you don't detect this condition then you go into an
 infinite wait.

 In any case, yes the ATA RAID code could use a massive robustness pass. So
 could the core ATA code.  Patches accepted :)
Actually I'm in the process of rewriting the ATA RAID code, so things 
are rolling, albeit slowly; time is a precious resource. I believe that 
it can be made pretty robust, but the rest of the kernel still has 
issues with disappearing devices etc. that are out of ATA's realm.

Anyhow, I can only test with the HW I have here in the lab, which by no 
means covers all possible permutations, so testing etc. by the community 
is very much needed here to get things sorted out...

--
-Søren


drive failure during rebuild causes page fault

2004-12-12 Thread Joe Rhett
And another, I can now confirm that it is fairly easy to kill 5.3-release
during the rebuilding process.  The following steps will cause a kernel
page fault consistently:

atacontrol create RAID0 ad6 ad10
atacontrol detach 5
log: ad10 deleted from ar0 disk1
log: ad10 WARNING - removed from configuration
atacontrol addspare 0 ad8
log: ad8 inserted into ar0 disk1 as spare
atacontrol rebuild 0
atacontrol detach 4
log: ad8 deleted from ar0 disk1
log: ad8 WARNING - removed from configuration

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x10

current process = 1063 (rebuilding ar0 1%)
trap number = 12
panic: page fault

(tell me if you want or need anything I skipped above.  Got lazy cause I
had to type it in by hand...)

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-12 Thread Doug White
On Sun, 12 Dec 2004, Joe Rhett wrote:

 And another, I can now confirm that it is fairly easy to kill 5.3-release
 during the rebuilding process.  The following steps will cause a kernel
 page fault consistently:

 atacontrol create RAID0 ad6 ad10
 atacontrol detach 5
   log: ad10 deleted from ar0 disk1
   log: ad10 WARNING - removed from configuration
 atacontrol addspare 0 ad8
   log: ad8 inserted into ar0 disk1 as spare
 atacontrol rebuild 0
 atacontrol detach 4
   log: ad8 deleted from ar0 disk1
   log: ad8 WARNING - removed from configuration

 Fatal trap 12: page fault while in kernel mode
 fault virtual address = 0x10

That's a nice shotgun you have there.

-- 
Doug White|  FreeBSD: The Power to Serve
[EMAIL PROTECTED]  |  www.FreeBSD.org


Re: drive failure during rebuild causes page fault

2004-12-12 Thread Joe Rhett
And here's where I found even more interesting stuff.  (again with the 
SiI 3114 controller)

If you detach a channel and then attach the channel, a new raid device gets
created.  And the removed drive shows up in the new array...

# atacontrol create RAID0 ad6 ad8
# atacontrol detach 4
Dec 12 21:55:18 sandbox kernel: ad8: deleted from ar0 disk1
Dec 12 21:55:18 sandbox kernel: ar0: WARNING - mirror lost
Dec 12 21:55:18 sandbox kernel: ad8: WARNING - removed from configuration

sandbox# atacontrol status 1
atacontrol: ioctl(ATARAIDSTATUS): Device not configured

Okay, ar0 is broken, and raid array 1 doesn't exist.

# atacontrol attach 4 
Dec 12 21:55:57 sandbox kernel: ad8: 76319MB ST380013AS/3.18 [155061/16/63] at ata4-master SATA150
sandbox# atacontrol status 1
ar1: ATA RAID1 subdisks: DOWN ad8 status: BROKEN

Hm? Where did this array come from?

Okay, so now someone will tell me that I'm doing things all out of order,
which I suspect.  But the obvious point remains that others will do this,
and there is no documentation to suggest otherwise.

What about a command to show the current list of raid arrays?  Either make 
'atacontrol status' return the status of all arrays in the system, or
make a new command that will list out which arrays are available.  I only
stumbled on this because I mistyped a number and then realized that I was
looking at the wrong thing (and the wrong thing should not exist!)
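
In the meantime, a crude userland workaround (just a stopgap sketch, not a
real listing command) is to probe a few array numbers and print whichever
ones answer:

  for n in 0 1 2 3; do
      atacontrol status $n 2>/dev/null
  done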

On Sun, Dec 12, 2004 at 09:42:00PM -0800, Joe Rhett wrote:
 And another, I can now confirm that it is fairly easy to kill 5.3-release
 during the rebuilding process.  The following steps will cause a kernel
 page fault consistently:
 
 atacontrol create RAID0 ad6 ad10
 atacontrol detach 5
   log: ad10 deleted from ar0 disk1
   log: ad10 WARNING - removed from configuration
 atacontrol addspare 0 ad8
   log: ad8 inserted into ar0 disk1 as spare
 atacontrol rebuild 0
 atacontrol detach 4
   log: ad8 deleted from ar0 disk1
   log: ad8 WARNING - removed from configuration
 
 Fatal trap 12: page fault while in kernel mode
 fault virtual address = 0x10
 
 current process = 1063 (rebuilding ar0 1%)
 trap number = 12
 panic: page fault
 
 (tell me if you want or need anything I skipped above.  Got lazy cause I
 had to type it in by hand...)
 
 -- 
 Joe Rhett
 Senior Geek
 Meer.net

-- 
Joe Rhett
Senior Geek
Meer.net


Re: drive failure during rebuild causes page fault

2004-12-12 Thread Joe Rhett
 On Sun, 12 Dec 2004, Joe Rhett wrote:
  And another, I can now confirm that it is fairly easy to kill 5.3-release
  during the rebuilding process.  The following steps will cause a kernel
  page fault consistently:
 
  atacontrol create RAID0 ad6 ad10
  atacontrol detach 5
  log: ad10 deleted from ar0 disk1
  log: ad10 WARNING - removed from configuration
  atacontrol addspare 0 ad8
  log: ad8 inserted into ar0 disk1 as spare
  atacontrol rebuild 0
  atacontrol detach 4
  log: ad8 deleted from ar0 disk1
  log: ad8 WARNING - removed from configuration
 
  Fatal trap 12: page fault while in kernel mode
  fault virtual address = 0x10
 
On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
 That's a nice shotgun you have there.
 
Yessir.  And that's what testing is designed to uncover.  The question is
why this works, and how do we prevent it?

Is there a proper way to handle these sort of events?  If so, where is it
documented?

And FYI, just pulling the drives causes the same failure, so that means
RAID1 buys you nothing because your system will also crash.

-- 
Joe Rhett
Senior Geek
Meer.net