Re: drive failure during rebuild causes page fault
>> You need to overwrite the metadata (see above), which is located in
>> different places depending on metadata format.
>
> So where is it located with the sil3114 controller? (same as 3112, but
> with 4 ports...)

On Sun, May 22, 2005 at 12:45:05AM +0200, Søren Schmidt wrote:
> Depends on what BIOS you have on there; several exist for the SiI chips.
> -current or mkIII would tell you which. Just null out the last 63 sectors
> on the disks and you should be fine, since all possible formats are in
> that range...

I know how to do this using dd from the start of the disk. How do I do
this at the end of the disk?

-- 
Joe Rhett
senior geek
meer.net
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: drive failure during rebuild causes page fault
On 22/05/2005, at 18:11, Joe Rhett wrote:
>>> You need to overwrite the metadata (see above), which is located in
>>> different places depending on metadata format.
>>
>> So where is it located with the sil3114 controller? (same as 3112, but
>> with 4 ports...)
>
> On Sun, May 22, 2005 at 12:45:05AM +0200, Søren Schmidt wrote:
>> Depends on what BIOS you have on there; several exist for the SiI
>> chips. -current or mkIII would tell you which. Just null out the last
>> 63 sectors on the disks and you should be fine, since all possible
>> formats are in that range...
>
> I know how to do this using dd from the start of the disk. How do I do
> this at the end of the disk?

man dd ? :)

You need to get the size of the disk in sectors (hint: atacontrol), then
you do:

dd if=/dev/zero of=/dev/adN oseek=(size-63)

- Søren
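Søren's one-liner can be sketched out more fully. The following is only an
illustration: the device name and sector count are made-up placeholders
(get the real sector count for your disk first, e.g. with diskinfo), and
the dd command is echoed rather than executed, since running it destroys
the metadata region for real.

```shell
# Hypothetical sketch of zeroing the last 63 sectors, where all the known
# ATA RAID metadata formats live. DISK and SECTORS are placeholders;
# query the real sector count before doing anything like this.
DISK=ad8
SECTORS=156301488                # example capacity in 512-byte sectors
OSEEK=$((SECTORS - 63))          # first sector of the metadata region

# Echo the command instead of running it; double-check the numbers
# before removing the "echo", because this overwrites data irrevocably.
echo "dd if=/dev/zero of=/dev/${DISK} bs=512 oseek=${OSEEK} count=63"
```

oseek positions the output pointer, so dd writes only the final 63
sectors instead of sweeping the whole disk from the start.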
Re: drive failure during rebuild causes page fault
On 21/05/2005, at 1:10, Joe Rhett wrote:
>>> Soren, I've just retested all of this with 5.4-REL, and most of the
>>> problems listed here are solved. The only remaining problems appear
>>> to be related to the ghost arrays that appear when it finds a drive
>>> that was taken offline earlier. For example, pull a drive and then
>>> reboot the system.
>>
>> This depends heavily on the metadata format used; some formats simply
>> don't have the info to avoid this, and some just ignore the problem.
>> [...]
>> You need to overwrite the metadata (see above), which is located in
>> different places depending on metadata format.
>
> So where is it located with the sil3114 controller? (same as 3112, but
> with 4 ports...)

Depends on what BIOS you have on there; several exist for the SiI chips.
-current or mkIII would tell you which. Just null out the last 63 sectors
on the disks and you should be fine, since all possible formats are in
that range...

> Is there anything I can do with userland utilities?
>
>> ATA mkIII is exactly about getting ata-raid rewritten from the old
>> cruft that was originally written before even ATA-ng was done, so yes,
>> I'd expect it to behave better, but not necessarily to solve all your
>> problems, since some of them might be features of the metadata.
>
> So what do I need to know to determine the problem?

The metadata format, for one; that's the most important factor in getting
this to work. But some formats have no generation count or anything like
it, so it's hard if not impossible to avoid this problem.

- Søren
Re: drive failure during rebuild causes page fault
On Thu, May 19, 2005 at 08:21:13AM +0200, Søren Schmidt wrote:
> On 19/05/2005, at 2.20, Joe Rhett wrote:
>> Soren, I've just retested all of this with 5.4-REL, and most of the
>> problems listed here are solved. The only remaining problems appear to
>> be related to the ghost arrays that appear when it finds a drive that
>> was taken offline earlier. For example, pull a drive and then reboot
>> the system.
>
> This depends heavily on the metadata format used; some formats simply
> don't have the info to avoid this, and some just ignore the problem.
> [...]
> You need to overwrite the metadata (see above), which is located in
> different places depending on metadata format.

So where is it located with the sil3114 controller? (same as 3112, but
with 4 ports...) Is there anything I can do with userland utilities?

> ATA mkIII is exactly about getting ata-raid rewritten from the old cruft
> that was originally written before even ATA-ng was done, so yes, I'd
> expect it to behave better, but not necessarily to solve all your
> problems, since some of them might be features of the metadata.

So what do I need to know to determine the problem?

-- 
Joe Rhett
senior geek
meer.net
Re: drive failure during rebuild causes page fault
On 19/05/2005, at 2.20, Joe Rhett wrote:
> Soren, I've just retested all of this with 5.4-REL, and most of the
> problems listed here are solved. The only remaining problems appear to
> be related to the ghost arrays that appear when it finds a drive that
> was taken offline earlier. For example, pull a drive and then reboot
> the system.

This depends heavily on the metadata format used; some formats simply
don't have the info to avoid this, and some just ignore the problem.

> 1. If you reboot the system you can delete the array cleanly, but it
>    returns next time. I can't figure out how to make this information
>    go away, and I've tried low-level formatting the disks :-(

You need to overwrite the metadata (see above), which is located in
different places depending on metadata format.

> 2. Removing the array using atacontrol delete after an atacontrol
>    reinit channel will always produce a page fault. For example, if you
>    have only a single array in a system and you lose a drive, and then
>    it returns later...
>
>    # atacontrol status 1
>    atacontrol: ioctl(ATARAIDSTATUS): Device not configured
>    # atacontrol reinit 5
>    ...finds disk
>    # atacontrol status 1
>    ar1: ATA RAID1 subdisks: DOWN DOWN status: DEGRADED
>    # atacontrol delete 1
>    *Page Fault*
>
> We can't run -current, so I'm hoping to find options to work with this
> as is. If you know for a fact that this has changed in the mkIII
> patches then I'd be willing to investigate, but I will need to be
> certain.

ATA mkIII is exactly about getting ata-raid rewritten from the old cruft
that was originally written before even ATA-ng was done, so yes, I'd
expect it to behave better, but not necessarily to solve all your
problems, since some of them might be features of the metadata.

> I know that you have no desire to work on this older code, but could
> you at least clue me in on how to get atacontrol to drop these ghost
> arrays?

See above.
- Søren
Re: drive failure during rebuild causes page fault
Soren, I've just retested all of this with 5.4-REL, and most of the
problems listed here are solved. The only remaining problems appear to be
related to the ghost arrays that appear when it finds a drive that was
taken offline earlier. For example, pull a drive and then reboot the
system.

1. If you reboot the system you can delete the array cleanly, but it
   returns next time. I can't figure out how to make this information go
   away, and I've tried low-level formatting the disks :-(

2. Removing the array using atacontrol delete after an atacontrol reinit
   channel will always produce a page fault. For example, if you have
   only a single array in a system and you lose a drive, and then it
   returns later...

   # atacontrol status 1
   atacontrol: ioctl(ATARAIDSTATUS): Device not configured
   # atacontrol reinit 5
   ...finds disk
   # atacontrol status 1
   ar1: ATA RAID1 subdisks: DOWN DOWN status: DEGRADED
   # atacontrol delete 1
   *Page Fault*

We can't run -current, so I'm hoping to find options to work with this as
is. If you know for a fact that this has changed in the mkIII patches
then I'd be willing to investigate, but I will need to be certain. I know
that you have no desire to work on this older code, but could you at
least clue me in on how to get atacontrol to drop these ghost arrays?

On Tue, Dec 14, 2004 at 04:53:59PM -0800, Joe Rhett wrote:
> Soren, do you have any thoughts on what I could do to alleviate or
> better debug this page fault? I've found three ways to cause this (in
> all cases "pull" is either a physical pull or atacontrol detach
> channel):
>
> 1. Pull a drive and rebuild onto hot spare. Pull hot spare. *boom*
> 2. Pull a drive and rebuild onto hot spare. Pull good disk. *boom*
>    ...should cause filesystem failure, but not a page fault when it's
>    not /
> 3. Pull a drive and then put it back. The system suddenly has a new
>    array with just that drive in it. atacontrol delete new-array.
>    *boom*
>
> In particular, what's the story with the new array appearing when you
> insert a drive with array metadata on it?
> That array appears to be half-there (no devices, etc.), which is
> probably what causes #2...

On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
> Actually I'm in the process of rewriting the ATA RAID code, so things
> are rolling, albeit slowly; time is a precious resource. I believe that
> it can be made pretty robust, but the rest of the kernel still has
> issues with disappearing devices etc. that are out of ATA's realm.
> Anyhow, I can only test with the HW I have here in the lab, which by no
> means covers all possible permutations, so testing etc. by the
> community is very much needed here to get things sorted out...

-- 
Joe Rhett
senior geek
meer.net
Re: drive failure during rebuild causes page fault
On Wed, 2004-Dec-15 19:16:59 -0500, asym wrote:
> [audio jukebox]
>> what would be your recommendations for this particular (and very
>> limited) application?
>
> Honestly, I'd probably go for a RAID1+0 setup. It wastes half the space
> in total for mirroring, but it has none of the performance penalties of
> RAID-5,

If you're just talking about audio, then RAID-5 would seem a better
choice. You get much higher effective space utilisation (75-90% rather
than 50%), and even the degraded bandwidth is plenty for serving a couple
of audio streams.

> and up to half the drives in the array can fail without anything but
> speed being degraded.

Normally, you replace a drive soon after it fails, so the risk of a
second drive failing should be fairly low. Note that you should try to
get drives from different batches: all vendors have the occasional bad
batch, and you don't want all your drives to die at once.

> RAID5 sacrifices write speed and redundancy for the sake of space.
> Since you're using IDE and the drives are pretty cheap, I don't see the
> need for such a sacrifice.

For Gianluca's application, write speed wouldn't seem to be an issue.
Redundancy may or may not be an issue; it depends how quickly a failed
drive can be replaced and whether the risk of one of the other drives
failing during this period is acceptable. The main advantage of RAID-5 is
increased space, and this would seem to be an important issue.

-- 
Peter Jeremy
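The 75-90% utilisation figure above follows directly from the RAID-5
overhead of one parity disk's worth of space: usable fraction is (n-1)/n
for n drives. A quick sketch, with drive counts chosen to match the
quoted range:

```shell
# RAID-5 space efficiency (n-1)/n as an integer percentage.
# 4 drives gives the low end of the range, 10 drives the high end.
for N in 4 10; do
  echo "RAID-5 over ${N} drives: $(( (N - 1) * 100 / N ))% usable"
done
```

A plain mirror (or RAID 1+0), by contrast, is fixed at 50% whatever the
drive count, which is the comparison Peter is drawing.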
Re: drive failure during rebuild causes page fault
At 18:16 12/15/2004, Gianluca wrote:
> barracudas, and at this point I wonder if it's best to go w/ a small hw
> raid controller like the 3ware 7506-4LP or use sw raid. I don't really
> care about speed (I know RAID5 is not the best for that) nor hot
> swapping; my main concern is data integrity. I tried to look online but
> I couldn't find anything w/ practical suggestions except for tutorials
> on how to configure vinum.

If you don't care about hot-swapping, then you don't really care about
(or need) RAID-5. It doesn't offer any additional data integrity; no RAID
level does. What RAID does for you is allow you to survive an outright
drive failure without losing any data. No RAID level can save you from
buggy software writing garbage to the disk, transient disk errors, or the
myriad other events that are far more common than a single drive just
dying on you.

Using RAID-5 as an example: during normal operations, a chunk is written
to the disk, and the controller (or software) calculates the bitwise XOR
of all the blocks involved and writes that value into the parity stripe.
During read operations, this parity data is not read or verified; doing
so would be pointless, because there is no way to tell whether it's the
parity stripe or the data stripe that's lying if the two don't agree. So,
during normal operations (all drives up and functioning), RAID-5
functions readwise as a RAID-0 with one less disk than you really have,
and as a somewhat slower array during writes.

If a drive completely fails, then the parity stripe is always read, and
the missing data stripe is reconstructed from the parity data, unless the
parity stripe happens to fall on the missing drive for the stripe set
you're currently accessing, in which case it is ignored, and for that
single access the array functions just as it would if a drive had not
failed.
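The XOR arithmetic described above can be demonstrated with a toy
example. The byte values are arbitrary stand-ins; a real array applies
the same operation bit-for-bit across whole stripes.

```shell
# Three one-byte "data stripes" (arbitrary example values).
D1=0xA5; D2=0x3C; D3=0x0F

# On write, the parity stripe is the XOR of all data stripes.
PARITY=$(( D1 ^ D2 ^ D3 ))

# If the drive holding D2 dies, XORing the parity with the surviving
# stripes reconstructs it; this is what a degraded-mode read does.
RECOVERED=$(( PARITY ^ D1 ^ D3 ))

printf 'parity=0x%02X recovered=0x%02X\n' "$PARITY" "$RECOVERED"
```

The recovered value equals the original D2, which is why exactly one
missing stripe per stripe set can always be rebuilt, and why two
simultaneous losses cannot.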
If you're thinking of using RAID instead of good timely backups, you need
to go back to the drawing board, because that is not what RAID is
intended to replace -- and is something it cannot replace.
Re: drive failure during rebuild causes page fault
At 18:57 12/15/2004, Gianluca wrote:
> actually all the data I plan to keep on that server is gonna be backed
> up, either to cdr/dvdr or in the original audio cds that I still have.
> what I meant by integrity is trying to avoid having to go back to the
> backups to restore 120G (or more in this case) that were on a dead
> drive. I've done that before, and even if it's no mission-critical
> data, it remains a huge PITA :)

That's true. Restoring is always a pain in the ass, no matter the media
you use.

> thanks for the detailed explanation of how RAID5 works, somehow I
> didn't really catch the distinction between the normal and degraded
> operations on the array. what would be your recommendations for this
> particular (and very limited) application?

Honestly, I'd probably go for a RAID1+0 setup. It wastes half the space
in total for mirroring, but it has none of the performance penalties of
RAID-5, and up to half the drives in the array can fail without anything
but speed being degraded. You can sort of think of this as having a
second dedicated array for "backups" if you want, with the normal
caveats, namely that destroyed data cannot be recovered, such as things
purposely deleted.

RAID5 sacrifices write speed and redundancy for the sake of space. Since
you're using IDE and the drives are pretty cheap, I don't see the need
for such a sacrifice.

Just make sure the controller can do real 1+0. Several vendors are
confused about the differences between 1+0, 0+1, and 10; they mistakenly
call their RAID 0+1 support RAID-10. The difference is pretty important,
though. If you have, say, 8 drives, in RAID 1+0 (aka 10) you would first
create 4 RAID-1 mirrors with 2 disks each, and then use these 4 virtual
disks in a RAID-0 stripe setup. This would be optimal, as any 4 drives
could fail, provided they all came from different RAID-1 pairs. In 0+1,
you first create two 4-disk RAID-0 arrays and then use one as a mirror of
the other to create one large RAID-1 disk.
In this setup, which has *no* benefits over 1+0, if any drive fails, the
entire 4-disk RAID-0 stripe set that the disk is in goes offline and you
are left with no redundancy: the entire array is degraded, running off
the remaining 4-disk RAID-0 array, and if any of the drives in that array
fail, you're smoked.

If you want redundancy to avoid having to possibly restore data, and you
can afford more disks, go 1+0. If you can't afford more disks, then one
of the striped+parity solutions (-3, -4, -5) is all you can do... but be
ready to see write performance anywhere from OK on a $1500 controller, to
annoying on a sub-$500 controller, to downright slow on anything down in
the cheap end, including most IDE controllers. Look up the controller,
find out what I/O chip it's using (most are Intel based, either StrongARM
or i960), and see if the chip supports hardware XOR. If it doesn't,
you'll really wish it did.
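For concreteness, here is the back-of-envelope arithmetic behind the
trade-off above, using the 8-drive example from the message (the 400 GB
drive size is an illustrative assumption, not a figure from the thread):

```shell
DISKS=8; SIZE_GB=400

# RAID 1+0: half the spindles hold mirror copies; the array survives one
# loss per mirror pair, i.e. up to DISKS/2 failures in the best case.
RAID10_GB=$(( DISKS / 2 * SIZE_GB ))
RAID10_MAX_FAIL=$(( DISKS / 2 ))

# RAID-5: one disk's worth of space goes to parity; exactly one
# concurrent failure is survivable.
RAID5_GB=$(( (DISKS - 1) * SIZE_GB ))

echo "RAID 1+0: ${RAID10_GB} GB usable, up to ${RAID10_MAX_FAIL} failures"
echo "RAID 5:   ${RAID5_GB} GB usable, exactly 1 failure"
```

The 1+0 best case assumes the failures all land in different mirror
pairs; losing both halves of any one pair still kills the array, which is
why it is "up to" and not a guarantee.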
Re: drive failure during rebuild causes page fault
Hello, I've been following this thread w/ apprehension, since I'm in the
process of putting together my first RAID server. Maybe this problem has
nothing to do w/ what I have in mind, but I figured I'd ask the experts
first. I want to make a fileserver for home use, mostly as a music
jukebox, and since I've had my share of failed drives already, I decided
I wanted to do RAID5 and use a real OS. I'm already running 5.3 on my
desktop, so I figured I'd use it on the server as well. I've got 4 400G
barracudas, and at this point I wonder if it's best to go w/ a small hw
raid controller like the 3ware 7506-4LP or use sw raid. I don't really
care about speed (I know RAID5 is not the best for that) nor hot
swapping; my main concern is data integrity. I tried to look online, but
I couldn't find anything w/ practical suggestions except for tutorials on
how to configure vinum. Thanks for any help/pointers.

g.
Re: drive failure during rebuild causes page fault
> If you're thinking of using RAID instead of good timely backups, you
> need to go back to the drawing board, because that is not what RAID is
> intended to replace -- and is something it cannot replace.

actually all the data I plan to keep on that server is gonna be backed
up, either to cdr/dvdr or in the original audio cds that I still have.
what I meant by integrity is trying to avoid having to go back to the
backups to restore 120G (or more in this case) that were on a dead drive.
I've done that before, and even if it's no mission-critical data, it
remains a huge PITA :)

thanks for the detailed explanation of how RAID5 works; somehow I didn't
really catch the distinction between the normal and degraded operations
on the array. what would be your recommendations for this particular (and
very limited) application?

thanks a lot for your help,
g.
Re: drive failure during rebuild causes page fault
Soren, do you have any thoughts on what I could do to alleviate or better
debug this page fault? I've found three ways to cause this (in all cases
"pull" is either a physical pull or atacontrol detach channel):

1. Pull a drive and rebuild onto hot spare. Pull hot spare. *boom*
2. Pull a drive and rebuild onto hot spare. Pull good disk. *boom*
   ...should cause filesystem failure, but not a page fault when it's
   not /
3. Pull a drive and then put it back. The system suddenly has a new array
   with just that drive in it. atacontrol delete new-array. *boom*

In particular, what's the story with the new array appearing when you
insert a drive with array metadata on it? That array appears to be
half-there (no devices, etc.), which is probably what causes #2...

On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
> Actually I'm in the process of rewriting the ATA RAID code, so things
> are rolling, albeit slowly; time is a precious resource. I believe that
> it can be made pretty robust, but the rest of the kernel still has
> issues with disappearing devices etc. that are out of ATA's realm.
> Anyhow, I can only test with the HW I have here in the lab, which by no
> means covers all possible permutations, so testing etc. by the
> community is very much needed here to get things sorted out...

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Tue, Dec 14, 2004 at 07:58:53AM +0100, Søren Schmidt wrote:
> Anyhow, I can only test with the HW I have here in the lab, which by no
> means covers all possible permutations, so testing etc. by the
> community is very much needed here to get things sorted out...

This system is just my sandbox in the lab, and we'd be happy to let you
play with it (can't ship it to you, but...). What can I give you to help
you out?

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Sun, 12 Dec 2004, Joe Rhett wrote:
> On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
>> That's a nice shotgun you have there.
>
> Yessir. And that's what testing is designed to uncover. The question is
> why this works, and how do we prevent it?

I'm sure Soren appreciates you donating your feet to the cause :)

Why it works: the system assumes the administrator is competent enough
not to yank a disk that is being rebuilt to.

> Is there a proper way to handle these sorts of events? If so, where is
> it documented? And FYI, just pulling the drives causes the same
> failure, so that means that RAID1 buys you nothing, because your system
> will also crash.

This is why I don't trust ATA RAID for fault tolerance: it'll save your
data, but the system will tank. Since the disk state is maintained by the
OS and not abstracted by a separate processor, if a disk dies in a
particularly bad way, the system may not be able to cope.

-- 
Doug White | FreeBSD: The Power to Serve
[EMAIL PROTECTED] | www.FreeBSD.org
Re: drive failure during rebuild causes page fault
On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
> That's a nice shotgun you have there.

On Sun, 12 Dec 2004, Joe Rhett wrote:
> Yessir. And that's what testing is designed to uncover. The question is
> why this works, and how do we prevent it?

On Mon, Dec 13, 2004 at 10:28:53AM -0800, Doug White wrote:
> I'm sure Soren appreciates you donating your feet to the cause :)

That's what sandbox feet are for ;-)

> Why it works: the system assumes the administrator is competent enough
> not to yank a disk that is being rebuilt to.

Yes, I and most others are. But that's a bad assumption. The issue is
fairly simple: what occurs if the disk goes offline due to a hardware
failure? For example, that SATA interface starts having problems. We
replace the drive, assuming it is the drive. The rebuild starts, and the
interface dies again. Bam! There goes the system. Not good. Or perhaps
it's a DOA drive and it fails during the rebuild?

Is there a proper way to handle these sorts of events? If so, where is it
documented? And FYI, just pulling the drives causes the same failure, so
that means that RAID1 buys you nothing, because your system will also
crash.

> This is why I don't trust ATA RAID for fault tolerance -- it'll save
> your data, but the system will tank. Since the disk state is maintained
> by the OS and not abstracted by a separate processor, if a disk dies in
> a particularly bad way the system may not be able to cope.

Yes, but SATA isn't limited by this problem. It does have a processor per
disk. (This is all SATA, if I didn't make that clear.)

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Mon, 2004-12-13 at 10:28 -0800, Doug White wrote:
> On Sun, 12 Dec 2004, Joe Rhett wrote:
>> On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
>>> That's a nice shotgun you have there.
>>
>> Yessir. And that's what testing is designed to uncover. The question
>> is why this works, and how do we prevent it?
>
> I'm sure Soren appreciates you donating your feet to the cause :)
>
> Why it works: the system assumes the administrator is competent enough
> not to yank a disk that is being rebuilt to.

That's not quite fair. He was obviously testing to see how resilient ATA
RAID is to drive failures during rebuilding, as part of a series of
tests. (Obviously, it is not.) If you look at his original message, he
did not even yank the disk. He detached it in a somewhat orderly fashion
using atacontrol detach. (One can argue that physically yanking it might
have been a more accurate, if more severe, failure test.) This makes the
ensuing panic even more sad.

(Would the same panic result if the disk being rebuilt fell victim to one
of those TIMEOUT - WRITE_DMA errors that are in vogue nowadays and was
detached by the system? I get those errors occasionally [never used to
under 5.1 on the exact same hardware], but my geom_mirror has coped with
it so far, thankfully.)

It's reasonable to conduct simulated failure testing of ATA RAID (or
others, such as geom_mirror and geom_vinum) prior to adopting it on your
system. I know I did in the case of ATA RAID, and I abandoned it
precisely because it turned out for me to be too flaky when it came to
error recovery.

Cheers,
Paul.

-- 
e-mail: [EMAIL PROTECTED]

Without music to decorate it, time is just a bunch of boring production
deadlines or dates by which bills must be paid.
--- Frank Vincent Zappa
Re: drive failure during rebuild causes page fault
On Mon, Dec 13, 2004 at 04:03:06PM -0500, Paul Mather wrote:
> That's not quite fair. He was obviously testing to see how resilient
> ATA RAID is to drive failures during rebuilding, as part of a series of
> tests. (Obviously, it is not.) If you look at his original message, he
> did not even yank the disk. He detached it in a somewhat orderly
> fashion using atacontrol detach.

Actually, I did both, and both caused the same page fault :-(

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Mon, 13 Dec 2004, Joe Rhett wrote:
>> This is why I don't trust ATA RAID for fault tolerance -- it'll save
>> your data, but the system will tank. Since the disk state is
>> maintained by the OS and not abstracted by a separate processor, if a
>> disk dies in a particularly bad way the system may not be able to
>> cope.
>
> Yes, but SATA isn't limited by this problem. It does have a processor
> per disk. (This is all SATA, if I didn't make that clear.)

Actually, on SATA it's worse: the disk just stops responding to
everything and hangs. If you don't detect this condition, then you go
into an infinite wait.

In any case, yes, the ATA RAID code could use a massive robustness pass.
So could the core ATA code. Patches accepted :)

-- 
Doug White | FreeBSD: The Power to Serve
[EMAIL PROTECTED] | www.FreeBSD.org
Re: drive failure during rebuild causes page fault
Doug White wrote:
> On Mon, 13 Dec 2004, Joe Rhett wrote:
>>> This is why I don't trust ATA RAID for fault tolerance -- it'll save
>>> your data, but the system will tank. Since the disk state is
>>> maintained by the OS and not abstracted by a separate processor, if a
>>> disk dies in a particularly bad way the system may not be able to
>>> cope.
>>
>> Yes, but SATA isn't limited by this problem. It does have a processor
>> per disk. (This is all SATA, if I didn't make that clear.)
>
> Actually, on SATA it's worse: the disk just stops responding to
> everything and hangs. If you don't detect this condition, then you go
> into an infinite wait.
>
> In any case, yes, the ATA RAID code could use a massive robustness
> pass. So could the core ATA code. Patches accepted :)

Actually I'm in the process of rewriting the ATA RAID code, so things are
rolling, albeit slowly; time is a precious resource. I believe that it
can be made pretty robust, but the rest of the kernel still has issues
with disappearing devices etc. that are out of ATA's realm. Anyhow, I can
only test with the HW I have here in the lab, which by no means covers
all possible permutations, so testing etc. by the community is very much
needed here to get things sorted out...

-- 
-Søren
drive failure during rebuild causes page fault
And another: I can now confirm that it is fairly easy to kill 5.3-RELEASE
during the rebuilding process. The following steps will cause a kernel
page fault consistently:

atacontrol create RAID0 ad6 ad10
atacontrol detach 5
  log: ad10 deleted from ar0 disk1
  log: ad10 WARNING - removed from configuration
atacontrol addspare 0 ad8
  log: ad8 inserted into ar0 disk1 as spare
atacontrol rebuild 0
atacontrol detach 4
  log: ad8 deleted from ar0 disk1
  log: ad8 WARNING - removed from configuration

Fatal trap 12: page fault while in kernel mode
fault virtual address = 0x10
current process = 1063 (rebuilding ar0 1%)
trap number = 12
panic: page fault

(Tell me if you want or need anything I skipped above. Got lazy, because
I had to type it in by hand...)

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Sun, 12 Dec 2004, Joe Rhett wrote:
> And another: I can now confirm that it is fairly easy to kill
> 5.3-RELEASE during the rebuilding process. The following steps will
> cause a kernel page fault consistently:
>
> atacontrol create RAID0 ad6 ad10
> atacontrol detach 5
>   log: ad10 deleted from ar0 disk1
>   log: ad10 WARNING - removed from configuration
> atacontrol addspare 0 ad8
>   log: ad8 inserted into ar0 disk1 as spare
> atacontrol rebuild 0
> atacontrol detach 4
>   log: ad8 deleted from ar0 disk1
>   log: ad8 WARNING - removed from configuration
>
> Fatal trap 12: page fault while in kernel mode
> fault virtual address = 0x10

That's a nice shotgun you have there.

-- 
Doug White | FreeBSD: The Power to Serve
[EMAIL PROTECTED] | www.FreeBSD.org
Re: drive failure during rebuild causes page fault
And here's where I found even more interesting stuff (again with the
sil3114 controller). If you detach a channel and then attach the channel,
a new raid device gets created, and the removed drive shows up in the new
array...

# atacontrol create RAID0 ad6 ad8
# atacontrol detach 4
Dec 12 21:55:18 sandbox kernel: ad8: deleted from ar0 disk1
Dec 12 21:55:18 sandbox kernel: ar0: WARNING - mirror lost
Dec 12 21:55:18 sandbox kernel: ad8: WARNING - removed from configuration
sandbox# atacontrol status 1
atacontrol: ioctl(ATARAIDSTATUS): Device not configured

Okay, ar0 is broken, and raid array 1 doesn't exist.

# atacontrol attach 4
Dec 12 21:55:57 sandbox kernel: ad8: 76319MB ST380013AS/3.18
[155061/16/63] at ata4-master SATA150
sandbox# atacontrol status 1
ar1: ATA RAID1 subdisks: DOWN ad8 status: BROKEN

Hm? Where did this array come from?

Okay, so now someone will tell me that I'm doing things all out of order,
which I suspect. But that leaves the obvious fact that others will do
this, and there is no documentation to suggest otherwise. What about a
command to show the current list of raid arrays? Either make 'atacontrol
status' return the status of all arrays in the system, or make a new
command that will list which arrays are available. I only stumbled on
this because I mistyped a number and then realized that I was looking at
the wrong thing (and the wrong thing should not exist!)

On Sun, Dec 12, 2004 at 09:42:00PM -0800, Joe Rhett wrote:
> And another: I can now confirm that it is fairly easy to kill
> 5.3-RELEASE during the rebuilding process.
> The following steps will cause a kernel page fault consistently:
>
> atacontrol create RAID0 ad6 ad10
> atacontrol detach 5
>   log: ad10 deleted from ar0 disk1
>   log: ad10 WARNING - removed from configuration
> atacontrol addspare 0 ad8
>   log: ad8 inserted into ar0 disk1 as spare
> atacontrol rebuild 0
> atacontrol detach 4
>   log: ad8 deleted from ar0 disk1
>   log: ad8 WARNING - removed from configuration
>
> Fatal trap 12: page fault while in kernel mode
> fault virtual address = 0x10
> current process = 1063 (rebuilding ar0 1%)
> trap number = 12
> panic: page fault
>
> (tell me if you want or need anything I skipped above. Got lazy cause I
> had to type it in by hand...)

-- 
Joe Rhett
Senior Geek
Meer.net
Re: drive failure during rebuild causes page fault
On Sun, 12 Dec 2004, Joe Rhett wrote:
> And another, I can now confirm that it is fairly easy to kill
> 5.3-release during the rebuilding process. The following steps will
> cause a kernel page fault consistently:
>
> atacontrol create RAID0 ad6 ad10
> atacontrol detach 5
>   log: ad10 deleted from ar0 disk1
>   log: ad10 WARNING - removed from configuration
> atacontrol addspare 0 ad8
>   log: ad8 inserted into ar0 disk1 as spare
> atacontrol rebuild 0
> atacontrol detach 4
>   log: ad8 deleted from ar0 disk1
>   log: ad8 WARNING - removed from configuration
>
> Fatal trap 12: page fault while in kernel mode
> fault virtual address = 0x10

On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
> Thats a nice shotgun you have there.

Yessir. And that's what testing is designed to uncover. The question is
why this works, and how do we prevent it? Is there a proper way to handle
these sort of events? If so, where is it documented? And fyi just pulling
the drives causes the same failure so that means that RAID1 buys you
nothing because your system will also crash.

-- 
Joe Rhett
Senior Geek
Meer.net