Re: RAID 5 - serious problem
On Wed, Oct 15, 2008 at 03:27:26PM +0200, MeX wrote:
> Hello,
>
> On Wed, October 15, 2008 15:15, Jeremy Chadwick wrote:
>> Are you using the Matrix Storage Technology? If so, immediately stop.
>> FreeBSD's support for this is very, very bad, and will nearly
>> guarantee data loss. There are many of us who have tried it, and it's
>> known to be buggy on FreeBSD.
>>
>> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>>
>> I recommend you stop using this feature and start using ZFS or gvinum
>> for what you need.
>
> Do you think the ZFS code in FreeBSD 7.0-R (7.1-R) is stable enough to
> use in a production environment for RAID-1, or even RAID-5?

What you're asking is completely unrelated to the topic of this thread. The OP should consider his alternatives to Intel MatrixRAID -- it really doesn't matter what the alternatives are, because they're going to be better than the situation/scenario he's in right now. Heck, in this situation, he'd be better off not using MatrixRAID at all and instead having one filesystem per disk! Losing one filesystem would be better than losing *all* of them.

If you want RAID-1, you have the choice of ccd, gmirror, or ZFS. If you want RAID-5, you have the choice of gvinum or ZFS. None of them are perfect, but they're all decent. There are still ZFS stability issues which need to be hammered out, but most of those have been addressed in CURRENT. I'm familiar enough with ZFS tuning at this point that I would trust using it on RELENG_7 (sans root filesystem on ZFS -- I don't want to deal with the hassles involved).

> About 5 years ago I tried vinum for RAID-1, if I remember correctly,
> but I never tried gvinum. Is it production-ready?

I have no personal experience with gmirror or gvinum. I experimented with vinum on FreeBSD 3.x, and I was thoroughly disappointed. When reading that comment, take into consideration that vinum != gvinum, and that was in the FreeBSD 3.x days. Those who have used gmirror have reported immense success with it.
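For concreteness, the gmirror and ZFS alternatives named above come down to a handful of commands. This is a hedged sketch only: the disk names (ad4, ad6, ad8), the provider name gm0, and the pool name tank are all hypothetical, and the commands assume a FreeBSD 7.x system; consult gmirror(8) and zpool(8) before running anything like this.

```shell
# RAID-1 via gmirror (hypothetical disks ad4 and ad6):
gmirror label -v gm0 /dev/ad4 /dev/ad6
echo 'geom_mirror_load="YES"' >> /boot/loader.conf   # load the module at boot
newfs /dev/mirror/gm0                                # then mount as usual

# RAID-Z (ZFS's RAID-5 analogue) across three hypothetical disks:
zpool create tank raidz ad4 ad6 ad8
zpool status tank                                    # verify pool health
```

Either way the OS, not the BIOS, owns the array metadata, which is exactly what avoids the MatrixRAID failure mode discussed in this thread.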
-- 
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |

___
freebsd-hardware@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: RAID 5 - serious problem
> FreeBSD 7.0-Release
> Intel D975XBX2 motherboard (Intel Matrix Storage Technology)
> 3 WD Raptor 74 GB in a RAID 5 array
> 1 WD Raptor 150 GB as a standalone disk
> / and /var mounted on the standalone, /usr on the RAID 5
>
> I believe what happened was that one of the disks didn't respond for
> such a long time that it was marked bad. And afterwards the same thing
> happened for the other disks. When I try to boot the system, all three
> disks are marked Offline.
>
> I am very desperate not to lose my data,

In that case, step one is to use dd(1) to make a bit-for-bit copy of the three drives to some trusted media. Since they are marked bad/offline, you might need to move them to a controller that doesn't know anything about RAID. (Note that there is risk here, and in almost anything you do at this point.)

Once you have this bit-for-bit backup, you can run any experiment you like to attempt to recover your data. If an experiment goes bad, you can write the exact original contents back using dd, then try a different experiment.

While you're at it, make a normal backup of / and /var using dump(8) or whatever you normally use. Once you have *everything* backed up, you can do risky experiments like booting linux.

My personal approach to avoiding data loss is (a) avoid buggy things like inthell and linux, (b) FFS with softdeps and the disk write cache turned off, (c) full backups. I don't have enough ports to run RAID. :-( The downside is that FreeBSD doesn't have NCQ support yet (when? when? when?) so writes are slow. :-(
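The image-then-verify cycle described above can be sketched as follows. The paths here are hypothetical, and an ordinary file stands in for the disk so the mechanics are visible; on the real system the `if=` argument would be a raw device like /dev/ad4 and the output would go to trusted media.

```shell
# Sketch of the dd(1) image-then-verify cycle, with a file as a stand-in disk.
disk=/tmp/standin_disk.img
image=/tmp/standin_disk.backup

# Create a stand-in "disk" with arbitrary contents.
dd if=/dev/urandom of="$disk" bs=64k count=16 2>/dev/null

# Bit-for-bit image; conv=noerror,sync keeps going past read errors and
# pads unreadable blocks, which matters on a failing drive.
dd if="$disk" of="$image" bs=64k conv=noerror,sync 2>/dev/null

# Verify the copy before experimenting on the original.
cmp "$disk" "$image" && echo "image verified"

# After a failed recovery experiment, roll the original back:
# dd if="$image" of="$disk" bs=64k
```

On a healthy source the two files compare identical; on a dying drive, every successful pass over the media may be the last, so the image is made once and all experiments happen against copies of it.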
RAID 5 - serious problem
Dear list,

Something happened that I don't think should be possible. I lost all three disks in my RAID 5 array simultaneously, after approx. two years without any problem. And I fear I will never see my data again. But I really hope some of you clever persons can give me some hints.

My system is:

FreeBSD 7.0-Release
Intel D975XBX2 motherboard (Intel Matrix Storage Technology)
3 WD Raptor 74 GB in a RAID 5 array
1 WD Raptor 150 GB as a standalone disk
/ and /var mounted on the standalone, /usr on the RAID 5

I believe what happened was that one of the disks didn't respond for such a long time that it was marked bad. And afterwards the same thing happened for the other disks. When I try to boot the system, all three disks are marked Offline. The BIOS utility for the host controller has no option to force the disks back online.

I have another machine with an S5000XVN board and Intel Embedded Server RAID Technology II. The BIOS configuration utility on this board has the option to force offline drives back online. I am very desperate not to lose my data, so I don't know if I dare moving the drives to the other machine and trying to make them online again. Do you think I should try?

In general, are there any procedures I can try to recover my RAID array? Or is the offline status definitive – and all data definitely lost? I guess some specialized companies have the expertise to recover lost data from a broken RAID array, but I don't know. And I don't know the price of such a service.

I would really, really appreciate any kind of help. I have backups of most user data, but not of the system configuration (and maybe not even the databases). This is of course pretty stupid. In the future, I will not rely on RAID 5 as a foolproof solution…

Regards,
Jon

-- 
*Jon Theil Nielsen*
Re: RAID 5 - serious problem
On Wed, Oct 15, 2008 at 02:32:25PM +0200, Jon Theil Nielsen wrote:
> Dear list,
>
> Something happened that I don't think should be possible. I lost all
> three disks in my RAID 5 array simultaneously after approx. two years
> without any problem. And I fear I will never see my data again. But I
> really hope some of you clever persons can give me some hints.
>
> My system is:
> FreeBSD 7.0-Release
> Intel D975XBX2 motherboard (Intel Matrix Storage Technology)

Are you using the Matrix Storage Technology? If so, immediately stop. FreeBSD's support for this is very, very bad, and will nearly guarantee data loss. There are many of us who have tried it, and it's known to be buggy on FreeBSD.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

I recommend you stop using this feature and start using ZFS or gvinum for what you need.

> 3 WD Raptor 74 GB in a RAID 5 array
> 1 WD Raptor 150 GB as a standalone disk
> / and /var mounted on the standalone, /usr on the RAID 5
>
> I believe what happened was that one of the disks didn't respond for
> such a long time that it was marked bad. And afterwards the same thing
> happened for the other disks. When I try to boot the system, all three
> disks are marked Offline. The BIOS utility for the host controller has
> no option to force the disks back online. I have another machine with
> an S5000XVN board and Intel Embedded Server RAID Technology II. The
> BIOS configuration utility on this board has the option to force
> offline drives back online.

Embedded RAID of this sort is usually BIOS-managed software RAID: an IC on the motherboard handles LBA/CHS addressing to present a pseudo-array, but the OS still does all of the management, and nothing is off-loaded to hardware.

> I am very desperate not to lose my data, so I don't know if I dare
> moving the drives to the other machine and trying to make them online
> again. Do you think I should try?

No, but you might not have any choice.
It honestly sounds like the metadata on your disks is in a bad state. I would recommend you try booting Linux, since its support for MatrixRAID is significantly better/more advanced. Ideally, you should be able to bring the RAID members back online using Linux's tools, then reboot into FreeBSD and cross your fingers that your data becomes accessible. Once it is accessible, offload it somewhere immediately, and follow my above recommendations.

> In general, are there any procedures I can try to recover my RAID
> array? Or is the offline status definitive – and all data definitely
> lost? I guess some specialized companies have the expertise to recover
> lost data from a broken RAID array, but I don't know. And I don't know
> the price of such a service. I would really, really appreciate any
> kind of help. I have backups of most user data, but not of the system
> configuration (and maybe not even the databases). This is of course
> pretty stupid. In the future, I will not rely on RAID 5 as a foolproof
> solution…

RAID 5 is a fine solution, but you have learned a very valuable lesson, one which I will enclose in asterisks to make it crystal clear: ***RAID DOES NOT REPLACE BACKUPS***. Repeat this mantra over and over until you accept it. :-)
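At the time, the usual Linux tool for Intel MatrixRAID (isw) metadata was dmraid. What follows is a hedged sketch of the inspection steps only: device names and RAID set names vary per system, and none of this guarantees recovery of a failed array; read the dmraid man page on the live system before activating anything.

```shell
# Boot a Linux live CD/USB, then inspect the MatrixRAID (isw) metadata.
dmraid -r          # list block devices carrying RAID metadata dmraid recognizes
dmraid -s          # show the discovered RAID sets and their reported status
dmraid -ay         # attempt to activate all sets (creates /dev/mapper nodes)
ls /dev/mapper/    # activated sets appear here if assembly succeeded
```

The first two commands are read-only queries; only `-ay` changes anything, and even that only assembles device-mapper mappings rather than writing to the member disks.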
Re: RAID 5 - serious problem
On Wed, Oct 15, 2008 at 10:14:42AM +0100, Dieter wrote:
>> FreeBSD 7.0-Release
>> Intel D975XBX2 motherboard (Intel Matrix Storage Technology)
>> 3 WD Raptor 74 GB in a RAID 5 array
>> 1 WD Raptor 150 GB as a standalone disk
>> / and /var mounted on the standalone, /usr on the RAID 5
>>
>> I believe what happened was that one of the disks didn't respond for
>> such a long time that it was marked bad. And afterwards the same
>> thing happened for the other disks. When I try to boot the system,
>> all three disks are marked Offline.
>>
>> I am very desperate not to lose my data,
>
> In that case, step one is to use dd(1) to make a bit-for-bit copy of
> the three drives to some trusted media. Since they are marked
> bad/offline, you might need to move them to a controller that doesn't
> know anything about RAID. (Note that there is risk here, and in almost
> anything you do at this point.) Once you have this bit-for-bit backup,
> you can run any experiment you like to attempt to recover your data.
> If the experiment goes bad, you can write the exact original contents
> back using dd, then try a different experiment. While you're at it,
> make a normal backup of / and /var using dump(8) or whatever you
> normally use. Once you have *everything* backed up, you can do risky
> experiments like booting linux.
>
> My personal approach to avoiding data loss is (a) avoid buggy things
> like inthell and linux.

Interesting, given that we have another thread going as of late that seems to link transparent data loss with AMD AM2-based systems and certain models of Adaptec and possibly LSI Logic controller cards. I like Intel as much as I like AMD -- but it's important to remember that it's becoming more and more difficult to provide flawless stability as complexity increases.

And I have no idea what your beef is with Linux. If the OP is successfully able to bring his array on-line using Linux, I would think that says something about the state of things in FreeBSD, would you agree? Both OSes have their pros and cons.
> (b) FFS with softdeps and the disk write cache turned off,

This has been fully discussed by developers, particularly Matt Dillon. I can point you to a thread discussing why doing this is not only silly, but a bad idea. And if you'd like, I can show you just how bad the performance is on disks with WC disabled using UFS2 + softupdates. When I say bad, I'm serious -- we're talking horrid. And yes, I have tried it -- see PR 127717 for evidence that I *have* tried it. :-)

There *may* be advantages to disabling a disk's write cache when using a hardware RAID controller that offers its own on-board cache (DIMMs, etc.), but that cache should be battery-backed for safety reasons.

> (c) full backups.

I'm curious what your logic is here too -- this one is debatable, so I'd like to hear your view.

> I don't have enough ports to run RAID. :-( The downside is that
> FreeBSD doesn't have NCQ support yet (when? when? when?) so writes are
> slow. :-(

NCQ will not necessarily improve write performance. There have been numerous studies done proving this fact, and I can point you to those as well. TCQ, on the other hand, does offer performance benefits when there are a large number of simultaneous transactions occurring (think: it's more like SCSI's command queueing). I believe Andrey Elsukov is working on getting NCQ support working when AHCI is in use (assuming I remember correctly).
Re: RAID 5 - serious problem
>> My personal approach to avoiding data loss is (a) avoid buggy things
>> like inthell and linux.
>
> Interesting, given that we have another thread going as of late that
> seems to link transparent data loss with AMD AM2-based systems and
> certain models of Adaptec and possibly LSI Logic controller cards.

This is the SCSI with > 4 GiB thread? Sounds like an address-map problem.

> I like Intel as much as I like AMD

That is your right. Inthell has a long history of buggy products, attempting to hide/ignore bugs, poor customer support, outright theft, etc. AMD isn't perfect, but the list of bad things is far, far shorter. And there are other companies to consider besides just inthell and AMD.

> -- but it's important to remember that it's becoming more and more
> difficult to provide flawless stability as complexity increases.

Computers are complex devices and always have been. Yes, this makes it difficult to get everything right. Yet it is possible to achieve very high levels of reliability, better than five 9s.

> And I have no idea what your beef is with Linux.

The quality is crap. Endless problems, including scrambled data.

> If the OP is successfully able to bring his array on-line using Linux,
> I would think that says something about the state of things in
> FreeBSD, would you agree? Both OSes have their pros and cons.

It says linux got something right that FreeBSD got wrong. I never said that BSD gets *everything* right, or that linux gets *everything* wrong.

>> (b) FFS with softdeps and the disk write cache turned off,
>
> This has been fully discussed by developers, particularly Matt Dillon.
> I can point you to a thread discussing why doing this is not only
> silly, but a bad idea. And if you'd like, I can show you just how bad
> the performance is on disks with WC disabled using UFS2 + softupdates.
> When I say bad, I'm serious -- we're talking horrid. And yes, I have
> tried it -- see PR 127717 for evidence that I *have* tried it.
> :-)

I am WELL aware of how bad write performance is on disks with the write cache turned off. I get only about 10% of what the hardware can do, and with large files that is very noticeable. :-( But data integrity is important.

>> (c) full backups.
>
> I'm curious what your logic is here too -- this one is debatable, so
> I'd like to hear your view.

Things go wrong, and when they do, backups are useful. The obvious problem is that a backup quickly becomes out of date as data changes. RAID stays current, but doesn't help with accidental file deletions, cases where the entire machine dies (fire, flood, etc.), and so on. A proper RAID (one that actually helps reliability rather than hurting it) plus off-site backups gets you pretty close. A RAID with an off-site mirror plus off-site backups would be about as reliable as you can get, but if the rate of data change is high, the communication charges could be prohibitive. It all comes down to how important your data is and how much money is available.

> NCQ will not necessarily improve write performance.

I doubt it will help if you have the disk's write cache turned on. I'm pretty sure it will help with the write cache turned off.

> I believe Andrey Elsukov is working on getting NCQ support working
> when AHCI is in use (assuming I remember correctly).

I look forward to having NCQ available. Write performance without it is really pathetic.
Re: RAID 5 - serious problem
On Wed, Oct 15, 2008 at 01:28:43PM +0100, Dieter wrote:
>>> My personal approach to avoiding data loss is (a) avoid buggy things
>>> like inthell and linux.
>>
>> Interesting, given that we have another thread going as of late that
>> seems to link transparent data loss with AMD AM2-based systems and
>> certain models of Adaptec and possibly LSI Logic controller cards.
>
> This is the SCSI with > 4 GiB thread? Sounds like an address-map
> problem.

It's the "am2 MBs - 4g + SCSI wipes out root partition" thread.

>> I like Intel as much as I like AMD
>
> That is your right. Inthell has a long history of buggy products,
> attempting to hide/ignore bugs, poor customer support, outright theft,
> etc. AMD isn't perfect, but the list of bad things is far, far
> shorter. And there are other companies to consider besides just
> inthell and AMD.

I'd rather not debate this, as it's off-topic. We can take it up privately if you desire, but keep in mind that my ideal system would be an AMD processor on an Intel chipset board -- but I'll probably be dead by the time that ever happens. Both companies could learn much from one another.

>> And I have no idea what your beef is with Linux.
>
> The quality is crap. Endless problems, including scrambled data.

I'm not even going to touch this one.

>> If the OP is successfully able to bring his array on-line using
>> Linux, I would think that says something about the state of things in
>> FreeBSD, would you agree? Both OSes have their pros and cons.
>
> It says linux got something right that FreeBSD got wrong. I never said
> that BSD gets *everything* right, or that linux gets *everything*
> wrong.

I don't really consider it an issue of right or wrong; what a very different and unique viewpoint you have! (And I do mean that sincerely.)

>>> (b) FFS with softdeps and the disk write cache turned off,
>>
>> This has been fully discussed by developers, particularly Matt
>> Dillon. I can point you to a thread discussing why doing this is not
>> only silly, but a bad idea.
>> And if you'd like, I can show you just how bad the performance is on
>> disks with WC disabled using UFS2 + softupdates. When I say bad, I'm
>> serious -- we're talking horrid. And yes, I have tried it -- see PR
>> 127717 for evidence that I *have* tried it. :-)
>
> I am WELL aware of how bad write performance is on disks with the
> write cache turned off. I get only about 10% of what the hardware can
> do, and with large files that is very noticeable. :-( But data
> integrity is important.

Your 10% claim is about right. Here are some actual tests I just did (the filesystem layer is in the way, but you get the idea):

atapci0: Intel ICH5 SATA150 controller port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f irq 18 at device 31.2 on pci0
ata0: ATA channel 0 on atapci0
ata0: [ITHREAD]
ad0: 114473MB Seagate ST3120026AS 3.05 at ata0-master SATA150

testbox# ./atacontrol cap ad0 | grep write
write cache                    yes      yes
testbox# dd if=/dev/zero of=/usr/testfile bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 20.199726 secs (53156257 bytes/sec)

testbox# ./atacontrol wc ad0 off
testbox# ./atacontrol cap ad0 | grep write
write cache                    yes      no
testbox# dd if=/dev/zero of=/usr/testfile bs=1m count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 155.745314 secs (6894216 bytes/sec)

That's about 13% of the full capability. No administrator in their right mind is going to disable WC unless the disks are behind some form of controller that does caching. (For NCQ stuff, see below.)

As for the reading material:

http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045495.html
http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045542.html

>>> (c) full backups.
>>
>> I'm curious what your logic is here too -- this one is debatable, so
>> I'd like to hear your view.
>
> Things go wrong, and when they do, backups are useful. The obvious
> problem is that a backup quickly becomes out of date as data changes.
> RAID stays current, but doesn't help with accidental file deletions,
> cases where the entire machine dies (fire, flood, etc.), and so on. A
> proper RAID (one that actually helps reliability rather than hurting
> it) plus off-site backups gets you pretty close. A RAID with an
> off-site mirror plus off-site backups would be about as reliable as
> you can get, but if the rate of data change is high, the communication
> charges could be prohibitive. It all comes down to how important your
> data is and how much money is available.

Ah, sorry, I misinterpreted what you wrote! For some reason I thought you were advocating *not* performing full level-0 backups. :-)

>> NCQ will not necessarily improve write performance.
>
> I doubt it will help if you have the disk's write cache turned on. I'm
> pretty sure it will help with the write cache turned off.

One thing I haven't tested or experimented with is disabling write caching on a drive that has NCQ. Since FreeBSD lacks NCQ right now, we could test this on Linux to see what the I/O difference is (I'm talking purely from a dd or bonnie++ perspective).
Re: RAID 5 - serious problem
>>> I like Intel as much as I like AMD
>>
>> That is your right. Inthell has a long history of buggy products,
>> attempting to hide/ignore bugs, poor customer support, outright
>> theft, etc. AMD isn't perfect, but the list of bad things is far, far
>> shorter. And there are other companies to consider besides just
>> inthell and AMD.
>
> I'd rather not debate this, as it's off-topic. We can take it up
> privately if you desire, but keep in mind that my ideal system would
> be an AMD processor on an Intel chipset board -- but I'll probably be
> dead by the time that ever happens. Both companies could learn much
> from one another.

Inthell apparently has some good fab people. If they were a designless fab house they might not be on my black list.

> No administrator in their right mind is going to disable WC unless the
> disks are behind some form of controller that does caching. (For NCQ
> stuff, see below.)

The only setup I have found that doesn't lose data is FFS + softdep + WC off. So you think I am insane for wanting to not lose data?

>>> NCQ will not necessarily improve write performance.
>>
>> I doubt it will help if you have the disk's write cache turned on.
>> I'm pretty sure it will help with the write cache turned off.
>
> One thing I haven't tested or experimented with is disabling write
> caching on a drive that has NCQ. Since FreeBSD lacks NCQ right now, we
> could test this on Linux to see what the I/O difference is (I'm
> talking purely from a dd or bonnie++ perspective).

The filesystem may be significant, and last time I looked, linux didn't support FFS r/w.

I read something indicating that recent disks do NCQ much better than earlier ones, so NCQ support isn't binary. This, and people testing NCQ with the write cache on, could explain results where NCQ doesn't help.