Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 15:03:03 +0100, mauelsha <[EMAIL PROTECTED]> said:

>> THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the
>> only way you can get bitten by this failure mode is to have a system
>> failure and a disk failure at the same time.

> To try to avoid this kind of problem, some brands do have additional
> logging in place (to disk, which is slow for sure, or to NVRAM), which
> enables them to at least recognize the fault and avoid reconstructing
> invalid data, or even enables them to recover the data by using
> redundant copies of it in NVRAM plus logging information about what
> could be written to the disks and what not.

Absolutely: the only way to avoid it is to make the data+parity updates
atomic, either in NVRAM or via transactions.  I'm not aware of any
software RAID solutions which do such logging at the moment: do you know
of any?

--Stephen
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: > > Hi, > > This is a FAQ: I've answered it several times, but in different places, > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the > only way you can get bitten by this failure mode is to have a system > failure and a disk failure at the same time. > To try to avoid this kind of problem some brands do have additional logging (to disk which is slow for sure or to NVRAM) in place, which enables them to at least recognize the fault to avoid the reconstruction of invalid data or even enables them to recover the data by using redundant copies of it in NVRAM + logging information what could be written to the disks and what not. Heinz
Re: optimising raid performance
[ Tuesday, January 11, 2000 ] [EMAIL PROTECTED] wrote:
> >what stripe/chunk sizes are you using in the raid? My exp. has been
> >smaller is better down to 4k, although I'm not sure why :)
>
> We're currently using 8k but with our load then if I can go smaller
> I will do.
> Is there any merit in using -R on mke2fs if we're doing raid1?

I've always interpreted -R stride= as meaning "how many ext2 blocks to
gather before sending to the lower-level block device".  This way the
block device can deal with things more efficiently.

Since stride= must default to 1 (I can't see how it could pick a
different one), any time your device (h/w or s/w raid) is using larger
chunk sizes -R would seem to be a good choice (for 8K chunk sizes,
stride=2).

The raid1 shouldn't matter as much, so try without stride= and then with
stride=2 (if still using 8K chunk sizes).

I get the feeling that the parallelism vs. efficiency tradeoff in chunk
sizes still isn't fully understood, but lots of random writes should
almost certainly do best with the smallest chunk sizes available, down to
a single page (4k).

As always, I'd like to solicit other views on this :)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development
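As a concrete sketch of the suggestion above, for a 4k-block ext2
filesystem on an array using 8k chunks the invocation might look like
this; the md device name is an assumption:

    # stride = raid chunk size / ext2 block size = 8192 / 4096 = 2
    mke2fs -b 4096 -R stride=2 /dev/md0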
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

On Tue, 11 Jan 2000 20:17:22 +0100, Benno Senoner <[EMAIL PROTECTED]> said:

> Assume all RAID code - FS interaction problems get fixed: since a
> linux soft-RAID5 box has no battery backup, does this mean that we
> will lose data ONLY if there is a power failure AND a subsequent disk
> failure?  If we lose the power, and after reboot all disks remain
> intact, can the RAID layer reconstruct all information in a safe way?

Yes.

--Stephen
Re: large ide raid system
[ Tuesday, January 11, 2000 ] John Burton wrote:
> Performance is pretty good - these numbers are for a first generation
> smartcan (spring '99)

Could you re-run the raidzone and softraid with a size of 512MB or
larger?  Could you run the tiobench.pl from http://www.iki.fi/miku/tiotest
(after "make" to build tiotest)?

Those would be great results to see.

Thanks,
James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development
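For reference, a minimal sketch of the requested run, assuming the
tiotest sources have already been unpacked into the current directory;
512 is the suggested test size in MB and should exceed the machine's RAM:

    make
    # the flag for pointing at a target directory may vary by version
    ./tiobench.pl --size 512 --threads 8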
Re: large ide raid system
On Tue, Jan 11, 2000 at 04:25:27PM +0100, Benno Senoner wrote:
> Jan Edler wrote:
> > I wasn't advising against IDE, only against the use of slaves.
> > With UDMA-33 or -66, masters work quite well,
> > if you can deal with the other constraints that I mentioned
> > (cable length, PCI slots, etc).
>
> Do you have any numbers handy ?

Sorry, I can't seem to find any quantitative results on that right now.

> will the performance of master/slave setup be at least HALF of the
> master-only setup.

I did run some tests, and my recollection is that it was much worse.

> For some apps cost is really important, and software IDE RAID has a very low
> price/Megabyte.
> If the app doesn't need killer performance , then I think it is the best
> solution.

It all depends on your minimum acceptable performance level.  I know my
master/slave test setup couldn't keep up with fast ethernet (10 MByte/s).
I don't remember if it was >1 Mbyte/s or not.

I was also wondering about the reliability of using slaves.  Does anyone
know about the likelihood of a single failed drive bringing down the
whole master/slave pair?  Since I have tended to stay away from slaves,
for performance reasons, I don't know how they influence reliability.
Maybe it's ok.

Jan Edler
NEC Research Institute
Re: optimising raid performance
[ Tuesday, January 11, 2000 ] [EMAIL PROTECTED] wrote:
> >I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
> >the card and check that performance-wise... I believe putting the raid1
> >and raid0 logic on sep. processors could help, and worst case it'll
> >give a nice test case for any read-balancing patches floating around
> >(although you've noted that you are more write-intensive)
>
> Which would you like me to try all software or do part in software
> and part in hardware and if the latter which part? The raid card
> seems pretty good (233MHz strongarm onboard) so I doubt that is limiting
> us.

dual PII-500 >> single 233 :)

s/w raid 1 over 2 6-disk h/w raid0's is what I meant to ask for.
I trust the strongarm to handle raid0, but that's about it :)

what stripe/chunk sizes are you using in the raid?  My exp. has been
smaller is better down to 4k, although I'm not sure why :)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development
Re: large ide raid system
john b said:
> Performance is pretty good - these numbers are for a first generation
> smartcan (spring '99)

these numbers are also useless since they are much too close to your ram
size, and bonnie only shows how fast your system runs bonnie :)  a better
benchmark would be to see how this runs with multiple concurrent accesses
to even larger files.  perhaps something like tiotest?

> Using "top":
> - With "Softraid" bonnie and the md Raid-5 software were sharing the
> cpu equally

but what was the total?

> - With "raidzone" bonnie was consuming most (>85%) of the cpu and no
> other processes, and "system" < 15%

but even with bonnie getting more cpu time, the speed did not seem
terribly different.  this makes me wonder about how fast the smartcan's
logic really is...

> Getting back to the discussion of Hardware vs. Software raid...
> Can someone say *definitively* *where* the raid-5 code is being run on a
> *current* Raidzone product? Originally, it was an "md" process running
> on the system cpu. Currently I'm not so sure. The SmartCan *does* have
> its own BIOS, so there is *some* intelligence there, but what exactly is
> the division of responsibility here...

i can't tell you about the division of responsibility, but i can tell you
i keep closed source, binary modules out of my kernel.  i have enough
problems with vendors who don't release specs for their equipment, let
alone those who ride on the backs of the kernel developers by taking
advantage of open code while keeping theirs closed.  vote with your
dollars i say.
Re: large ide raid system
Benno Senoner wrote:
>
> Jan Edler wrote:
>
> > On Mon, Jan 10, 2000 at 12:49:29PM -0800, Dan Hollis wrote:
> > > On Mon, 10 Jan 2000, Jan Edler wrote:
> > > > - Performance is really horrible if you use IDE slaves.
> > > >   Even though you say you aren't performance-sensitive, I'd
> > > >   recommend against it if possible.
> > >
> > > My tests indicate UDMA performs favorably with ultrascsi, at about 1/6 the
> > > cost. Cost is often a big factor.
> >
> > I wasn't advising against IDE, only against the use of slaves.
> > With UDMA-33 or -66, masters work quite well,
> > if you can deal with the other constraints that I mentioned
> > (cable length, PCI slots, etc).
>
> Do you have any numbers handy ?
>
> will the performance of master/slave setup be at least HALF of the
> master-only setup.

Well, this depends on how it's used.  If you were saturating your I/O
bus, then things would be REALLY ugly.  Say you've got a controller
running in UDMA/33 mode, with two disks attached.  If you have drives
that are reasonably fast, say recent 5400 RPM UDMA drives, then this will
actually hinder performance compared to having just one drive.  If you're
doing 16MB/sec of I/O, then your performance will be slightly less than
half the performance of having just one drive on that channel (consider
overhead, IDE controller context switches, etc).  If you only need the
space, then this is an acceptable solution for low throughput
applications.  I don't know jack schitt about ext2, the linux ide drivers
(patches or old ones), or about the RAID code, except that they work.

> For some apps cost is really important, and software IDE RAID has a very low
> price/Megabyte.
> If the app doesn't need killer performance , then I think it is the best
> solution.

It's a very good solution for a small number of disks, where you can keep
everything in a small case.  It may actually be superior to SCSI for
situations where you have 4 or fewer disks and can put just a single disk
on a controller.

> now if we only had soft-RAID + journaled FS + power failure safeness right now
> ...

As long as it gets there relatively soon, I'll be happy.  fsck'ing is the
only thing that really bugs me...

Greg
Re: large ide raid system
John Burton wrote:
>
> Thomas Davis wrote:
> >
> > James Manning wrote:
> > >
> > > Well, it's kind of on-topic thanks to this post...
> > >
> > > Has anyone used the systems/racks/appliances/etc from raidzone.com?
> > > If you believe their site, it certainly looks like a good possibility.
> >
> > Yes.
> >
> > It's pricey. Not much cheaper than a SCSI chassis. You only save money
> > on the drives.
>
> Interesting... The 100GB Internal RAID-5 SmartCan I purchased from
> RaidZone was approx. $5k. The quotes I got for a SCSI equivalent ranged
> from $10k to $15K. Personally I consider half the cost significantly
> cheaper. I also was quite impressed with a quote for a 1TB rackmount
> system in the $50K range; again, SCSI equivalents were significantly
> higher...

We paid $25k x 4, for:

2x450mhz cpu
256mb ram
15x37gb IBM 5400 drives (550 gb of drive space)
Intel system board, w/eepro
tulip card (channel bonded into cisco5500)

> Performance is pretty good - these numbers are for a first generation
> smartcan (spring '99)
>
>               ---Sequential Output--- ---Sequential Input-- --Random--
>               -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
> Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec  %CPU    /sec %CPU
> raidzone  100  6923 89.7 25987 26.6 14230 28.9  7297 89.4 215121 77.7 16407.3 69.7
> raidzone  200  6537 86.2 22175 21.5 14297 30.2  7667 92.5  56355 36.0   377.5  3.1
>
> Softraid  100  6598 86.0 43411 36.5 12077 27.4  6180 77.9  54022 46.4   721.4  4.1
> Softraid  200  8337 87.9 25373 24.0  9009 18.8  8952 87.1  34413 21.7   301.1  2.2

You made a mistake. :-)  Your bonnie size is smaller than the amount of
memory in the machine you tested on - so you tested the memory, NOT the
drive system.

Our current large machine(s) (15x37gb IBM drives, 500gb file system, 4kb
blocks, v2.2.13 kernel, fixed knfsd, channel bonding, raidzone 1.2.0b3)
does:

              ---Sequential Output--- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
pdsfdv10 1024 14076 85.1 18487 24.3 12089 35.8 20182 83.0 63064 69.8 344.4  7.1

I've also hit it with 8 machines, doing an NFS copy of about 60gb onto
it, and it sustained about a 20mb/sec write rate.

> Using "top":
> - With "Softraid" bonnie and the md Raid-5 software were sharing the
> cpu equally
> - With "raidzone" bonnie was consuming most (>85%) of the cpu and no
> other processes, and "system" < 15%

I've seen load averages in the 5's and 6's.  This is on a dual processor
machine w/256mb of ram.

My biggest complaint is the raid rebuild code runs as the highest
priority, so on a crash/reboot, it takes _forever_ for fsck to complete
(because the rebuild thread is taking all of the CPU and disk bandwidth).

The raidzone code also appears to be single threaded - it doesn't take
advantage of multiple CPU's.  (although, user space code benefits from
having a second CPU then)

> Getting back to the discussion of Hardware vs. Software raid...
> Can someone say *definitively* *where* the raid-5 code is being run on a
> *current* Raidzone product? Originally, it was an "md" process running
> on the system cpu. Currently I'm not so sure. The SmartCan *does* have
> its own BIOS, so there is *some* intelligence there, but what exactly is
> the division of responsibility here...

None of the RAID code runs in the smartcan, or the controller.  It all
runs in the kernel.

The current code has several kernel threads, and a user space thread:

root       6  0.0  0.0    00      ?      SW   Jan04   0:02 [rzft-syncd]
root       7  0.0  0.0    00      ?      SW   Jan04   0:00 [rzft-rcvryd]
root       8  0.1  0.0    00      ?      SW<  Jan04  14:41 [rzft-dpcd]
root     620  0.0  0.0  5640      ?      SW   Jan04   0:00 [rzmpd]
root     621  0.0  0.1  2080  296 ?      S    Jan04   3:30 rzmpd
root    3372  0.0  0.0    00      ?      Z    Jan10   0:00 [rzmpd ]
root    3806  0.0  0.1  1240  492 pts/1  S    09:57   0:00 grep rz

-- 
+-- Thomas Davis | PDSF Project Leader
[EMAIL PROTECTED] | (510) 486-4524
"Only a petabyte of data this year?"
Ribbon Cabling (was Re: large ide raid system)
On Tue, 11 Jan 2000, Gregory Leblanc wrote:
> If you cut the cable
> lengthwise (no, don't cut the wires) between wires (don't break the
> insulation on the wires themselves, just the connecting plastic) you can
> get your cables to be 1/4 the normal width (up until you get to the
> connector).

I don't know about IDE, but I'm pretty sure that's a big no-no for SCSI
cables.  The alternating conductors in the ribbon cable are sig, gnd,
sig, gnd, sig, etc.  And it's electrically important (for proper
impedance and noise and cross-talk rejection) that they stay that way.
I think the same is probably true for the schmancy UDMA66 cables too...

-Andy
Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: > Hi, > > On Fri, 07 Jan 2000 13:26:21 +0100, Benno Senoner <[EMAIL PROTECTED]> > said: > > > what happens when I run RAID5+ jornaled FS and the box is just writing > > data to the disk and then a power outage occurs ? > > > Will this lead to a corrupted filesystem or will only the data which > > was just written, be lost ? > > It's more complex than that. Right now, without any other changes, the > main danger is that the raid code can sometimes lead to the filesystem's > updates being sent to disk in the wrong order, so that on reboot, the > journaling corrupts things unpredictably and silently. > There is a second effect, which is that if the journaling code tries to > prevent a buffer being written early by keeping its dirty bit clear, > then raid can miscalculate parity by assuming that the buffer matches > what is on disk, and that can actually cause damage to other data than > the data being written if a disk dies and we have to start using parity > for that stripe. do you know if using soft RAID5 + regular etx2 causes the same sort of damages, or if the corruption chances are lower when using a non journaled FS ? is the potential corruption caused by the RAID layer or by the FS layer ? ( does need the FS code or the RAID code to be fixed ?) if it's caused by the FS layer, how does behave XFS (not here yet ;-) ) or ReiserFS in this case ? cheers, Benno. > > > Both are fixable, but for now, be careful... > > --Stephen
Which filesystem(s) on RAID for speed
I'm back to running everything from my dog-slow UDMA drive again, because
I have bad feelings about my stripe set.  But once I get things cleared
up, which filesystem(s) should I put on a RAID-0 device for best system
performance?  The two drives in the stripe set are identical, because
this should be the best way to go about it (right?).  I'm looking to
learn more about software RAID on *nix systems, after some bad times with
software RAID on NT, so any good links are appreciated.

Thanks,
Greg
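For anyone building a similar two-disk stripe set with the raidtools of
that era, an /etc/raidtab along these lines would be typical; the device
names, partitions, and chunk size below are assumptions, not taken from
Greg's setup:

    raiddev /dev/md0
        raid-level              0
        nr-raid-disks           2
        persistent-superblock   1
        chunk-size              32
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1

Running mkraid /dev/md0 against that file then creates the stripe set
(destroying whatever was on those partitions).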
Re: large ide raid system
SCSI works quite well with many devices connected to the same cable.  The
PCI bus turns out to be the bottleneck with the faster scsi modes, so it
doesn't matter how many channels you have.  If performance were the issue
(but the original poster wasn't interested in performance), multiple
channels would improve performance when the slower (single-ended) devices
are used.

<><  Lance

Dan Hollis wrote:
> Cable length is not so much a pain as the number of cables. Of course with
> scsi you want multiple channels anyway for performance, so the situation
> is very similar to ide. A cable mess.
Proper settings for fstab
I've managed to create a RAID stripe set (RAID 0) out of a pair of
SCSI2-W (20MB/sec) drives, and it looks happy.  I'd like to mount some
part of my filesystem on this new device, but when I add it to fstab in
an out-of-the-way location, with "1 2" following that entry in fstab, it
always has errors on boot.  They are usually something about attempting
to read thus-and-such block caused a short read.  fsck'ing that drive by
hand generally doesn't find any errors, although every third or fourth
time something will turn up (same error, respond ignore, then fix a
couple of minor errors).  Any ideas on how to track this down?

Thanks,
Greg
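For reference, the kind of fstab entry being described would look roughly
like the line below; the mount point and filesystem type are assumptions,
and the trailing "1 2" are the dump and fsck pass-number fields:

    /dev/md0    /mnt/stripe    ext2    defaults    1 2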
Re: Swapping Drives on RAID?
Scott,

1. Use raidhotremove to take out the IDE drive.
   Example:  raidhotremove /dev/md0 /dev/hda5

2. Use raidhotadd to add the SCSI drive.
   Example:  raidhotadd /dev/md0 /dev/sda5

3. Correct your /etc/raidtab file with the changed device (see the sketch
   below).

<><  Lance.

Scott Patten wrote:
> I'm sorry if this is covered somewhere.  I couldn't find it.
>
> 1 - I have a raid1 consisting of 2 drives.  For strange
> historical reasons one is SCSI and the other IDE.  Although
> the IDE is fairly fast the SCSI is much faster and since I
> now have another SCSI drive to add, I would like to replace
> the IDE with the SCSI.  Can I unplug the IDE drive, run in
> degraded mode, edit the raid.conf and somehow mkraid
> without losing data or do I need to restore from tape.
> BTW, I'm using 2.2.13ac1.
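To illustrate step 3, a sketch of the kind of raidtab change involved,
assuming a two-disk raid1 on /dev/md0 where the removed /dev/hda5 is
replaced by the new /dev/sda5; all device names here are hypothetical:

    raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        persistent-superblock   1
        device                  /dev/sdb5      # the existing SCSI member
        raid-disk               0
        device                  /dev/sda5      # was /dev/hda5 (the removed IDE)
        raid-disk               1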
Re: [FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
"Stephen C. Tweedie" wrote: (...) > > 3) The soft-raid backround rebuild code reads and writes through the >buffer cache with no synchronisation at all with other fs activity. >After a crash, this background rebuild code will kill the >write-ordering attempts of any journalling filesystem. > >This affects both ext3 and reiserfs, under both RAID-1 and RAID-5. > > Interaction 3) needs a bit more work from the raid core to fix, but it's > still not that hard to do. > > So, can any of these problems affect other, non-journaled filesystems > too? Yes, 1) can: throughout the kernel there are places where buffers > are modified before the dirty bits are set. In such places we will > always mark the buffers dirty soon, so the window in which an incorrect > parity can be calculated is _very_ narrow (almost non-existant on > non-SMP machines), and the window in which it will persist on disk is > also very small. > > This is not a problem. It is just another example of a race window > which exists already with _all_ non-battery-backed RAID-5 systems (both > software and hardware): even with perfect parity calculations, it is > simply impossible to guarantee that an entire stipe update on RAID-5 > completes in a single, atomic operation. If you write a single data > block and its parity block to the RAID array, then on an unexpected > reboot you will always have some risk that the parity will have been > written, but not the data. On a reboot, if you lose a disk then you can > reconstruct it incorrectly due to the bogus parity. > > THIS IS EXPECTED. RAID-5 isn't proof against multiple failures, and the > only way you can get bitten by this failure mode is to have a system > failure and a disk failure at the same time. > > > --Stephen thank you very much for these clear explanations, Last doubt: :-) Assume all RAID code - FS interaction problems get fixed, since a linux soft-RAID5 box has no battery backup, does this mean that we will loose data ONLY if there is a power failure AND successive disk failure ? If we loose the power and then after reboot all disks remain intact can the RAID layer reconstruct all information in a safe way ? The problem is that power outages are unpredictable even in presence of UPSes therefore it is important to have some protection against power losses. regards, Benno.
Re: RedHat 6.1
I am running RAID 5 on my news server with Red Hat 6.1 out of the box
with no problems.  I am also running RAID 1 on one of my static content
web servers, again with no problems.  On our news server I am using UW
SCSI 2, and on the web server I am running EIDE UDMA drives.  We have
been running this way for almost a month now with no problems, so I would
say that the RAID stuff is very stable at this time.  I have been
following the RAID driver development for over a year, since as an ISP we
really want/need RAID support.

Tim Jung
System Admin
Internet Gateway Inc.
[EMAIL PROTECTED]

- Original Message -
From: "Jochen Scharrlach" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "RAID Mailinglist" <[EMAIL PROTECTED]>
Sent: Tuesday, January 11, 2000 3:16 AM
Subject: Re: RedHat 6.1

> Tim Niemueller writes:
> > I will get a new computer in some days and I want to build up an array.
> > I will use a derivate of RedHat Linux 6.1 (Halloween 4). There is RAID
> > support in the graphical installation tool, so I think the RAID patches
> > are already attached to the kernel.
>
> Yes, just like the knfs-patches and some other stuff.
>
> > Any hints what I must change, any settings to do? If I compile my own
> > kernel with the supplied kernel source, will this kernel support RAID
> > and can I use it without any changes and can I install the RAID tools
> > from RPM?
>
> The partitioning tool is (IMHO) a bit confusing - you'll have to
> define first partitions of the type "Linux-RAID", which you'll then
> have to combine with the "make RAID device" button.  Don't let yourself
> be confused by the fact that the partition numbers change every time
> you add a partition...
>
> The default kernel options are set to include all RAID-stuff, so this
> is no problem - I've found that it usually isn't necessary to rebuild a
> kernel on RH 5.x/6.x, unless you need special drivers or want to use a
> newer kernel revision.
>
> The raidtools are (of course) in the default install set.
>
> Bye,
> Jochen
>
> --
>
> # mgm ComputerSysteme und -Service GmbH
> # Sophienstr. 26 / 70178 Stuttgart / Germany / Voice: +49.711.96683-5
>
> The Internet treats censorship as a malfunction and routes around it.
>    --John Perry Barlow
Re: optimising raid performance
> - What kinds of numbers are you getting for performance now?

Kinda hard to say, we're far more interested in random IO rather than
sequential stuff.

> and do a make and then ./tiobench.pl --threads 16

/tiotest/ is a single root disk
/data1 is the 8 disc raid 1+0
/data2 is the 3 disc raid 5

All discs are IBM DMVS18D, so 18Gb 10k rpm 2mb cache, sca2 discs.
http://www.storage.ibm.com/hardsoft/diskdrdl/prod/us18lzx36zx.htm

Machine  Directory    Size(MB)  BlkSz  Threads  Read    Write   Seeks
-------  -----------  --------  -----  -------  ------  ------  -------
         /tiotest/         512   4096        1  18.092   6.053  2116.40
         /tiotest/         512   4096        2  16.363   5.792   829.876
         /tiotest/         512   4096        4  17.164   5.882  1520.91
         /tiotest/         512   4096        8  14.533   5.852   932.401
         /tiotest/         512   4096       16  16.244   5.806  1731.60
         /data1/tiot       512   4096        1  29.257  14.406  2234.63
         /data1/tiot       512   4096        2  38.124  13.734     .11
         /data1/tiot       512   4096        4  31.373  12.864  5128.20
         /data1/tiot       512   4096        8  29.341  12.460  4705.88
         /data1/tiot       512   4096       16  34.806  12.121     .55
         /data2/tiot       512   4096        1  23.063  16.269  1851.85
         /data2/tiot       512   4096        2  21.576  16.754  1498.12
         /data2/tiot       512   4096        4  17.908  17.021  3125.00
         /data2/tiot       512   4096        8  15.773  17.107  3478.26
         /data2/tiot       512   4096       16  15.394  16.920  4166.66

> - Did you get a chance to benchmark raid 1+0 against 0+1?
> - Of the 12 disks over 2 channels, which are in the raid0+1, which
>   in the raid5, which spare? how are the drive packs configured?

6 discs on each channel; discs 1-4 of each pack form the raid 1+0 stripe,
discs 5 and 6 on ch1 and disc 5 on ch2 are in the raid5, and disc 6 on
ch2 is the spare.

> - Is the card using its write cache? write-back or write-through?

It's using write-back on both devices.

> - Do you have the latest firmware on the card?

Pretty much, the firmware changelog implies the only real change is to
support PCI hotswap.

> - Which kernel are you using?

Standard Redhat 6.1 kernel:
Linux xxx.yyy.zzz 2.2.12-20smp #1 SMP Mon Sep 27 10:34:45 EDT 1999 i686 unknown

> - What block size is the filesystem? Did you create with a -R param?

4k blocksize, didn't use -R as this is currently hardware raid.

> - What is your percentage of I/O operations that are writes?

Approx 50%

>IMO raid 1+0 for 2 stripes of 6 discs (better be around when a drive goes,
>though, as that second failure will have about a 55% chance of taking
>out the array :)

Can't fault your logic there...  But don't you mean 0+1, i.e. 2 stripes
of 6 discs mirrored together, rather than 1+0 (6 mirroring pairs striped
together)?

>I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
>the card and check that performance-wise... I believe putting the raid1
>and raid0 logic on sep. processors could help, and worst case it'll
>give a nice test case for any read-balancing patches floating around
>(although you've noted that you are more write-intensive)

Which would you like me to try: all software, or part in software and
part in hardware, and if the latter which part?  The raid card seems
pretty good (233MHz strongarm onboard) so I doubt that is limiting us.

thanks,

Chris
-- 
Chris Good - Dialog Corp.  The Westbrook Centre, Milton Rd, Cambridge UK
Phone: 01223 715000  Fax: 01223 715001  http://www.dialog.com
[FAQ-answer] Re: soft RAID5 + journalled FS + power failure = problems ?
Hi,

This is a FAQ: I've answered it several times, but in different places,
so here's a definitive answer which will be my last one: future questions
will be directed to the list archives. :-)

On Tue, 11 Jan 2000 16:20:35 +0100, Benno Senoner <[EMAIL PROTECTED]> said:

>> then raid can miscalculate parity by assuming that the buffer matches
>> what is on disk, and that can actually cause damage to other data
>> than the data being written if a disk dies and we have to start using
>> parity for that stripe.

> do you know if using soft RAID5 + regular ext2 causes the same sort of
> damage, or if the corruption chances are lower when using a non
> journaled FS ?

Sort of.  See below.

> is the potential corruption caused by the RAID layer or by the FS
> layer ? (does the FS code or the RAID code need to be fixed ?)

It is caused by neither: it is an interaction effect.

> if it's caused by the FS layer, how do XFS (not here yet ;-) )
> or ReiserFS behave in this case ?

They will both fail in the same way.

Right, here's the problem:

The semantics of the linux-2.2 buffer cache are not well defined with
respect to write ordering.  There is no policy to guide what gets written
and when: the writeback caching can trickle to disk at any time, and
other system components such as filesystems and the VM can force a
write-back of data to disk at any time.

Journaling imposes write ordering constraints which insist that data in
the buffer cache *MUST NOT* be written to disk unless the filesystem
explicitly says so.

RAID-5 needs to interact directly with the buffer cache in order to be
able to improve performance.  There are three nasty interactions which
result:

1) RAID-5 tries to bunch writes of dirty buffers up so that all the data
   in a stripe gets written to disk at once.  For RAID-5, this is very
   much faster than dribbling the stripe back one disk at a time.

   Unfortunately, this can result in dirty buffers being written to disk
   earlier than the filesystem expected, with the result that on a crash,
   the filesystem journal may not be entirely consistent.

   This interaction hits ext3, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit set.

2) RAID-5 peeks into the buffer cache to look for buffer contents in
   order to calculate parity without reading all of the disks in a
   stripe.  If a journaling system tries to prevent modified data from
   being flushed to disk by deferring the setting of the buffer dirty
   flag, then RAID-5 will think that the buffer, being clean, matches the
   state of the disk, and so it will calculate parity which doesn't
   actually match what is on disk.  If we crash and one disk fails on
   reboot, wrong parity may prevent recovery of the lost data.

   This interaction hits reiserfs, which stores its pending transaction
   buffer updates in the buffer cache with the b_dirty bit clear.

Both interactions 1) and 2) can be solved by making RAID-5 completely
avoid buffers which have an incremented b_count reference count, and
making sure that the filesystems all hold that count raised when the
buffers are in an inconsistent or pinned state.

3) The soft-raid background rebuild code reads and writes through the
   buffer cache with no synchronisation at all with other fs activity.
   After a crash, this background rebuild code will kill the
   write-ordering attempts of any journalling filesystem.

   This affects both ext3 and reiserfs, under both RAID-1 and RAID-5.

Interaction 3) needs a bit more work from the raid core to fix, but it's
still not that hard to do.
So, can any of these problems affect other, non-journaled filesystems
too?  Yes, 1) can: throughout the kernel there are places where buffers
are modified before the dirty bits are set.  In such places we will
always mark the buffers dirty soon, so the window in which an incorrect
parity can be calculated is _very_ narrow (almost non-existent on non-SMP
machines), and the window in which it will persist on disk is also very
small.

This is not a problem.  It is just another example of a race window which
exists already with _all_ non-battery-backed RAID-5 systems (both
software and hardware): even with perfect parity calculations, it is
simply impossible to guarantee that an entire stripe update on RAID-5
completes in a single, atomic operation.  If you write a single data
block and its parity block to the RAID array, then on an unexpected
reboot you will always have some risk that the parity will have been
written, but not the data.  On a reboot, if you lose a disk then you can
reconstruct it incorrectly due to the bogus parity.

THIS IS EXPECTED.  RAID-5 isn't proof against multiple failures, and the
only way you can get bitten by this failure mode is to have a system
failure and a disk failure at the same time.

--Stephen
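A small worked example of that failure mode, as a sketch in shell
arithmetic with single bytes standing in for whole blocks (the values are
arbitrary):

    # Stripe: data blocks d0, d1, d2 plus parity p = d0 ^ d1 ^ d2
    d0=0x11; d1=0x22; d2=0x33
    p=$(( d0 ^ d1 ^ d2 ))

    # Rewrite d1: the new parity reaches disk, but the crash happens
    # before the new d1 does, so the disks hold d0, OLD d1, d2, NEW p.
    d1_new=0x55
    p=$(( d0 ^ d1_new ^ d2 ))

    # The disk holding d0 then dies; reconstruction XORs the survivors:
    printf 'real d0 = 0x11, reconstructed d0 = 0x%x\n' $(( d1 ^ d2 ^ p ))
    # prints 0x66 -- d0, which was never being written, comes back corrupted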
Re: large ide raid system
Thomas Davis wrote:
>
> James Manning wrote:
> >
> > Well, it's kind of on-topic thanks to this post...
> >
> > Has anyone used the systems/racks/appliances/etc from raidzone.com?
> > If you believe their site, it certainly looks like a good possibility.
>
> Yes.
>
> It's pricey. Not much cheaper than a SCSI chassis. You only save money
> on the drives.

Interesting... The 100GB Internal RAID-5 SmartCan I purchased from
RaidZone was approx. $5k.  The quotes I got for a SCSI equivalent ranged
from $10k to $15K.  Personally I consider half the cost significantly
cheaper.  I also was quite impressed with a quote for a 1TB rackmount
system in the $50K range; again, SCSI equivalents were significantly
higher...

> Performance is ok. Has a few other problems - you're stuck with the
> kernels they support; the raid code is NOT open sourced.

Performance is pretty good - these numbers are for a first generation
smartcan (spring '99):

              ---Sequential Output--- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block---  --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec  %CPU    /sec %CPU
raidzone  100  6923 89.7 25987 26.6 14230 28.9  7297 89.4 215121 77.7 16407.3 69.7
raidzone  200  6537 86.2 22175 21.5 14297 30.2  7667 92.5  56355 36.0   377.5  3.1

Softraid  100  6598 86.0 43411 36.5 12077 27.4  6180 77.9  54022 46.4   721.4  4.1
Softraid  200  8337 87.9 25373 24.0  9009 18.8  8952 87.1  34413 21.7   301.1  2.2

The two sets of numbers were measured on the same computer & hardware
setup (500Mhz PIII w/ 128MB, 100GB SmartCan w/ 5 24GB IBM drives).
"raidzone" is using RaidZone's most recent pre-release version of their
Linux software (BIOS upgrades & all).  "Softraid" was based on an early
alpha release of RaidZone's linux support which basically allowed you to
access the individual drives; RAID was handled by the software RAID
support available under RedHat Linux 6.0 & 6.1.  Both were set up as
RAID-5.

Using "top":
 - With "Softraid", bonnie and the md Raid-5 software were sharing the
   cpu equally
 - With "raidzone", bonnie was consuming most (>85%) of the cpu and no
   other processes, and "system" < 15%

Getting back to the discussion of Hardware vs. Software raid...
Can someone say *definitively* *where* the raid-5 code is being run on a
*current* Raidzone product?  Originally, it was an "md" process running
on the system cpu.  Currently I'm not so sure.  The SmartCan *does* have
its own BIOS, so there is *some* intelligence there, but what exactly is
the division of responsibility here...

John
-- 
John Burton, Ph.D.                  Senior Associate
GATS, Inc.                          [EMAIL PROTECTED]
11864 Canon Blvd - Suite 101        [EMAIL PROTECTED] (personal)
Newport News, VA 23606              (757) 873-5920 (voice)
                                    (757) 873-5924 (fax)
Re: tiotest
On Mon, 10 Jan 2000, James Manning wrote:
> [ Monday, January 10, 2000 ] Dietmar Goldbeck wrote:
> > On Mon, Nov 29, 1999 at 04:20:45PM -0500, James Manning wrote:
> > > tiotest is a nice start to what I would like to see: a replacement
> > > for bonnie... While stripping out the character-based stuff from
> > > bonnie would bring it closer to what I'd like to see, threading
> > > would be a bit of a pain so starting with tiotest as a base might
> > > not be a bad idea if people are willing to help out...
> >
> > Is tiotest Open Source?
>
> Yes, under the GNU GPL
>
> > Can you give me an URL please
>
> http://www.icon.fi/~mak/tiotest/

That works, but for the future: if you use http://www.iki.fi/miku/tiotest,
you will always get redirected to the correct place.

-- 
Mika <[EMAIL PROTECTED]>
Re: large ide raid system
Jan Edler wrote:

> On Mon, Jan 10, 2000 at 12:49:29PM -0800, Dan Hollis wrote:
> > On Mon, 10 Jan 2000, Jan Edler wrote:
> > > - Performance is really horrible if you use IDE slaves.
> > >   Even though you say you aren't performance-sensitive, I'd
> > >   recommend against it if possible.
> >
> > My tests indicate UDMA performs favorably with ultrascsi, at about 1/6 the
> > cost. Cost is often a big factor.
>
> I wasn't advising against IDE, only against the use of slaves.
> With UDMA-33 or -66, masters work quite well,
> if you can deal with the other constraints that I mentioned
> (cable length, PCI slots, etc).

Do you have any numbers handy ?

Will the performance of a master/slave setup be at least HALF of the
master-only setup ?

For some apps cost is really important, and software IDE RAID has a very
low price/Megabyte.  If the app doesn't need killer performance, then I
think it is the best solution.

Now if we only had soft-RAID + journaled FS + power-failure safety right
now...

cheers,
Benno.
Re: Swapping Drives on RAID?
On Mon, Jan 10, 2000 at 11:16:27AM -0700, Scott Patten wrote:
> 1 - I have a raid1 consisting of 2 drives.  For strange
> historical reasons one is SCSI and the other IDE.  Although
> the IDE is fairly fast the SCSI is much faster and since I
> now have another SCSI drive to add, I would like to replace
> the IDE with the SCSI.  Can I unplug the IDE drive, run in
> degraded mode, edit the raid.conf and somehow mkraid
> without losing data or do I need to restore from tape.
> BTW, I'm using 2.2.13ac1.

I assume you configured your raid to "auto-start", i.e. you mkraid'ed it
with persistent_superblock set to 1, and set all the partition types to
0xfd.  If not, please tell me so and we'll work out what you have to do.

But in the case of auto-starting raid, it's quite easy, but a bit lengthy
to explain:

* halt your computer
* remove the IDE drive (keep it in a safe place, in case I screwed up :) )
* attach the second SCSI drive
* boot.  Your /dev/md devices should come up fully useable, but in
  degraded mode
* partition the second SCSI drive exactly like the first one (one way to
  do this is sketched after this mail)
* for each md device, "raidhotadd" the new disk to it.  Assuming you have
  /dev/md5 that consisted of /dev/sda5 and /dev/hda5 (which is now
  removed), and you partitioned /dev/sdb like /dev/sda, do

  raidhotadd /dev/md5 /dev/sdb5

  It's actually quite simple, but difficult to explain.
* check /proc/mdstat.  It should show that the devices are being resynced
* if you are finished, and everything works, be sure to change
  /etc/raidtab to reflect your new settings.

> 2 - Which is better, 2.2.13ac3 or a patched 2.2.14?  Will
> there be a 2.2.14ac series?  Is there a place besides this
> list with this kind of information?

I've been using a patched 2.2.14 for some days without any problems.
Only Alan Cox knows if there will be a 2.2.14ac.  He usually writes his
intentions into his diary, http://www.linux.org.uk/diary/

-- 
Andreas Trottmann <[EMAIL PROTECTED]>
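One way to handle the "partition the second SCSI drive exactly like the
first one" step, assuming both disks are the same size and sfdisk is
installed (device names as in Andreas's example above):

    # dump sda's partition table and replay it onto sdb
    sfdisk -d /dev/sda | sfdisk /dev/sdb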
Re: RedHat 6.1
Tim Niemueller writes:
> I will get a new computer in some days and I want to build up an array.
> I will use a derivate of RedHat Linux 6.1 (Halloween 4). There is RAID
> support in the graphical installation tool, so I think the RAID patches
> are already attached to the kernel.

Yes, just like the knfs-patches and some other stuff.

> Any hints what I must change, any settings to do? If I compile my own
> kernel with the supplied kernel source, will this kernel support RAID
> and can I use it without any changes and can I install the RAID tools
> from RPM?

The partitioning tool is (IMHO) a bit confusing - you'll have to define
first partitions of the type "Linux-RAID", which you'll then have to
combine with the "make RAID device" button.  Don't let yourself be
confused by the fact that the partition numbers change every time you add
a partition...

The default kernel options are set to include all RAID-stuff, so this is
no problem - I've found that it usually isn't necessary to rebuild a
kernel on RH 5.x/6.x, unless you need special drivers or want to use a
newer kernel revision.

The raidtools are (of course) in the default install set.

Bye,
Jochen

-- 

# mgm ComputerSysteme und -Service GmbH
# Sophienstr. 26 / 70178 Stuttgart / Germany / Voice: +49.711.96683-5

The Internet treats censorship as a malfunction and routes around it.
   --John Perry Barlow
Re: large ide raid system
Dan Hollis wrote:
>
> On Mon, 10 Jan 2000, Jan Edler wrote:
>
> Cable length is not so much a pain as the number of cables. Of course with
> scsi you want multiple channels anyway for performance, so the situation
> is very similar to ide. A cable mess.

There's a (relatively) nice way to get around this, if you make your own
IDE cables (or are brave enough to cut some up).  If you cut the cable
lengthwise (no, don't cut the wires) between wires (don't break the
insulation on the wires themselves, just the connecting plastic) you can
get your cables to be 1/4 the normal width (up until you get to the
connector).  This also makes a big difference for airflow, since those
big, flat ribbon cables are really bad for that.

Greg
Re: optimising raid performance
[ Monday, January 10, 2000 ] [EMAIL PROTECTED] wrote:
> I've currently got a hardware raid system that I'm maxing out so
> any ideas on how to speed it up would be gratefully received.

Just some quick questions for additional info...

- What kinds of numbers are you getting for performance now?
- I'd check bonnie with a filesize of twice your RAM, and then get
  http://www.icon.fi/~mak/tiotest/tiotest-0.16.tar.gz
  and do a make and then ./tiobench.pl --threads 16
- Did you get a chance to benchmark raid 1+0 against 0+1?
- Of the 12 disks over 2 channels, which are in the raid0+1, which
  in the raid5, which spare? how are the drive packs configured?
- Is the card using its write cache? write-back or write-through?
- Do you have the latest firmware on the card?
- Which kernel are you using?
- What block size is the filesystem? Did you create with a -R param?
- What is your percentage of I/O operations that are writes?

> Since there is a relatively high proportion of writes a single raid5 set
> seems to be out. The next best thing looks like a mirror but which is going
> to be better performance wise, 6 mirror pairs striped together or mirroring
> 2 stripes of 6 discs?

IMO raid 1+0 for 2 stripes of 6 discs (better be around when a drive
goes, though, as that second failure will have about a 55% chance of
taking out the array :)

> Does the kernel get any scheduling benefit by seeing the discs and doing
> things in software? As you can see the machine has a very low cpu load
> so I'd quite happily trade some cpu for io throughput...

I'd really love to see you do a s/w raid 1 over 2 6-disk raid0's from
the card and check that performance-wise...  I believe putting the raid1
and raid0 logic on sep. processors could help, and worst case it'll
give a nice test case for any read-balancing patches floating around
(although you've noted that you are more write-intensive)

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development
Re: tiotest
[ Monday, January 10, 2000 ] Dietmar Goldbeck wrote:
> On Mon, Nov 29, 1999 at 04:20:45PM -0500, James Manning wrote:
> > tiotest is a nice start to what I would like to see: a replacement
> > for bonnie... While stripping out the character-based stuff from
> > bonnie would bring it closer to what I'd like to see, threading
> > would be a bit of a pain so starting with tiotest as a base might
> > not be a bad idea if people are willing to help out...
>
> Is tiotest Open Source?

Yes, under the GNU GPL

> Can you give me an URL please

http://www.icon.fi/~mak/tiotest/

I'm going to re-do the filesize stuff soon, but until then just try and
use the number of megabytes of RAM in your machine as the --size
parameter to tiobench.pl, once you "make" to build the tiotest program.

James
-- 
Miscellaneous Engineer --- IBM Netfinity Performance Development
Re: 2.2.14 + raid-2.2.14-B1 on PPC failing on bootup
It is possible that the problem is a result of the raid code not being
PPC friendly where byte boundaries are concerned.

Open up linux/include/linux/raid/md_p.h.  At line 161 you should have
something resembling the following:

	__u32 sb_csum;		/*  6 checksum of the whole superblock        */
	__u64 events;		/*  7 number of superblock updates (64-bit!)  */
	__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 9];

Try swapping the __u32 sb_csum and __u64 events around so that it looks
like:

	__u64 events;		/*  7 number of superblock updates (64-bit!)  */
	__u32 sb_csum;		/*  6 checksum of the whole superblock        */
	__u32 gstate_sreserved[MD_SB_GENERIC_STATE_WORDS - 9];

This should fix the byte boundary problem that seems to cause a few
issues on PPC systems.  This problem and solution were previously
reported by Corey Minyard, who noted that the PPC is a bit more picky
about byte boundaries than the x86 architecture.

"Kevin M. Myer" wrote:

> Hi,
>
> I am running kernel 2.2.14 + Ingo's latest RAID patches on an Apple
> Network Server.  I have (had) a RAID 5 array with 5 4Gb Seagate drives in
> it working nicely with 2.2.11 and I had to do something silly, like
> upgrade the kernel so I can use the big LCD display on the front to
> display cute messages.
>
> Now, I seem to have a major problem - I can make the array fine.  I can
> create a filesystem fine.  I can start and stop the array fine.  But I
> can't reboot.  Once I reboot, the kernel loads until it reaches the raid
> detection.  It detects the five drives and identifies them as a RAID5
> array and then, endlessly, the following streams across my screen:
>
> <[dev 00:00]><[dev 00:00]><[dev 00:00]><[dev 00:00]><[dev 00:00]><[dev
> 00:00]><[dev 00:00]>
>
> ad infinitum and forever.
>
> I have no choice but to reboot with an old kernel, run mkraid on the whole
> array again, remake the file system and download the 5 Gigs of Linux and
> BSD software that I had mirrored.
>
> Can anyone tell me where to start looking for clues as to what's going
> on?  I'm using persistent superblocks and as far as I can tell, everything
> is getting updated when I shutdown the machine and reboot it.
> Unfortunately, the kernel never gets to the point where it can dump the
> stuff from dmesg into syslog, so I have no record of what it's actually
> stumbling over.
>
> Any ideas of what to try?  Need more information?
>
> Thanks,
>
> Kevin
>
> --
> ~Kevin M. Myer
> . .      Network/System Administrator
>  /V\     ELANCO School District
> // \\
>/(   )\
> ^`~'^