Re: raid5 software vs hardware: parity calculations?
On 1/12/07, James Ralston [EMAIL PROTECTED] wrote:
> On 2007-01-12 at 09:39-08 dean gaudet [EMAIL PROTECTED] wrote:
> > On Thu, 11 Jan 2007, James Ralston wrote:
> > > I'm having a discussion with a coworker concerning the cost of md's
> > > raid5 implementation versus hardware raid5 implementations.
> > > Specifically, he states:
> > >
> > > > The performance [of raid5 in hardware] is so much better with the
> > > > write-back caching on the card and the offload of the parity, it
> > > > seems to me that the minor increase in work of having to upgrade
> > > > the firmware if there's a buggy one is a highly acceptable
> > > > trade-off to the increased performance. The md driver still
> > > > commits you to longer run queues since IO calls to disk, parity
> > > > calculator and the subsequent kflushd operations are
> > > > non-interruptible in the CPU. A RAID card with write-back cache
> > > > releases the IO operation virtually instantaneously.
> > >
> > > It would seem that his comments have merit, as there appears to be
> > > work underway to move stripe operations outside of the spinlock:
> > > http://lwn.net/Articles/184102/
> > >
> > > What I'm curious about is this: for real-world situations, how much
> > > does this matter? In other words, how hard do you have to push md
> > > raid5 before doing dedicated hardware raid5 becomes a real win?
> >
> > hardware with battery backed write cache is going to beat the
> > software at small write traffic latency essentially all the time but
> > it's got nothing to do with the parity computation.
>
> I'm not convinced that's true.

No, it's true. md implements a write-through cache to ensure that data
reaches the disk.

> What my coworker is arguing is that md raid5 code spinlocks while it
> is performing this sequence of operations:
>
> 1. executing the write

[not performed under the lock]

> 2. reading the blocks necessary for recalculating the parity

[not performed under the lock]

> 3. recalculating the parity
> 4. updating the parity block
>
> My [admittedly cursory] read of the code, coupled with the link above,
> leads me to believe that my coworker is correct, which is why I was
> trolling for [informed] opinions about how much of a performance hit
> the spinlock causes.

The spinlock is not a source of performance loss; the reason for moving
parity calculations outside the lock is to maximize the benefit of using
asynchronous xor+copy engines.

The hardware vs software raid trade-offs are well documented here:
http://linux.yyz.us/why-software-raid.html

Regards,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
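For reference, the parity computation being discussed is a plain XOR; a
minimal sketch of the read-modify-write parity update for a single byte
(the byte values are illustrative, not taken from md's code, which
applies the same XOR across whole blocks):

```shell
# RAID-5 read-modify-write parity update for one byte.
#   new_parity = old_parity XOR old_data XOR new_data
# Values are illustrative placeholders.
old_data=$((0x5A)); new_data=$((0x3C)); old_parity=$((0xF0))
new_parity=$(( old_parity ^ old_data ^ new_data ))
printf 'new parity: 0x%02X\n' "$new_parity"   # prints: new parity: 0x96
```

The XOR itself is cheap; as Dan notes above, the motivation for moving
it outside the lock is asynchronous offload engines, not CPU cost.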
Re: raid5 software vs hardware: parity calculations?
Dan Williams wrote:
> [...]
>
> The spinlock is not a source of performance loss; the reason for
> moving parity calculations outside the lock is to maximize the benefit
> of using asynchronous xor+copy engines. The hardware vs software raid
> trade-offs are well documented here:
> http://linux.yyz.us/why-software-raid.html

There have been several recent threads on the list regarding software
RAID-5 performance. The reference might be updated to reflect the poor
write performance of RAID-5 until/unless significant tuning is done.
Read that as tuning obscure parameters and throwing a lot of memory into
the stripe cache. The reasons for hardware RAID should include
"performance of RAID-5 writes is usually much better than software
RAID-5 with default tuning."

--
bill davidsen [EMAIL PROTECTED]
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Re: FailSpare event?
On 12 Jan 2007, Ernst Herzberg told this:
> Then every about 60 sec, 4 times:
> event=SpareActive mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which has any
spare device --- nor have any active devices transitioned to spare
(which is what that event is actually supposed to mean). mdadm-2.6 bug,
I fear. I haven't tracked it down yet but will look shortly: I can't
afford not to run mdadm --monitor... odd, that code hasn't changed
during 2.6 development.

--
`He accused the FSF of being something of a hypocrit, which shows that
he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: FailSpare event?
On Fri, 12 Jan 2007, Neil Brown might have said:
> On Thursday January 11, [EMAIL PROTECTED] wrote:
> > Can someone tell me what this means please? I just received this in
> > an email from one of my servers:
> >
> > A FailSpare event had been detected on md device /dev/md2. It could
> > be related to component device /dev/sde2.
>
> It means that mdadm has just noticed that /dev/sde2 is a spare and is
> faulty. You would normally expect this if the array is rebuilding a
> spare and a write to the spare fails, however...
>
> > md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
> >       560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [U]
>
> That isn't the case here - your array doesn't need rebuilding.
> Possibly a superblock update failed. Possibly mdadm only just started
> monitoring the array and the spare has been faulty for some time.
>
> > Does the email message mean drive sde2[5] has failed? I know the
> > sde2 refers to the second partition of /dev/sde. Here is the
> > partition table
>
> It means that md thinks sde2 cannot be trusted. To find out why, you
> would need to look at the kernel logs for IO errors.
>
> > I have partition 2 of drive sde as one of the raid devices for md.
> > Does the (S) on sde3[2](S) mean the device is a spare for md1 and
> > the same for md0?
>
> Yes, (S) means the device is a spare. You don't have (S) next to sde2
> on md2 because (F) (failed) overrides (S). You can tell by the
> position [5] that it isn't part of the array (being a 5-disk array,
> the active positions are 0, 1, 2, 3, and 4).
>
> NeilBrown

I have cleared the error by:

# mdadm --manage /dev/md2 -f /dev/sde2   (make sure it has failed)
# mdadm --manage /dev/md2 -r /dev/sde2   (remove from the array)
# mdadm --manage /dev/md2 -a /dev/sde2   (add the device back to the array)
# mdadm --detail /dev/md2                (verify there are no faults and
                                          the array knows about the spare)
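Neil's advice to check the kernel logs for IO errors can be followed
along these lines (a sketch; the device name sde and the match patterns
are assumptions, and reading the ring buffer may require root):

```shell
# Scan the kernel ring buffer for I/O errors mentioning sde.
# Device name and error patterns are illustrative, not exhaustive.
dmesg | grep -i 'sde' | grep -iE 'error|fail'
```

The same patterns can be run over /var/log/kern.log or equivalent if the
errors have already scrolled out of the ring buffer.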
Re: raid5 software vs hardware: parity calculations?
Bill Davidsen wrote:
> There have been several recent threads on the list regarding software
> RAID-5 performance. The reference might be updated to reflect the poor
> write performance of RAID-5 until/unless significant tuning is done.
> Read that as tuning obscure parameters and throwing a lot of memory
> into the stripe cache. The reasons for hardware RAID should include
> "performance of RAID-5 writes is usually much better than software
> RAID-5 with default tuning."

Could you point me at a source of documentation describing how to
perform such tuning? Specifically, I have 8x500GB WD SATA drives on a
Supermicro PCI-X 8-port SATA card configured as a single RAID6 array
(~3TB available space).

Thanks,
R.
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] spake thusly:
> On 12 Jan 2007, Ernst Herzberg told this:
> > Then every about 60 sec, 4 times:
> > event=SpareActive mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to
active (which seems more likely). Perhaps the comment at line 82 of
Monitor.c is wrong, or I just don't understand what a `reverse
transition' is supposed to be.
Re: FailSpare event?
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:
> On 12 Jan 2007, Ernst Herzberg told this:
> > Then every about 60 sec, 4 times:
> > event=SpareActive mddev=/dev/md3
>
> I see exactly this on both my RAID-5 arrays, neither of which has any
> spare device --- nor have any active devices transitioned to spare
> (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says:

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
      19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]
md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
      76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says:

   Number   Major   Minor   RaidDevice State
      0       8        6        0      active sync   /dev/sda6
      1       8       22        1      active sync   /dev/sdb6
      3      22        5        2      active sync   /dev/hdc5

   Number   Major   Minor   RaidDevice State
      0       8       23        0      active sync   /dev/sdb7
      1       8        7        1      active sync   /dev/sda7
      3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
from `RaidDevice'? Why have both?)

--
`He accused the FSF of being something of a hypocrit, which shows that
he neither understands hypocrisy nor can spell.' --- jimmybgood
Re: raid5 software vs hardware: parity calculations?
On Sat, 13 Jan 2007, Robin Bowes wrote:
> Could you point me at a source of documentation describing how to
> perform such tuning? Specifically, I have 8x500GB WD SATA drives on a
> Supermicro PCI-X 8-port SATA card configured as a single RAID6 array
> (~3TB available space)

linux sw raid6 small write performance is bad because it reads the
entire stripe, merges the small write, and writes back the changed
disks. unlike raid5, where a small write can get away with a partial
stripe read (i.e. the smallest raid5 write will read the target disk,
read the parity, write the target, and write the updated parity)...
afaik this optimization hasn't been implemented in raid6 yet.

depending on your use model you might want to go with raid5+spare.
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal
to a raid1 device instead (on separate spindles -- such as your root
disks). if this is for a database or fs requiring lots of small writes
then raid5/6 are generally a mistake... raid10 is the only way to get
performance. (hw raid5/6 with nvram support can help a bit in this
area, but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with
/sys/block and all the myriad subdirectories and options under there.
in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself...

some systems have an /etc/sysfs.conf you can place these settings in to
have them take effect on reboot. (sysfsutils package on debuntu)

-dean
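For the sysfs.conf route dean mentions, entries are attribute paths
relative to /sys with an `=` value; a sketch (the md device name, disk
names, and values here are illustrative assumptions, not
recommendations -- tune and benchmark per workload):

```
# /etc/sysfs.conf fragment (sysfsutils); paths are relative to /sys.
# Device names and values are placeholders for illustration.
block/md0/md/stripe_cache_size = 4096
block/sda/queue/scheduler = deadline
block/sda/queue/read_ahead_kb = 512
block/sda/queue/nr_requests = 128
```

The same settings can be applied immediately by echoing the values into
the corresponding /sys files as root.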
Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)
On Sat, 13 Jan 2007, Al Boldi wrote:
> Justin Piszcz wrote:
> > On Sat, 13 Jan 2007, Al Boldi wrote:
> > > Justin Piszcz wrote:
> > > > Btw, max_sectors_kb did improve my performance a little bit, but
> > > > stripe_cache+read_ahead were the main optimizations that made
> > > > everything go faster by about ~1.5x. I have individual bonnie++
> > > > benchmarks of [only] the max_sectors_kb tests as well; it
> > > > improved the times from 8min/bonnie run to 7min 11 seconds or
> > > > so, see below and then after that is what you requested.
> > >
> > > Can you repeat with /dev/sda only?
> >
> > For sda-- (is a 74GB raptor only)-- but ok.
>
> Do you get the same results for the 150GB-raptor on sd{e,g,i,k}?
>
> > # uptime
> >  16:25:38 up 1 min, 3 users, load average: 0.23, 0.14, 0.05
> > # cat /sys/block/sda/queue/max_sectors_kb
> > 512
> > # echo 3 > /proc/sys/vm/drop_caches
> > # dd if=/dev/sda of=/dev/null bs=1M count=10240
> > 10240+0 records in
> > 10240+0 records out
> > 10737418240 bytes (11 GB) copied, 150.891 seconds, 71.2 MB/s
> > # echo 192 > /sys/block/sda/queue/max_sectors_kb
> > # echo 3 > /proc/sys/vm/drop_caches
> > # dd if=/dev/sda of=/dev/null bs=1M count=10240
> > 10240+0 records in
> > 10240+0 records out
> > 10737418240 bytes (11 GB) copied, 150.192 seconds, 71.5 MB/s
> > # echo 128 > /sys/block/sda/queue/max_sectors_kb
> > # echo 3 > /proc/sys/vm/drop_caches
> > # dd if=/dev/sda of=/dev/null bs=1M count=10240
> > 10240+0 records in
> > 10240+0 records out
> > 10737418240 bytes (11 GB) copied, 150.15 seconds, 71.5 MB/s
> >
> > Does this show anything useful?
>
> Probably a latency issue. md is highly latency sensitive. What CPU
> type/speed do you have? Bootlog/dmesg?
>
> Thanks!
>
> --
> Al

> What CPU type/speed do you have? Bootlog/dmesg?

Core 2 Duo E6300. The speed is great since I have tweaked the various
settings.