Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread Dan Williams

On 1/12/07, James Ralston [EMAIL PROTECTED] wrote:

On 2007-01-12 at 09:39-08 dean gaudet [EMAIL PROTECTED] wrote:

 On Thu, 11 Jan 2007, James Ralston wrote:

  I'm having a discussion with a coworker concerning the cost of
  md's raid5 implementation versus hardware raid5 implementations.
 
  Specifically, he states:
 
   The performance [of raid5 in hardware] is so much better with
   the write-back caching on the card and the offload of the
   parity, it seems to me that the minor increase in work of having
   to upgrade the firmware if there's a buggy one is a highly
   acceptable trade-off to the increased performance.  The md
   driver still commits you to longer run queues since IO calls to
   disk, parity calculator and the subsequent kflushd operations
   are non-interruptible in the CPU.  A RAID card with write-back
   cache releases the IO operation virtually instantaneously.
 
  It would seem that his comments have merit, as there appears to be
  work underway to move stripe operations outside of the spinlock:
 
  http://lwn.net/Articles/184102/
 
  What I'm curious about is this: for real-world situations, how
  much does this matter?  In other words, how hard do you have to
  push md raid5 before doing dedicated hardware raid5 becomes a real
  win?

 hardware with battery backed write cache is going to beat the
 software at small write traffic latency essentially all the time but
 it's got nothing to do with the parity computation.

I'm not convinced that's true.

No, it's true.  md implements a write-through cache to ensure that
data reaches the disk.


What my coworker is arguing is that the md raid5 code holds a spinlock
while it performs this sequence of operations:

1.  executing the write

not performed under the lock

2.  reading the blocks necessary for recalculating the parity

not performed under the lock

3.  recalculating the parity
4.  updating the parity block

My [admittedly cursory] read of the code, coupled with the link above,
leads me to believe that my coworker is correct, which is why I was
trolling for [informed] opinions about how much of a performance hit
the spinlock causes.


The spinlock is not a source of performance loss; the reason for
moving parity calculations outside the lock is to maximize the benefit
of using asynchronous xor+copy engines.
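
Incidentally, if you want to see what the parity math costs on your own
hardware: the xor code benchmarks the available routines when the raid
modules load and logs the winner, so something along these lines will
show it (exact log text varies by kernel):

  dmesg | grep -i xor
  dmesg | grep -i 'raid5: using function'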

The hardware vs software raid trade-offs are well documented here:
http://linux.yyz.us/why-software-raid.html

Regards,
Dan


Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread Bill Davidsen

Dan Williams wrote:

The hardware vs software raid trade-offs are well documented here:
http://linux.yyz.us/why-software-raid.html 


There have been several recent threads on the list regarding software 
RAID-5 performance. The reference might be updated to reflect the poor 
write performance of RAID-5 until/unless significant tuning is done. 
Read that as tuning obscure parameters and throwing a lot of memory into 
the stripe cache. The reasons for hardware RAID should include that 
RAID-5 write performance is usually much better than software RAID-5 
with default tuning.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: FailSpare event?

2007-01-13 Thread Nix
On 12 Jan 2007, Ernst Herzberg told this:
 Then, about every 60 seconds, 4 times:

 event=SpareActive
 mddev=/dev/md3

I see exactly this on both my RAID-5 arrays, neither of which have any
spare device --- nor have any active devices transitioned to spare
(which is what that event is actually supposed to mean).

mdadm-2.6 bug, I fear. I haven't tracked it down yet but will look
shortly: I can't afford to not run mdadm --monitor... odd, that
code hasn't changed during 2.6 development.
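
For reference, the monitor invocation I rely on is just something along
these lines (the address, delay and program path are only placeholders):

  mdadm --monitor --scan --daemonise --delay=60 \
        --mail=root@localhost --program=/usr/local/sbin/md-event

mdadm hands the event name and the md device (and sometimes a component
device) to the --program/--alert script as arguments.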

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-13 Thread Mike
On Fri, 12 Jan 2007, Neil Brown might have said:

 On Thursday January 11, [EMAIL PROTECTED] wrote:
  Can someone tell me what this means please? I just received this in
  an email from one of my servers:
  
 
 
  
  A FailSpare event had been detected on md device /dev/md2.
  
  It could be related to component device /dev/sde2.
 
 It means that mdadm has just noticed that /dev/sde2 is a spare and is faulty.
 
 You would normally expect this if the array is rebuilding a spare and
 a write to the spare fails however...
 
  
  md2 : active raid5 sdf2[4] sde2[5](F) sdd2[3] sdc2[2] sdb2[1] sda2[0]
  560732160 blocks level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
 
 That isn't the case here - your array doesn't need rebuilding.
 Possibly a superblock update failed.  Possibly mdadm only just started
 monitoring the array and the spare has been faulty for some time.
 
  
  Does the email message mean drive sde2[5] has failed? I know the sde2 refers
  to the second partition of /dev/sde. Here is the partition table
 
 It means that md thinks sde2 cannot be trusted.  To find out why you
 would need to look at kernel logs for IO errors.
 
  
  I have partition 2 of drive sde as one of the raid devices for md. Does the 
  (S)
  on sde3[2](S) mean the device is a spare for md1 and the same for md0?
  
 
 Yes, (S) means the device is spare.  You don't have (S) next to sde2
 on md2 because (F) (failed) overrides (S).
 You can tell by the position [5], that it isn't part of the array
 (being a 5 disk array, the active positions are 0,1,2,3,4).
 
 NeilBrown
 

I have cleared the error by:

# mdadm --manage /dev/md2 -f /dev/sde2
( make sure it has failed )
# mdadm --manage /dev/md2 -r /dev/sde2
( remove from the array )
# mdadm --manage /dev/md2 -a /dev/sde2
( add the device back to the array )
# mdadm --detail /dev/md2
( verify there are no faults and the array knows about the spare )
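
Before re-adding a drive that was kicked out, it is probably also worth a
quick check that it really is healthy, something like the following
(assuming smartmontools is installed):

# dmesg | grep -i sde
( look for I/O errors against the disk, as Neil suggested )
# smartctl -a /dev/sde
( check the drive's own error counters )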


Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread Robin Bowes
Bill Davidsen wrote:

 There have been several recent threads on the list regarding software
 RAID-5 performance. The reference might be updated to reflect the poor
 write performance of RAID-5 until/unless significant tuning is done.
 Read that as tuning obscure parameters and throwing a lot of memory into
 stripe cache. The reasons for hardware RAID should include performance
 of RAID-5 writes is usually much better than software RAID-5 with
 default tuning.

Could you point me at a source of documentation describing how to
perform such tuning?

Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X 8-port
SATA card configured as a single RAID6 array (~3TB available space).

Thanks,

R.



Re: FailSpare event?

2007-01-13 Thread Nix
On 13 Jan 2007, [EMAIL PROTECTED] spake thusly:

 On 12 Jan 2007, Ernst Herzberg told this:
 Then, about every 60 seconds, 4 times:

 event=SpareActive
 mddev=/dev/md3

 I see exactly this on both my RAID-5 arrays, neither of which have any
 spare device --- nor have any active devices transitioned to spare
 (which is what that event is actually supposed to mean).

Hm, the manual says that it means that a spare has transitioned to
active (which seems more likely). Perhaps the comment at line 82 of
Monitor.c is wrong, or I just don't understand what a `reverse
transition' is supposed to be.

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: FailSpare event?

2007-01-13 Thread Nix
On 13 Jan 2007, [EMAIL PROTECTED] uttered the following:

 On 12 Jan 2007, Ernst Herzberg told this:
 Then, about every 60 seconds, 4 times:

 event=SpareActive
 mddev=/dev/md3

 I see exactly this on both my RAID-5 arrays, neither of which have any
 spare device --- nor have any active devices transitioned to spare
 (which is what that event is actually supposed to mean).

One oddity has already come to light. My /proc/mdstat says

md2 : active raid5 sdb7[0] hda5[3] sda7[1]
  19631104 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md1 : active raid5 sda6[0] hdc5[3] sdb6[1]
  76807296 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/3] [UUU]

hda5 and hdc5 look odd. Indeed, --examine says

Number   Major   Minor   RaidDevice State
   0       8        6        0      active sync   /dev/sda6
   1       8       22        1      active sync   /dev/sdb6
   3      22        5        2      active sync   /dev/hdc5

Number   Major   Minor   RaidDevice State
   0       8       23        0      active sync   /dev/sdb7
   1       8        7        1      active sync   /dev/sda7
   3       3        5        2      active sync   /dev/hda5

0, 1, and *3*. Where has number 2 gone? (And how does `Number' differ
from `RaidDevice'? Why have both?)

-- 
`He accused the FSF of being something of a hypocrit, which
 shows that he neither understands hypocrisy nor can spell.'
   --- jimmybgood


Re: raid5 software vs hardware: parity calculations?

2007-01-13 Thread dean gaudet
On Sat, 13 Jan 2007, Robin Bowes wrote:

 Bill Davidsen wrote:
 
  There have been several recent threads on the list regarding software
  RAID-5 performance. The reference might be updated to reflect the poor
  write performance of RAID-5 until/unless significant tuning is done.
  Read that as tuning obscure parameters and throwing a lot of memory into
  stripe cache. The reasons for hardware RAID should include performance
  of RAID-5 writes is usually much better than software RAID-5 with
  default tuning.
 
 Could you point me at a source of documentation describing how to
 perform such tuning?
 
 Specifically, I have 8x500GB WD SATA drives on a Supermicro PCI-X 8-port
 SATA card configured as a single RAID6 array (~3TB available space)

linux sw raid6 small write performance is bad because it reads the entire 
stripe, merges the small write, and writes back the changed disks.  
unlike raid5 where a small write can get away with a partial stripe read 
(i.e. the smallest raid5 write will read the target disk, read the parity, 
write the target, and write the updated parity)... afaik this optimization 
hasn't been implemented in raid6 yet.
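
to put very rough numbers on it for an 8-drive raid6 (6 data + 2 parity 
chunks per stripe), a single-chunk update looks something like:

  raid5 read-modify-write:
      new_parity = old_parity xor old_data xor new_data
      -- 2 chunk reads (old data, old parity) + 2 chunk writes
  raid6 today:
      read the rest of the stripe, recompute P and Q over all the data
      -- roughly 5 chunk reads + 3 chunk writes (data, P, Q)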

depending on your use model you might want to go with raid5+spare.  
benchmark if you're not sure.

for raid5/6 i always recommend experimenting with moving your fs journal 
to a raid1 device instead (on separate spindles -- such as your root 
disks).
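
with ext3 that's roughly the following (untested here, check the man pages 
first -- the device names are only examples, and the filesystem has to be 
unmounted and clean while you switch journals):

  mke2fs -O journal_dev /dev/md0          # raid1 becomes an external journal device
  tune2fs -O ^has_journal /dev/md3        # drop the internal journal from the array's fs
  tune2fs -j -J device=/dev/md0 /dev/md3  # re-attach the journal on the raid1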

if this is for a database or fs requiring lots of small writes then 
raid5/6 are generally a mistake... raid10 is the only way to get 
performance.  (hw raid5/6 with nvram support can help a bit in this area, 
but you just can't beat raid10 if you need lots of writes/s.)

beyond those config choices you'll want to become friendly with /sys/block 
and the myriad of subdirectories and options under there.

in particular:

/sys/block/*/queue/scheduler
/sys/block/*/queue/read_ahead_kb
/sys/block/*/queue/nr_requests
/sys/block/mdX/md/stripe_cache_size

for * = any of the component disks or the mdX itself...

some systems have an /etc/sysfs.conf you can place these settings in to 
have them take effect on reboot.  (sysfsutils package on debuntu)
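
for example, something like the following -- the values are nothing more 
than starting points, so benchmark your own workload before settling on 
anything:

  # per component disk (repeat for each member of the array)
  echo deadline > /sys/block/sdb/queue/scheduler
  echo 256 > /sys/block/sdb/queue/nr_requests
  # on the md device itself
  echo 4096 > /sys/block/md0/md/stripe_cache_size
  echo 8192 > /sys/block/md0/queue/read_ahead_kb

or, in /etc/sysfs.conf form:

  block/sdb/queue/scheduler = deadline
  block/md0/md/stripe_cache_size = 4096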

-dean


Re: Linux Software RAID 5 Performance Optimizations: 2.6.19.1: (211MB/s read 195MB/s write)

2007-01-13 Thread Justin Piszcz


On Sat, 13 Jan 2007, Al Boldi wrote:

 Justin Piszcz wrote:
  On Sat, 13 Jan 2007, Al Boldi wrote:
   Justin Piszcz wrote:
Btw, max_sectors_kb did improve my performance a little bit, but
stripe_cache+read_ahead were the main optimizations that made
everything go faster by about 1.5x.  I have individual bonnie++
benchmarks of [only] the max_sectors_kb tests as well; it improved the
times from about 8 min per bonnie run to 7 min 11 seconds or so. See
below, and then after that is what you requested.
  
   Can you repeat with /dev/sda only?
 
For sda -- (it's only a 74GB raptor) -- but ok.
 
 Do you get the same results for the 150GB-raptor on sd{e,g,i,k}?
 
  # uptime
   16:25:38 up 1 min,  3 users,  load average: 0.23, 0.14, 0.05
  # cat /sys/block/sda/queue/max_sectors_kb
  512
  # echo 3 > /proc/sys/vm/drop_caches
  # dd if=/dev/sda of=/dev/null bs=1M count=10240
  10240+0 records in
  10240+0 records out
  10737418240 bytes (11 GB) copied, 150.891 seconds, 71.2 MB/s
  # echo 192 > /sys/block/sda/queue/max_sectors_kb
  # echo 3 > /proc/sys/vm/drop_caches
  # dd if=/dev/sda of=/dev/null bs=1M count=10240
  10240+0 records in
  10240+0 records out
  10737418240 bytes (11 GB) copied, 150.192 seconds, 71.5 MB/s
  # echo 128 > /sys/block/sda/queue/max_sectors_kb
  # echo 3 > /proc/sys/vm/drop_caches
  # dd if=/dev/sda of=/dev/null bs=1M count=10240
  10240+0 records in
  10240+0 records out
  10737418240 bytes (11 GB) copied, 150.15 seconds, 71.5 MB/s
 
 
  Does this show anything useful?
 
 Probably a latency issue.  md is highly latency sensitive.
 
 What CPU type/speed do you have?  Bootlog/dmesg?
 
 
 Thanks!
 
 --
 Al
 
 

 What CPU type/speed do you have?  Bootlog/dmesg?
Core 2 Duo E6300

The speed is great since I have tweaked the various settings.
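
If anyone wants to reproduce the comparison, the same drop_caches + dd
approach as above works against the array itself; roughly (md3, the test
file path and the values are just examples):

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/zero of=/mnt/array/testfile bs=1M count=10240 conv=fdatasync
# echo 8192 > /sys/block/md3/md/stripe_cache_size
# echo 16384 > /sys/block/md3/queue/read_ahead_kb
# echo 3 > /proc/sys/vm/drop_caches
# dd if=/dev/zero of=/mnt/array/testfile bs=1M count=10240 conv=fdatasync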