Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-10-10 Thread Neil Brown
[dropped akpm from the Cc: as current discussion isn't directly
relevant to him]
On Tuesday October 10, [EMAIL PROTECTED] wrote:
> On 10/8/06, Neil Brown <[EMAIL PROTECTED]> wrote:
> 
> > Is there something really important I have missed?
> No, nothing important jumps out.  Just a follow up question/note about
> the details.
> 
> You imply that the async path and the sync path are unified in this
> implementation.  I think it is doable, but it will add some complexity
> since the sync case is not simply a subset of the async case.  For
> example "Clear a target cache block" is required for the sync case,
> but it can go away when using hardware engines.  Engines typically
> have their own accumulator buffer to store the temporary result,
> whereas software only operates on memory.
> 
> What do you think of adding async tests for these situations?
> test_bit(XOR, &conf->async)
> 
> Where a flag is set if calls to async_ routines may be routed to a
> hardware engine?  Otherwise skip any async-specific details.

I'd rather try to come up with an interface that was equally
appropriate to both offload and inline.  I appreciate that it might
not be possible to get an interface that gets best performance out of
both, but I'd like to explore that direction first.

I'd guess from what you say that the dma engine is given a bunch of
sources and a destination, and it xors all the sources together into
an accumulation buffer, and then writes the accum buffer to the
destination.  Would that be right?  Can you use the destination as one
of the sources?

That can obviously be done inline too with some changes to the xor
code, and avoiding the initial memset might be good for performance
too. 

So I would suggest we drop the memset idea, and define the async_xor
interface to xor a number of sources into a destination, where the
destination is allowed to be the same as the first source, but
doesn't need to be.
Then the inline version could use a memset followed by the current xor
operations, or could use newly written xor operations, and the offload
version could equally do whatever is appropriate.
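
In plain C, a newly written synchronous xor of that shape might look
roughly like the following sketch (illustrative names only, not code
from the patch set):

#include <stddef.h>
#include <string.h>

/*
 * Sketch: xor src_cnt source buffers into dest, where dest is allowed
 * to be the same buffer as srcs[0], so no preliminary memset is needed.
 */
static void xor_blocks_sync(unsigned long *dest, unsigned long **srcs,
			    int src_cnt, size_t len)
{
	size_t i, words = len / sizeof(unsigned long);
	int s;

	/* If dest is not already the first source, seed it with a copy
	 * instead of a memset plus an extra xor pass. */
	if (dest != srcs[0])
		memcpy(dest, srcs[0], len);

	for (s = 1; s < src_cnt; s++)
		for (i = 0; i < words; i++)
			dest[i] ^= srcs[s][i];
}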

Another place where combining operations might make sense is copy-in
and post-xor.  In some cases it might be more efficient to only read
the source once, and both write it to the destination and xor it into
the target.  Would your DMA engine be able to optimise this
combination?  I think current processors could certainly do better if
the two were combined.
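
For the non-offloaded case, the combined copy-in plus post-xor would be
a single pass over the data, something like this sketch (illustrative
names only):

#include <stddef.h>

/* Read each source word once, write it to the cache block and fold it
 * into the parity target in the same pass. */
static void copy_and_xor(unsigned long *cache_block, unsigned long *parity,
			 const unsigned long *bio_data, size_t len)
{
	size_t i, words = len / sizeof(unsigned long);

	for (i = 0; i < words; i++) {
		unsigned long v = bio_data[i];	/* single read of the source */
		cache_block[i] = v;		/* drain into the stripe cache */
		parity[i] ^= v;			/* accumulate into the target */
	}
}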

So there is definitely room to move, but I would rather avoid flags if
I could.

NeilBrown


Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-10-10 Thread Dan Williams

On 9/14/06, Jakob Oestergaard <[EMAIL PROTECTED]> wrote:

On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
...
> >Out of curiosity; how does accelerated compare to non-accelerated?
>
> One quick example:
> 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> utilization.
>
> With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> with the rest split between md0_raid5 and md0_raid5_ops.
>
> The sync speed reported by /proc/mdstat is ~40% higher in the accelerated
> case.

Ok, nice :)

>
> That being said, array resync is a special case, so your mileage may
> vary with other applications.

Every-day usage I/O performance data would be nice indeed :)

> I will put together some data from bonnie++, iozone, maybe contest,
> and post it on SourceForge.

Great!


I have posted some Iozone data and graphs showing the performance
impact of the patches across the three IOP processors: iop321, iop331,
and iop341.  The general takeaway from the data is that using dma
engines extends the region that Iozone calls the "buffer cache
effect".  Write performance benefited the most, as expected, but read
performance showed some modest gains as well.  There are some regions
(smaller file sizes and record lengths) that show a performance
disadvantage, but it is typically less than 5%.

The graphs map the relative performance multiplier that the raid
patches generate ('2.6.18-rc6 performance' x 'performance multiplier'
= '2.6.18-rc6-raid performance').  A value of '1' designates equal
performance.  The large cliff that drops to zero is a "not measured"
region, i.e. the record length is larger than the file size.  Iozone
outputs to Excel, but I have also made PDFs of the graphs available.
Note: OpenOffice Calc can view the data, but it does not support the 3D
surface graphs that Iozone uses.

Excel:
http://prdownloads.sourceforge.net/xscaleiop/iozone_raid_accel.xls?download

PDF Graphs:
http://prdownloads.sourceforge.net/xscaleiop/iop-iozone-graphs-20061010.tar.bz2?download

Regards,
Dan


Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-10-10 Thread Dan Williams

On 10/8/06, Neil Brown <[EMAIL PROTECTED]> wrote:



On Monday September 11, [EMAIL PROTECTED] wrote:
> Neil,
>
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale(r) series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.

Hi Dan,
 Sorry for the delay in replying.
 I've looked through these patches at last (mostly the raid-specific
 bits) and while there is clearly a lot of good stuff here, it doesn't
 quite 'feel' right - it just seems too complex.

 The particular issues that stand out to me are:
   - 33 new STRIPE_OP_* flags.  I'm sure there don't need to be that
  many new flags.
   - the "raid5 dma client" patch moves far too much internal
 knowledge about raid5 into drivers/dma.

 Clearly there are some complex issues being dealt with and some
 complexity is to be expected, but I feel there must be room for some
 serious simplification.

A valid criticism.  There was definitely a push to just get it
functional, so I can now see how the complexity crept into the
implementation.  The primary cause was the choice to explicitly handle
channel switching in raid5-dma.  However, relieving "client" code of
this responsibility is something I am taking care of in the async API
changes.



 Let me try to describe how I envisage it might work.

 As you know, the theory-of-operation of handle_stripe is that it
 assesses the state of a stripe deciding what actions to perform and
 then performs them.  Synchronous actions (e.g. current parity calcs)
 are performed 'in-line'.  Async actions (reads, writes) and actions
 that cannot be performed under a spinlock (->b_end_io) are recorded
 as being needed and then are initiated at the end of handle_stripe
 outside of the sh->lock.

 The proposal is to bring the parity and other bulk-memory operations
 out of the spinlock and make them optionally asynchronous.

 The set of tasks that might be needed to be performed on a stripe
 are:
Clear a target cache block
pre-xor various cache blocks into a target
copy data out of bios into cache blocks. (drain)
post-xor various cache blocks into a target
copy data into bios out of cache blocks (fill)
test if a cache block is all zeros
start a read on a cache block
start a write on a cache block

 (There is also a memcpy when expanding raid5.  I think I would try to
  simply avoid that copy and move pointers around instead).

 Some of these steps require sequencing. e.g.
   clear, pre-xor, copy, post-xor, write
 for an rmw (read-modify-write) cycle.
 We could require handle_stripe to be called again for each step.
 i.e. first call just clears the target and flags it as clear.  Next
 call initiates the pre-xor and flags that as done.  Etc.  However I
 think that would make the non-offloaded case too slow, or at least
 too clumsy.

 So instead we set flags to say what needs to be done and have a
 workqueue system that does it.

 (so far this is all quite similar to what you have done.)

 So handle_stripe would set various flag and other things (like
 identify which block was the 'target' block) and run the following
 in a workqueue:

raid5_do_stuff(struct stripe_head *sh)
{
	raid5_conf_t *conf = sh->raid_conf;

	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		struct page *p = sh->dev[sh->ops.target].page;

		rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
		if (rv != BUSY)
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		if (rv != COMPLETE)
			goto out;
	}

	while (test_bit(PRE_XOR, &sh->ops.pending)) {
		struct page *plist[XOR_MAX];
		int offset[XOR_MAX];
		int pos = 0;
		int d;

		for (d = sh->ops.nextdev;
		     d < conf->raid_disks && pos < XOR_MAX;
		     d++) {
			if (d == sh->ops.target)
				continue;
			if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
				continue;
			plist[pos] = sh->dev[d].page;
			offset[pos++] = 0;
		}
		if (pos) {
			struct page *p = sh->dev[sh->ops.target].page;

			rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
				       ops_done, sh);
			if (rv != BUSY)
				sh->ops.nextdev = d;
			if (rv != COMPLETE)
				goto out;
		} else {
			clear_bit(PRE_XOR, &sh->ops.pending);
			sh->ops.nextdev = 0;
		}

Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-10-08 Thread Neil Brown


On Monday September 11, [EMAIL PROTECTED] wrote:
> Neil,
> 
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale® series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.

Hi Dan,
 Sorry for the delay in replying.
 I've looked through these patches at last (mostly the raid-specific
 bits) and while there is clearly a lot of good stuff here, it doesn't
 quite 'feel' right - it just seems too complex.

 The particular issues that stand out to me are:
   - 33 new STRIPE_OP_* flags.  I'm sure there don't need to be that
  many new flags.
   - the "raid5 dma client" patch moves far too much internal
 knowledge about raid5 into drivers/dma.

 Clearly there are some complex issues being dealt with and some
 complexity is to be expected, but I feel there must be room for some
 serious simplification.

 Let me try to describe how I envisage it might work.

 As you know, the theory-of-operation of handle_stripe is that it
 assesses the state of a stripe deciding what actions to perform and
 then performs them.  Synchronous actions (e.g. current parity calcs)
 are performed 'in-line'.  Async actions (reads, writes) and actions
 that cannot be performed under a spinlock (->b_end_io) are recorded
 as being needed and then are initiated at the end of handle_stripe
 outside of the sh->lock.

 The proposal is to bring the parity and other bulk-memory operations
 out of the spinlock and make them optionally asynchronous.

 The set of tasks that might be needed to be performed on a stripe
 are:
Clear a target cache block
pre-xor various cache blocks into a target
copy data out of bios into cache blocks. (drain)
post-xor various cache blocks into a target
copy data into bios out of cache blocks (fill)
test if a cache block is all zeros
start a read on a cache block
start a write on a cache block

 (There is also a memcpy when expanding raid5.  I think I would try to
  simply avoid that copy and move pointers around instead).

 Some of these steps require sequencing. e.g.
   clear, pre-xor, copy, post-xor, write
 for an rmw (read-modify-write) cycle.
 We could require handle_stripe to be called again for each step.
 i.e. first call just clears the target and flags it as clear.  Next
 call initiates the pre-xor and flags that as done.  Etc.  However I
 think that would make the non-offloaded case too slow, or at least
 too clumsy.

 So instead we set flags to say what needs to be done and have a
 workqueue system that does it.
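
 For concreteness, the pending flags might be declared along these
 lines (the names are only illustrative, chosen to match the sketch
 further below):

enum {
	CLEAR_TARGET,	/* zero the target cache block */
	PRE_XOR,	/* xor old data blocks into the target */
	COPY_IN,	/* drain bio data into cache blocks */
	POST_XOR,	/* xor new data blocks into the target */
	COPY_OUT,	/* fill bios from cache blocks */
	CHECK_ZERO,	/* test whether the target is all zeros */
	START_IO,	/* issue the reads/writes */
};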

 (so far this is all quite similar to what you have done.)

 So handle_stripe would set various flag and other things (like
 identify which block was the 'target' block) and run the following
 in a workqueue:

raid5_do_stuff(struct stripe_head *sh)
{
	raid5_conf_t *conf = sh->raid_conf;

	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		struct page *p = sh->dev[sh->ops.target].page;

		rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
		if (rv != BUSY)
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		if (rv != COMPLETE)
			goto out;
	}

	while (test_bit(PRE_XOR, &sh->ops.pending)) {
		struct page *plist[XOR_MAX];
		int offset[XOR_MAX];
		int pos = 0;
		int d;

		for (d = sh->ops.nextdev;
		     d < conf->raid_disks && pos < XOR_MAX;
		     d++) {
			if (d == sh->ops.target)
				continue;
			if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
				continue;
			plist[pos] = sh->dev[d].page;
			offset[pos++] = 0;
		}
		if (pos) {
			struct page *p = sh->dev[sh->ops.target].page;

			rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
				       ops_done, sh);
			if (rv != BUSY)
				sh->ops.nextdev = d;
			if (rv != COMPLETE)
				goto out;
		} else {
			clear_bit(PRE_XOR, &sh->ops.pending);
			sh->ops.nextdev = 0;
		}
	}

	while (test_bit(COPY_IN, &sh->ops.pending)) {
		...
	}

	if (test_bit(START_IO, &sh->ops.pending)) {
		int d;

		for (d = 0; d < conf->raid_disks; d++) {
			/* all that code from the end of handle_stripe */
		}
	}

	release_stripe(conf, sh);
	return;

 out:
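
The async_memset/async_xor calls in the sketch take an ops_done
callback; presumably it just puts the stripe back on the workqueue so
raid5_do_stuff() runs again and picks up the next pending step, roughly
like this (the sh->ops.work and conf->workqueue fields are assumptions,
not fields from the posted patches):

static void ops_done(void *data)
{
	struct stripe_head *sh = data;
	raid5_conf_t *conf = sh->raid_conf;

	/* re-queue the stripe so the state machine advances */
	queue_work(conf->workqueue, &sh->ops.work);
}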

Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-14 Thread Jakob Oestergaard
On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
...
> >Out of curiosity; how does accelerated compare to non-accelerated?
> 
> One quick example:
> 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> utilization.
> 
> With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> with the rest split between md0_raid5 and md0_raid5_ops.
> 
> The sync speed reported by /proc/mdstat is ~40% higher in the accelerated 
> case.

Ok, nice :)

> 
> That being said, array resync is a special case, so your mileage may
> vary with other applications.

Every-day usage I/O performance data would be nice indeed :)

> I will put together some data from bonnie++, iozone, maybe contest,
> and post it on SourceForge.

Great!

-- 

 / jakob



Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-13 Thread Dan Williams

On 9/13/06, Jakob Oestergaard <[EMAIL PROTECTED]> wrote:

On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> Neil,
>
...
>
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following.  For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously.  I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs.  For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Out of curiosity; how does accelerated compare to non-accelerated?


One quick example:
4-disk SATA array rebuild on iop321 without acceleration - 'top'
reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
utilization.

With acceleration - 'top' reports md0_resync cpu utilization at ~90%
with the rest split between md0_raid5 and md0_raid5_ops.

The sync speed reported by /proc/mdstat is ~40% higher in the accelerated case.

That being said, array resync is a special case, so your mileage may
vary with other applications.

I will put together some data from bonnie++, iozone, maybe contest,
and post it on SourceForge.




Dan


Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-13 Thread Jakob Oestergaard
On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> Neil,
> 
...
> 
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following.  For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously.  I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs.  For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Out of curiosity; how does accelerated compare to non-accelerated?

-- 

 / jakob



Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-12 Thread Jeff Garzik

Dan Williams wrote:

On 9/11/06, Jeff Garzik <[EMAIL PROTECTED]> wrote:

Dan Williams wrote:
> This is a frequently asked question; Alan Cox had the same one at OLS.
> The answer is "probably."  The only complication I currently see is
> where/how the stripe cache is maintained.  With the IOPs it's easy
> because the DMA engines operate directly on kernel memory.  With the
> Promise card I believe they have memory on the card and it's not clear
> to me if the XOR engines on the card can deal with host memory.  Also,
> MD would need to be modified to handle a stripe cache located on a
> device, or somehow synchronize its local cache with the card in a manner
> that is still able to beat software-only MD.

sata_sx4 operates through [standard PC] memory on the card, and you use
a DMA engine to copy memory to/from the card.

[select chipsets supported by] sata_promise operates directly on host
memory.

So, while sata_sx4 is farther away from your direct-host-memory model,
it also has much more potential for RAID acceleration:  ideally, RAID1
just copies data to the card once, then copies the data to multiple
drives from there.  Similarly with RAID5, you can eliminate copies and
offload XOR, presuming the drives are all connected to the same card.

In the sata_promise case it's straightforward: all that is needed is
dmaengine drivers for the xor and memcpy engines.  This would be
similar to the current I/OAT model where dma resources are provided by
a PCI function.  The sata_sx4 case would need a different flavor of
the dma_do_raid5_block_ops routine, one that understands where the
cache is located.  MD would also need the capability to bypass the
block layer since the data will have already been transferred to the
card by a stripe cache operation.

The RAID1 case gives me pause because it seems any work along these
lines requires that the implementation work for both MD and DM, which
then eventually leads to being tasked with merging the two.


RAID5 has similar properties.  If all devices in a RAID5 array are
attached to a single SX4 card, then a high-level write to the RAID5
array is passed directly to the card, which then performs XOR, striping,
etc.


Jeff





Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-11 Thread Dan Williams

On 9/11/06, Jeff Garzik <[EMAIL PROTECTED]> wrote:

Dan Williams wrote:
> This is a frequently asked question; Alan Cox had the same one at OLS.
> The answer is "probably."  The only complication I currently see is
> where/how the stripe cache is maintained.  With the IOPs it's easy
> because the DMA engines operate directly on kernel memory.  With the
> Promise card I believe they have memory on the card and it's not clear
> to me if the XOR engines on the card can deal with host memory.  Also,
> MD would need to be modified to handle a stripe cache located on a
> device, or somehow synchronize its local cache with the card in a manner
> that is still able to beat software-only MD.

sata_sx4 operates through [standard PC] memory on the card, and you use
a DMA engine to copy memory to/from the card.

[select chipsets supported by] sata_promise operates directly on host
memory.

So, while sata_sx4 is farther away from your direct-host-memory model,
it also has much more potential for RAID acceleration:  ideally, RAID1
just copies data to the card once, then copies the data to multiple
drives from there.  Similarly with RAID5, you can eliminate copies and
offload XOR, presuming the drives are all connected to the same card.

In the sata_promise case it's straightforward: all that is needed is
dmaengine drivers for the xor and memcpy engines.  This would be
similar to the current I/OAT model where dma resources are provided by
a PCI function.  The sata_sx4 case would need a different flavor of
the dma_do_raid5_block_ops routine, one that understands where the
cache is located.  MD would also need the capability to bypass the
block layer since the data will have already been transferred to the
card by a stripe cache operation.
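
As a rough sketch of the division of labour being described here (the
types and capability mask below are illustrative, not the actual
dmaengine interface): a driver for the Promise engines would register
channels tagged with the operations they support, and a client such as
raid5 would pick a channel by capability, falling back to software when
none is available.

enum dma_op_cap {
	DMA_CAP_MEMCPY = 1 << 0,
	DMA_CAP_XOR    = 1 << 1,
	DMA_CAP_MEMSET = 1 << 2,
};

struct dma_chan_sketch {
	unsigned int caps;		/* mask of enum dma_op_cap */
};

static int chan_suitable_for_raid5(const struct dma_chan_sketch *chan)
{
	/* raid5 offload wants at least memcpy (drain/fill) and xor */
	return (chan->caps & (DMA_CAP_MEMCPY | DMA_CAP_XOR)) ==
	       (DMA_CAP_MEMCPY | DMA_CAP_XOR);
}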

The RAID1 case gives me pause because it seems any work along these
lines requires that the implementation work for both MD and DM, which
then eventually leads to being tasked with merging the two.


Jeff


Dan


Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-11 Thread Jeff Garzik

Dan Williams wrote:

This is a frequently asked question; Alan Cox had the same one at OLS.
The answer is "probably."  The only complication I currently see is
where/how the stripe cache is maintained.  With the IOPs it's easy
because the DMA engines operate directly on kernel memory.  With the
Promise card I believe they have memory on the card and it's not clear
to me if the XOR engines on the card can deal with host memory.  Also,
MD would need to be modified to handle a stripe cache located on a
device, or somehow synchronize its local cache with the card in a manner
that is still able to beat software-only MD.


sata_sx4 operates through [standard PC] memory on the card, and you use 
a DMA engine to copy memory to/from the card.


[select chipsets supported by] sata_promise operates directly on host 
memory.


So, while sata_sx4 is farther away from your direct-host-memory model, 
it also has much more potential for RAID acceleration:  ideally, RAID1 
just copies data to the card once, then copies the data to multiple 
drives from there.  Similarly with RAID5, you can eliminate copies and 
offload XOR, presuming the drives are all connected to the same card.


Jeff




Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-11 Thread Dan Williams

On 9/11/06, Jeff Garzik <[EMAIL PROTECTED]> wrote:

Dan Williams wrote:
> Neil,
>
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale(r) series of I/O Processors.  The MD changes allow stripe
> operations to run outside the spin lock in a work queue.  Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.
>
> Since the last release of the raid5 changes many bug fixes and other
> improvements have been made as a result of stress testing.  See the per
> patch change logs for more information about what was fixed.  This
> release is the first release of the full dma implementation.
>
> The patches touch 3 areas, the md-raid5 driver, the generic dmaengine
> interface, and a platform device driver for IOPs.  The raid5 changes
> follow your comments concerning making the acceleration implementation
> similar to how the stripe cache handles I/O requests.  The dmaengine
> changes are the second release of this code.  They expand the interface
> to handle more than memcpy operations, and add a generic raid5-dma
> client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
> memset across all IOP architectures (32x, 33x, and 13xx).
>
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following.  For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously.  I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs.  For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).
>
> Please consider the patches for -mm.
>
> -Dan
>
> [PATCH 01/19] raid5: raid5_do_soft_block_ops
> [PATCH 02/19] raid5: move write operations to a workqueue
> [PATCH 03/19] raid5: move check parity operations to a workqueue
> [PATCH 04/19] raid5: move compute block operations to a workqueue
> [PATCH 05/19] raid5: move read completion copies to a workqueue
> [PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
> [PATCH 07/19] raid5: remove compute_block and compute_parity5
> [PATCH 08/19] dmaengine: enable multiple clients and operations
> [PATCH 09/19] dmaengine: reduce backend address permutations
> [PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
> [PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
> [PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
> [PATCH 13/19] dmaengine: add support for dma xor zero sum operations
> [PATCH 14/19] dmaengine: add dma_sync_wait
> [PATCH 15/19] dmaengine: raid5 dma client
> [PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
> [PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
> [PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
> [PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver

Can devices like drivers/scsi/sata_sx4.c or drivers/scsi/sata_promise.c
take advantage of this?  Promise silicon supports RAID5 XOR offload.

If so, how?  If not, why not?  :)

This is a frequently asked question; Alan Cox had the same one at OLS.
The answer is "probably."  The only complication I currently see is
where/how the stripe cache is maintained.  With the IOPs it's easy
because the DMA engines operate directly on kernel memory.  With the
Promise card I believe they have memory on the card and it's not clear
to me if the XOR engines on the card can deal with host memory.  Also,
MD would need to be modified to handle a stripe cache located on a
device, or somehow synchronize its local cache with the card in a manner
that is still able to beat software-only MD.


Jeff


Dan


Re: [PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-11 Thread Jeff Garzik

Dan Williams wrote:

Neil,

The following patches implement hardware accelerated raid5 for the Intel
Xscale® series of I/O Processors.  The MD changes allow stripe
operations to run outside the spin lock in a work queue.  Hardware
acceleration is achieved by using a dma-engine-aware work queue routine
instead of the default software only routine.

Since the last release of the raid5 changes many bug fixes and other
improvements have been made as a result of stress testing.  See the per
patch change logs for more information about what was fixed.  This
release is the first release of the full dma implementation.

The patches touch 3 areas, the md-raid5 driver, the generic dmaengine
interface, and a platform device driver for IOPs.  The raid5 changes
follow your comments concerning making the acceleration implementation
similar to how the stripe cache handles I/O requests.  The dmaengine
changes are the second release of this code.  They expand the interface
to handle more than memcpy operations, and add a generic raid5-dma
client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
memset across all IOP architectures (32x, 33x, and 13xx).

Concerning the context switching performance concerns raised at the
previous release, I have observed the following.  For the hardware
accelerated case it appears that performance is always better with the
work queue than without since it allows multiple stripes to be operated
on simultaneously.  I expect the same for an SMP platform, but so far my
testing has been limited to IOPs.  For a single-processor
non-accelerated configuration I have not observed performance
degradation with work queue support enabled, but in the Kconfig option
help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Please consider the patches for -mm.

-Dan

[PATCH 01/19] raid5: raid5_do_soft_block_ops
[PATCH 02/19] raid5: move write operations to a workqueue
[PATCH 03/19] raid5: move check parity operations to a workqueue
[PATCH 04/19] raid5: move compute block operations to a workqueue
[PATCH 05/19] raid5: move read completion copies to a workqueue
[PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
[PATCH 07/19] raid5: remove compute_block and compute_parity5
[PATCH 08/19] dmaengine: enable multiple clients and operations
[PATCH 09/19] dmaengine: reduce backend address permutations
[PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
[PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
[PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
[PATCH 13/19] dmaengine: add support for dma xor zero sum operations
[PATCH 14/19] dmaengine: add dma_sync_wait
[PATCH 15/19] dmaengine: raid5 dma client
[PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
[PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
[PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
[PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver


Can devices like drivers/scsi/sata_sx4.c or drivers/scsi/sata_promise.c 
take advantage of this?  Promise silicon supports RAID5 XOR offload.


If so, how?  If not, why not?  :)

Jeff





[PATCH 00/19] Hardware Accelerated MD RAID5: Introduction

2006-09-11 Thread Dan Williams
Neil,

The following patches implement hardware accelerated raid5 for the Intel
Xscale® series of I/O Processors.  The MD changes allow stripe
operations to run outside the spin lock in a work queue.  Hardware
acceleration is achieved by using a dma-engine-aware work queue routine
instead of the default software only routine.

Since the last release of the raid5 changes many bug fixes and other
improvements have been made as a result of stress testing.  See the per
patch change logs for more information about what was fixed.  This
release is the first release of the full dma implementation.

The patches touch 3 areas, the md-raid5 driver, the generic dmaengine
interface, and a platform device driver for IOPs.  The raid5 changes
follow your comments concerning making the acceleration implementation
similar to how the stripe cache handles I/O requests.  The dmaengine
changes are the second release of this code.  They expand the interface
to handle more than memcpy operations, and add a generic raid5-dma
client.  The iop-adma driver supports dma memcpy, xor, xor zero sum, and
memset across all IOP architectures (32x, 33x, and 13xx).

Concerning the context switching performance concerns raised at the
previous release, I have observed the following.  For the hardware
accelerated case it appears that performance is always better with the
work queue than without since it allows multiple stripes to be operated
on simultaneously.  I expect the same for an SMP platform, but so far my
testing has been limited to IOPs.  For a single-processor
non-accelerated configuration I have not observed performance
degradation with work queue support enabled, but in the Kconfig option
help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Please consider the patches for -mm.

-Dan

[PATCH 01/19] raid5: raid5_do_soft_block_ops
[PATCH 02/19] raid5: move write operations to a workqueue
[PATCH 03/19] raid5: move check parity operations to a workqueue
[PATCH 04/19] raid5: move compute block operations to a workqueue
[PATCH 05/19] raid5: move read completion copies to a workqueue
[PATCH 06/19] raid5: move the reconstruct write expansion operation to a workqueue
[PATCH 07/19] raid5: remove compute_block and compute_parity5
[PATCH 08/19] dmaengine: enable multiple clients and operations
[PATCH 09/19] dmaengine: reduce backend address permutations
[PATCH 10/19] dmaengine: expose per channel dma mapping characteristics to clients
[PATCH 11/19] dmaengine: add memset as an asynchronous dma operation
[PATCH 12/19] dmaengine: dma_async_memcpy_err for DMA engines that do not support memcpy
[PATCH 13/19] dmaengine: add support for dma xor zero sum operations
[PATCH 14/19] dmaengine: add dma_sync_wait
[PATCH 15/19] dmaengine: raid5 dma client
[PATCH 16/19] dmaengine: Driver for the Intel IOP 32x, 33x, and 13xx RAID engines
[PATCH 17/19] iop3xx: define IOP3XX_REG_ADDR[32|16|8] and clean up DMA/AAU defs
[PATCH 18/19] iop3xx: Give Linux control over PCI (ATU) initialization
[PATCH 19/19] iop3xx: IOP 32x and 33x support for the iop-adma driver

Note: the iop3xx patches apply against the iop3xx platform code
refactoring done by Lennert Buytenhek.  His patches are reproduced,
with permission, on the Xscale IOP SourceForge site.

Also available on SourceForge:

Linux Symposium Paper: MD RAID Acceleration Support for Asynchronous
DMA/XOR Engines
http://prdownloads.sourceforge.net/xscaleiop/ols_paper_2006.pdf?download

Tar archive of the patch set
http://prdownloads.sourceforge.net/xscaleiop/md_raid_accel-2.6.18-rc6.tar.gz?download

[PATCH 01/19] http://prdownloads.sourceforge.net/xscaleiop/md-add-raid5-do-soft-block-ops.patch?download
[PATCH 02/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-write-operations-to-a-workqueue.patch?download
[PATCH 03/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-check-parity-operations-to-a-workqueue.patch?download
[PATCH 04/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-compute-block-operations-to-a-workqueue.patch?download
[PATCH 05/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-read-completion-copies-to-a-workqueue.patch?download
[PATCH 06/19] http://prdownloads.sourceforge.net/xscaleiop/md-move-expansion-operations-to-a-workqueue.patch?download
[PATCH 07/19] http://prdownloads.sourceforge.net/xscaleiop/md-remove-compute_block-and-compute_parity5.patch?download
[PATCH 08/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-multiple-clients-and-multiple-operations.patch?download
[PATCH 09/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-unite-backend-address-types.patch?download
[PATCH 10/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-map-page.patch?download
[PATCH 11/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-memset.patch?download
[PATCH 12/19] http://prdownloads.sourceforge.net/xscaleiop/dmaengine-dma-async-memcpy-err.patch?download
[PATCH 13/19] http://prdownloads.sourceforge.net/xscale