Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-05 Thread Anup Patel
On Sat, Feb 4, 2017 at 12:12 AM, Dan Williams  wrote:
> On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel  wrote:
>>
>>
>> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams 
>> wrote:
>>>
>>> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel 
>>> wrote:
>>> > The DMAENGINE framework assumes that if PQ offload is supported by a
>>> > DMA device then all 256 PQ coefficients are supported. This assumption
>>> > does not hold anymore because we now have BCM-SBA-RAID offload engine
>>> > which supports PQ offload with limited number of PQ coefficients.
>>> >
>>> > This patch extends async_tx APIs to handle DMA devices with support
>>> > for fewer PQ coefficients.
>>> >
>>> > Signed-off-by: Anup Patel 
>>> > Reviewed-by: Scott Branden 
>>> > ---
>>> >  crypto/async_tx/async_pq.c  |  3 +++
>>> >  crypto/async_tx/async_raid6_recov.c | 12 ++--
>>> >  include/linux/dmaengine.h   | 19 +++
>>> >  include/linux/raid/pq.h |  3 +++
>>> >  4 files changed, 35 insertions(+), 2 deletions(-)
>>>
>>> So, I hate the way async_tx does these checks on each operation, and
>>> it's ok for me to say that because it's my fault. Really it's md that
>>> should be validating engine offload capabilities once at the beginning
>>> of time. I'd rather we move in that direction than continue to pile
>>> onto a bad design.
>>
>>
>> Yes, indeed. All async_tx APIs have lot of checks and for high throughput
>> RAID offload engine these checks can add some overhead.
>>
>> I think doing checks in Linux md would be certainly better but this would
>> mean lot of changes in Linux md as well as remove checks in async_tx.
>>
>> Also, async_tx APIs should not find DMA channel on its own instead it
>> should rely on Linux md to provide DMA channel pointer as parameter.
>>
>> It's better to do checks cleanup in async_tx as separate patchset and
>> keep this patchset simple.
>
> That's been the problem with async_tx being broken like this for
> years. Once you get this "small / simple" patch upstream, that
> arguably makes async_tx a little bit worse, there is no longer any
> motivation to fix the underlying issues. If you care about the long
> term health of raid offload and are enabling new hardware support you
> should first tackle the known problems with it before adding new
> features.

Apart from the checks related issue you pointed there are other
issues with async_tx APIs such as:

1. The mechanism to do update PQ (or RAID6 update) operation
in current async_tx APIs is to call async_gen_syndrome() twice
with ASYNC_TX_PQ_XOR_DST flag set. Also, async_gen_syndrome()
will always prefer SW approach when ASYNC_TX_PQ_XOR_DST flag
is set. This means async_tx API is forcing SW approach for update
PQ operation and in-addition we require two async_gen_syndrome()
calls to achieve update PQ. This limitations of async_gen_syndrome()
reduces performance of async_tx APIs. Instead of this we should
have a dedicated async_update_pq() API which will allow RAID
offload engine drivers (such as BCM-FS4-RAID) to implement
update PQ using HW offload and this new API will fall-back to
SW approach using async_gen_syndrome() if no DMA channel
provides update PQ HW offload.

2. In our stress testing, we have observed that dma_map_page()
and dma_unmap_page() used in various async_tx APIs are the
major cause of overhead. If we directly call DMA channel callbacks
with pre-DMA-mapped pages then we get very high throughput.
The async_tx APIs should provide a way for pre-DMA-mapped
pages so that Linux MD can exploit this fact for better performance.

3. We really don't have a test module to stress/benchmark all
async_tx APIs using multi-threading and batching large number
of request in each thread. This kind of test module is very much
required for performance benchmarking and stressing high
throughput (hundreds of Gbps) RAID offload engines (such as
BCM-FS4-RAID).

>From the above, we already have async_tx_test module to
address point3. We also plan to address point1 above but
this would also require changes in Linux MD to use new
async_update_pq() API.

As you can see, this patchset is not end of story of us if we
want best possible utilization of BCM-FS4-RAID.

Regards,
Anup


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-03 Thread Dan Williams
On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel  wrote:
>
>
> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams 
> wrote:
>>
>> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel 
>> wrote:
>> > The DMAENGINE framework assumes that if PQ offload is supported by a
>> > DMA device then all 256 PQ coefficients are supported. This assumption
>> > does not hold anymore because we now have BCM-SBA-RAID offload engine
>> > which supports PQ offload with limited number of PQ coefficients.
>> >
>> > This patch extends async_tx APIs to handle DMA devices with support
>> > for fewer PQ coefficients.
>> >
>> > Signed-off-by: Anup Patel 
>> > Reviewed-by: Scott Branden 
>> > ---
>> >  crypto/async_tx/async_pq.c  |  3 +++
>> >  crypto/async_tx/async_raid6_recov.c | 12 ++--
>> >  include/linux/dmaengine.h   | 19 +++
>> >  include/linux/raid/pq.h |  3 +++
>> >  4 files changed, 35 insertions(+), 2 deletions(-)
>>
>> So, I hate the way async_tx does these checks on each operation, and
>> it's ok for me to say that because it's my fault. Really it's md that
>> should be validating engine offload capabilities once at the beginning
>> of time. I'd rather we move in that direction than continue to pile
>> onto a bad design.
>
>
> Yes, indeed. All async_tx APIs have lot of checks and for high throughput
> RAID offload engine these checks can add some overhead.
>
> I think doing checks in Linux md would be certainly better but this would
> mean lot of changes in Linux md as well as remove checks in async_tx.
>
> Also, async_tx APIs should not find DMA channel on its own instead it
> should rely on Linux md to provide DMA channel pointer as parameter.
>
> It's better to do checks cleanup in async_tx as separate patchset and
> keep this patchset simple.

That's been the problem with async_tx being broken like this for
years. Once you get this "small / simple" patch upstream, that
arguably makes async_tx a little bit worse, there is no longer any
motivation to fix the underlying issues. If you care about the long
term health of raid offload and are enabling new hardware support you
should first tackle the known problems with it before adding new
features.


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-03 Thread Anup Patel
On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams  wrote:
>
> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel  wrote:
> > The DMAENGINE framework assumes that if PQ offload is supported by a
> > DMA device then all 256 PQ coefficients are supported. This assumption
> > does not hold anymore because we now have BCM-SBA-RAID offload engine
> > which supports PQ offload with limited number of PQ coefficients.
> >
> > This patch extends async_tx APIs to handle DMA devices with support
> > for fewer PQ coefficients.
> >
> > Signed-off-by: Anup Patel 
> > Reviewed-by: Scott Branden 
> > ---
> >  crypto/async_tx/async_pq.c  |  3 +++
> >  crypto/async_tx/async_raid6_recov.c | 12 ++--
> >  include/linux/dmaengine.h   | 19 +++
> >  include/linux/raid/pq.h |  3 +++
> >  4 files changed, 35 insertions(+), 2 deletions(-)
>
> So, I hate the way async_tx does these checks on each operation, and
> it's ok for me to say that because it's my fault. Really it's md that
> should be validating engine offload capabilities once at the beginning
> of time. I'd rather we move in that direction than continue to pile
> onto a bad design.

Yes, indeed. All async_tx APIs have lot of checks and for high throughput
RAID offload engine these checks can add some overhead.

I think doing checks in Linux md would be certainly better but this would
mean lot of changes in Linux md as well as remove checks in async_tx.

Also, async_tx APIs should not find DMA channel on its own instead it
should rely on Linux md to provide DMA channel pointer as parameter.

It's better to do checks cleanup in async_tx as separate patchset and
keep this patchset simple.

Regards,
Anup


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-01 Thread Dan Williams
On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel  wrote:
> The DMAENGINE framework assumes that if PQ offload is supported by a
> DMA device then all 256 PQ coefficients are supported. This assumption
> does not hold anymore because we now have BCM-SBA-RAID offload engine
> which supports PQ offload with limited number of PQ coefficients.
>
> This patch extends async_tx APIs to handle DMA devices with support
> for fewer PQ coefficients.
>
> Signed-off-by: Anup Patel 
> Reviewed-by: Scott Branden 
> ---
>  crypto/async_tx/async_pq.c  |  3 +++
>  crypto/async_tx/async_raid6_recov.c | 12 ++--
>  include/linux/dmaengine.h   | 19 +++
>  include/linux/raid/pq.h |  3 +++
>  4 files changed, 35 insertions(+), 2 deletions(-)

So, I hate the way async_tx does these checks on each operation, and
it's ok for me to say that because it's my fault. Really it's md that
should be validating engine offload capabilities once at the beginning
of time. I'd rather we move in that direction than continue to pile
onto a bad design.


[PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-01 Thread Anup Patel
The DMAENGINE framework assumes that if PQ offload is supported by a
DMA device then all 256 PQ coefficients are supported. This assumption
does not hold anymore because we now have BCM-SBA-RAID offload engine
which supports PQ offload with limited number of PQ coefficients.

This patch extends async_tx APIs to handle DMA devices with support
for fewer PQ coefficients.

Signed-off-by: Anup Patel 
Reviewed-by: Scott Branden 
---
 crypto/async_tx/async_pq.c  |  3 +++
 crypto/async_tx/async_raid6_recov.c | 12 ++--
 include/linux/dmaengine.h   | 19 +++
 include/linux/raid/pq.h |  3 +++
 4 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/crypto/async_tx/async_pq.c b/crypto/async_tx/async_pq.c
index f83de99..16c6526 100644
--- a/crypto/async_tx/async_pq.c
+++ b/crypto/async_tx/async_pq.c
@@ -187,6 +187,9 @@ async_gen_syndrome(struct page **blocks, unsigned int 
offset, int disks,
 
BUG_ON(disks > 255 || !(P(blocks, disks) || Q(blocks, disks)));
 
+   if (device && dma_maxpqcoef(device) < src_cnt)
+   device = NULL;
+
if (device)
unmap = dmaengine_get_unmap_data(device->dev, disks, 
GFP_NOWAIT);
 
diff --git a/crypto/async_tx/async_raid6_recov.c 
b/crypto/async_tx/async_raid6_recov.c
index 8fab627..2916f95 100644
--- a/crypto/async_tx/async_raid6_recov.c
+++ b/crypto/async_tx/async_raid6_recov.c
@@ -352,6 +352,7 @@ async_raid6_2data_recov(int disks, size_t bytes, int faila, 
int failb,
 {
void *scribble = submit->scribble;
int non_zero_srcs, i;
+   struct dma_chan *chan = async_dma_find_channel(DMA_PQ);
 
BUG_ON(faila == failb);
if (failb < faila)
@@ -359,12 +360,15 @@ async_raid6_2data_recov(int disks, size_t bytes, int 
faila, int failb,
 
pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);
 
+   if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+   chan = NULL;
+
/* if a dma resource is not available or a scribble buffer is not
 * available punt to the synchronous path.  In the 'dma not
 * available' case be sure to use the scribble buffer to
 * preserve the content of 'blocks' as the caller intended.
 */
-   if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+   if (!chan || !scribble) {
void **ptrs = scribble ? scribble : (void **) blocks;
 
async_tx_quiesce(&submit->depend_tx);
@@ -432,15 +436,19 @@ async_raid6_datap_recov(int disks, size_t bytes, int 
faila,
void *scribble = submit->scribble;
int good_srcs, good, i;
struct page *srcs[2];
+   struct dma_chan *chan = async_dma_find_channel(DMA_PQ);
 
pr_debug("%s: disks: %d len: %zu\n", __func__, disks, bytes);
 
+   if (chan && dma_maxpqcoef(chan->device) < RAID6_PQ_MAX_COEF)
+   chan = NULL;
+
/* if a dma resource is not available or a scribble buffer is not
 * available punt to the synchronous path.  In the 'dma not
 * available' case be sure to use the scribble buffer to
 * preserve the content of 'blocks' as the caller intended.
 */
-   if (!async_dma_find_channel(DMA_PQ) || !scribble) {
+   if (!chan || !scribble) {
void **ptrs = scribble ? scribble : (void **) blocks;
 
async_tx_quiesce(&submit->depend_tx);
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index feee6ec..d938a8b 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /**
@@ -668,6 +669,7 @@ struct dma_filter {
  * @cap_mask: one or more dma_capability flags
  * @max_xor: maximum number of xor sources, 0 if no capability
  * @max_pq: maximum number of PQ sources and PQ-continue capability
+ * @max_pqcoef: maximum number of PQ coefficients, 0 if all supported
  * @copy_align: alignment shift for memcpy operations
  * @xor_align: alignment shift for xor operations
  * @pq_align: alignment shift for pq operations
@@ -727,11 +729,13 @@ struct dma_device {
dma_cap_mask_t  cap_mask;
unsigned short max_xor;
unsigned short max_pq;
+   unsigned short max_pqcoef;
enum dmaengine_alignment copy_align;
enum dmaengine_alignment xor_align;
enum dmaengine_alignment pq_align;
enum dmaengine_alignment fill_align;
#define DMA_HAS_PQ_CONTINUE (1 << 15)
+   #define DMA_HAS_FEWER_PQ_COEF (1 << 15)
 
int dev_id;
struct device *dev;
@@ -1122,6 +1126,21 @@ static inline int dma_maxpq(struct dma_device *dma, enum 
dma_ctrl_flags flags)
BUG();
 }
 
+static inline void dma_set_maxpqcoef(struct dma_device *dma,
+unsigned short max_pqcoef)
+{
+   if (max_pqcoef < RAID6_PQ_MAX_COEF) {
+   dma->max_pqcoef = max_pqcoef;
+   dma->max_pqcoef |= DMA_HAS_FEWER