RE: [PATCH 2/2] powerpc/dma/raidengine: enable Freescale RaidEngine device

2013-04-09 Thread Shi Xuelin-B29237
Hi Dan & Vinod,

Do you have any comments about this patch?

Thanks,
Forrest

-Original Message-
From: Shi Xuelin-B29237 
Sent: November 21, 2012 17:01
To: dan.j.willi...@gmail.com; vinod.k...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: i...@ovro.caltech.edu; Shi Xuelin-B29237; Rai Harninder-B01044; Burmi 
Naveen-B16502
Subject: [PATCH 2/2] powerpc/dma/raidengine: enable Freescale RaidEngine device

From: Xuelin Shi b29...@freescale.com

The RaidEngine is a new FSL hardware block used as a hardware accelerator for 
RAID5/6.

This patch enables the RaidEngine functionality and provides hardware 
offloading capability for memcpy, xor and raid6 pq computation. It works under 
dmaengine control with the async_tx layer interface.
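
For context, a minimal sketch of how an async_tx client (for example the MD
RAID5/6 code) submits an XOR that a dmaengine driver like this one can then
offload; the sketch_xor() name and its parameters are illustrative only and
not part of this patch:

#include <linux/async_tx.h>

static struct dma_async_tx_descriptor *sketch_xor(struct page *dest,
						  struct page **srcs,
						  int src_cnt, size_t len)
{
	struct async_submit_ctl submit;

	/* no flags, no dependent tx, no completion callback, no scribble */
	init_async_submit(&submit, 0, NULL, NULL, NULL, NULL);

	/* routed to a channel advertising DMA_XOR, or done on the CPU */
	return async_xor(dest, srcs, 0, src_cnt, len, &submit);
}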

Signed-off-by: Harninder Rai harninder@freescale.com
Signed-off-by: Naveen Burmi naveenbu...@freescale.com
Signed-off-by: Xuelin Shi b29...@freescale.com
---
 drivers/dma/Kconfig|   14 +
 drivers/dma/Makefile   |1 +
 drivers/dma/fsl_raid.c |  990 
 drivers/dma/fsl_raid.h |  317 
 4 files changed, 1322 insertions(+)
 create mode 100644 drivers/dma/fsl_raid.c
 create mode 100644 drivers/dma/fsl_raid.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index d4c1218..aa37279 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -320,6 +320,20 @@ config MMP_PDMA
help
  Support the MMP PDMA engine for PXA and MMP platfrom.
 
+config FSL_RAID
+	tristate "Freescale RAID Engine Device Driver"
+	depends on FSL_SOC && !FSL_DMA
+	select DMA_ENGINE
+	select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+	select ASYNC_MEMCPY
+	select ASYNC_XOR
+	select ASYNC_PQ
+	---help---
+	  Enable support for Freescale RAID Engine. RAID Engine is
+	  available on some QorIQ SoCs (like P5020). It has
+	  the capability to offload RAID5/RAID6 operations from CPU.
+	  RAID5 is XOR and memcpy. RAID6 is P/Q and memcpy.
+
 config DMA_ENGINE
bool
 
diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 7428fea..29b65eb 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_DMATEST) += dmatest.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioat/
 obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_FSL_DMA) += fsldma.o
+obj-$(CONFIG_FSL_RAID) += fsl_raid.o
 obj-$(CONFIG_MPC512X_DMA) += mpc512x_dma.o
 obj-$(CONFIG_MV_XOR) += mv_xor.o
 obj-$(CONFIG_DW_DMAC) += dw_dmac.o
diff --git a/drivers/dma/fsl_raid.c b/drivers/dma/fsl_raid.c
new file mode 100644
index 000..ec19817
--- /dev/null
+++ b/drivers/dma/fsl_raid.c
@@ -0,0 +1,990 @@
+/*
+ * drivers/dma/fsl_raid.c
+ *
+ * Freescale RAID Engine device driver
+ *
+ * Author:
+ * Harninder Rai harninder@freescale.com
+ * Naveen Burmi naveenbu...@freescale.com
+ *
+ * Copyright (c) 2010-2012 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License (GPL) as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Theory of operation:
+ *
+ * General capabilities:
+ * RAID Engine (RE) block is capable of offloading XOR, memcpy and P/Q
+ * calculations required in RAID5 and RAID6 operations. RE driver
+ * registers with Linux's ASYNC layer as dma

RE: [PATCH] DMA/RaidEngine: Enable FSL RaidEngine

2012-09-05 Thread Shi Xuelin-B29237
Hi Dan,

Do you have any comments about this RaidEngine patch?

Thanks,
Forrest

-Original Message-
From: Linuxppc-dev 
[mailto:linuxppc-dev-bounces+qiang.liu=freescale@lists.ozlabs.org] On 
Behalf Of Shi Xuelin-B29237
Sent: August 22, 2012 14:24
To: dan.j.willi...@gmail.com; vinod.k...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: Rai Harninder-B01044; Rai Harninder-B01044; i...@ovro.caltech.edu; Burmi 
Naveen-B16502; Burmi Naveen-B16502; Shi Xuelin-B29237
Subject: [PATCH] DMA/RaidEngine: Enable FSL RaidEngine

From: Xuelin Shi b29...@freescale.com

The RaidEngine is a new FSL hardware block used as a hardware accelerator for 
RAID5/6.

This patch enables the RaidEngine functionality and provides hardware 
offloading capability for memcpy, xor and raid6 pq computation. It works under 
dmaengine control with the async_tx layer interface.

Signed-off-by: Harninder Rai harninder@freescale.com
Signed-off-by: Naveen Burmi naveenbu...@freescale.com
Signed-off-by: Xuelin Shi b29...@freescale.com
---
 arch/powerpc/boot/dts/fsl/p5020si-post.dtsi|1 +
 arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi |6 +
 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi |   85 ++
 drivers/dma/Kconfig|   14 +
 drivers/dma/Makefile   |1 +
 drivers/dma/fsl_raid.c | 1090 
 drivers/dma/fsl_raid.h |  294 +++
 7 files changed, 1491 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
 create mode 100644 drivers/dma/fsl_raid.c
 create mode 100644 drivers/dma/fsl_raid.h

diff --git a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
index 64b6abe..5d7205b 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
@@ -354,4 +354,5 @@
 /include/ "qoriq-sata2-0.dtsi"
 /include/ "qoriq-sata2-1.dtsi"
 /include/ "qoriq-sec4.2-0.dtsi"
+/include/ "qoriq-raid1.0-0.dtsi"
 };
diff --git a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
index ae823a4..d54cd90 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
@@ -70,6 +70,12 @@
	rtic_c = &rtic_c;
	rtic_d = &rtic_d;
	sec_mon = &sec_mon;
+
+	raideng = &raideng;
+	raideng_jr0 = &raideng_jr0;
+	raideng_jr1 = &raideng_jr1;
+	raideng_jr2 = &raideng_jr2;
+	raideng_jr3 = &raideng_jr3;
};
 
cpus {
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
new file mode 100644
index 000..8d2e8aa
--- /dev/null
+++ b/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
@@ -0,0 +1,85 @@
+/*
+ * QorIQ RAID 1.0 device tree stub [ controller @ offset 0x32 ]
+ *
+ * Copyright 2012 Freescale Semiconductor Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ * * Redistributions of source code must retain the above copyright
+ *   notice, this list of conditions and the following disclaimer.
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ * * Neither the name of Freescale Semiconductor nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License (GPL) as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+raideng: raideng@32 {
+	compatible = "fsl,raideng-v1.0";
+	#address-cells = <1>;
+	#size-cells = <1>;
+	reg = <0x32 0x1>;
+	ranges = <0 0x32 0x1

RE: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

2012-04-25 Thread Shi Xuelin-B29237
Hi Dan Williams,

Do you have any comment about this patch?

Thanks,
Forrest

-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: January 12, 2012 0:14
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH] fsldma: fix performance degradation by optimizing spinlock 
use.

On Wed, Jan 11, 2012 at 07:54:55AM +, Shi Xuelin-B29237 wrote:
 Hello Ira,
 
 As we discussed in the previous patch, I add one smp_mb() in fsl_tx_status.
 In my testing with iozone, this smp_mb() could cause 1%~2% performance 
 degradation.
 Anyway it is acceptable for me. Do you have any other comments?
 

This patch looks fine to me.

Ira

 -Original Message-
 From: Shi Xuelin-B29237 
 Sent: December 26, 2011 14:01
 To: i...@ovro.caltech.edu; vinod.k...@intel.com; dan.j.willi...@intel.com; 
 linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
 Cc: Shi Xuelin-B29237
 Subject: [PATCH] fsldma: fix performance degradation by optimizing spinlock 
 use.
 
 From: Forrest shi b29...@freescale.com
 
 dma status check function fsl_tx_status is heavily called in
 a tight loop and the desc lock in fsl_tx_status contended by
 the dma status update function. this caused the dma performance
 degrades much.
 
 this patch releases the lock in the fsl_tx_status function, and
 introduce the smp_mb() to avoid possible memory inconsistency.
 
 Signed-off-by: Forrest Shi xuelin@freescale.com
 ---
  drivers/dma/fsldma.c |6 +-
  1 files changed, 1 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
 index 8a78154..008fb5e 100644
 --- a/drivers/dma/fsldma.c
 +++ b/drivers/dma/fsldma.c
 @@ -986,15 +986,11 @@ static enum dma_status fsl_tx_status(struct dma_chan 
 *dchan,
   struct fsldma_chan *chan = to_fsl_chan(dchan);
   dma_cookie_t last_complete;
   dma_cookie_t last_used;
 - unsigned long flags;
 -
 - spin_lock_irqsave(&chan->desc_lock, flags);
  
  last_complete = chan->completed_cookie;
 + smp_mb();
  last_used = dchan->cookie;
  
 - spin_unlock_irqrestore(&chan->desc_lock, flags);
 -
   dma_set_tx_state(txstate, last_complete, last_used, 0);
   return dma_async_is_complete(cookie, last_complete, last_used); }
 --
 1.7.0.4
 
 


RE: [PATCH] dmaengine: async_xor, fix zero address issue when xor highmem page

2012-01-10 Thread Shi Xuelin-B29237
Hello Dan Williams,

Do you have any comment about this patch?

Thanks,
Forrest

-Original Message-
From: Shi Xuelin-B29237 
Sent: December 27, 2011 14:31
To: vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Cc: Shi Xuelin-B29237
Subject: [PATCH] dmaengine: async_xor, fix zero address issue when xor highmem 
page

From: Forrest Shi b29...@freescale.com

We may do_sync_xor() on highmem pages; in this case, page_address() will
return a zero address, which causes a failure.

This patch uses kmap_atomic() before XORing the pages and kunmap_atomic()
after it.
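
As an aside, a minimal sketch of the map/XOR/unmap pattern this patch applies,
reduced to one source page (illustrative only; it assumes the two-argument
kmap_atomic()/kunmap_atomic() API of kernels from this era and the
xor_blocks() helper from crypto/xor.c):

#include <linux/highmem.h>
#include <linux/raid/xor.h>

static void sketch_xor_one_page(struct page *dest, struct page *src,
				unsigned int offset, unsigned int len)
{
	void *d = kmap_atomic(dest, KM_USER0) + offset;
	void *s = kmap_atomic(src, KM_USER1) + offset;

	xor_blocks(1, len, d, &s);	/* CPU XOR of one source into dest */

	/* unmap in reverse order of mapping */
	kunmap_atomic(s - offset, KM_USER1);
	kunmap_atomic(d - offset, KM_USER0);
}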

Signed-off-by: b29...@freescale.com xuelin@freescale.com
---
 crypto/async_tx/async_xor.c |   16 
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index bc28337..5b416d1 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -26,6 +26,7 @@
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/dma-mapping.h>
 #include <linux/raid/xor.h>
 #include <linux/async_tx.h>
@@ -126,7 +127,7 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	    int src_cnt, size_t len, struct async_submit_ctl *submit)
 {
int i;
-   int xor_src_cnt = 0;
+   int xor_src_cnt = 0, kmap_cnt=0;
int src_off = 0;
void *dest_buf;
void **srcs;
@@ -138,11 +139,13 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 
 	/* convert to buffer pointers */
 	for (i = 0; i < src_cnt; i++)
-		if (src_list[i])
-			srcs[xor_src_cnt++] = page_address(src_list[i]) + offset;
+		if (src_list[i]) {
+			srcs[xor_src_cnt++] = kmap_atomic(src_list[i], KM_USER1) + offset;
+		}
+	kmap_cnt = xor_src_cnt;
 	src_cnt = xor_src_cnt;
 	/* set destination address */
-	dest_buf = page_address(dest) + offset;
+	dest_buf = kmap_atomic(dest, KM_USER0) + offset;
 
	if (submit->flags & ASYNC_TX_XOR_ZERO_DST)
memset(dest_buf, 0, len);
@@ -157,6 +160,11 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
src_off += xor_src_cnt;
}
 
+   kunmap_atomic(dest_buf, KM_USER0);
+   for (i = 0; i < kmap_cnt; i++)
+   if (src_list[i])
+   kunmap_atomic(srcs[i], KM_USER1);
+
async_tx_sync_epilog(submit);
 }
 
--
1.7.0.4




RE: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

2012-01-10 Thread Shi Xuelin-B29237
Hello Ira,

As we discussed in the previous patch, I added one smp_mb() in fsl_tx_status.
In my testing with iozone, this smp_mb() causes a 1%~2% performance 
degradation.
Anyway, it is acceptable to me. Do you have any other comments?

Thanks,
Forrest

-Original Message-
From: Shi Xuelin-B29237 
Sent: December 26, 2011 14:01
To: i...@ovro.caltech.edu; vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: Shi Xuelin-B29237
Subject: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

From: Forrest Shi b29...@freescale.com

The DMA status check function fsl_tx_status is called heavily in
a tight loop, and the desc lock taken in fsl_tx_status is contended by
the DMA status update function. This degrades DMA performance
considerably.

This patch drops the lock in the fsl_tx_status function and
introduces an smp_mb() to avoid possible memory inconsistency.

Signed-off-by: Forrest Shi xuelin@freescale.com
---
 drivers/dma/fsldma.c |6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index 8a78154..008fb5e 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -986,15 +986,11 @@ static enum dma_status fsl_tx_status(struct dma_chan 
*dchan,
struct fsldma_chan *chan = to_fsl_chan(dchan);
dma_cookie_t last_complete;
dma_cookie_t last_used;
-   unsigned long flags;
-
-   spin_lock_irqsave(&chan->desc_lock, flags);
 
	last_complete = chan->completed_cookie;
+   smp_mb();
	last_used = dchan->cookie;
 
-   spin_unlock_irqrestore(&chan->desc_lock, flags);
-
dma_set_tx_state(txstate, last_complete, last_used, 0);
return dma_async_is_complete(cookie, last_complete, last_used); }
--
1.7.0.4




RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

2011-12-04 Thread Shi Xuelin-B29237
Hi Ira,

Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any 
possible order*. Memory accesses are not guaranteed to be *complete* by 
the time fsl_dma_tx_submit() returns!

fsl_dma_tx_submit is enclosed by spin_lock_irqsave/spin_unlock_irqrestore; when 
this function returns, I believe the memory accesses are complete. 
spin_unlock_irqrestore is an implicit memory barrier and guarantees this.

Thanks,
Forrest


-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: December 3, 2011 1:14
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
spinlock use.

On Fri, Dec 02, 2011 at 03:47:27AM +, Shi Xuelin-B29237 wrote:
 Hi Iris,
 
 I'm convinced that smp_rmb() is needed when removing the spinlock. 
 As noted, Documentation/memory-barriers.txt says that stores on one CPU can 
 be observed by another CPU in a different order.
 Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a 
 LOCK (in fsl_tx_status). This provided a full barrier, forcing the 
 operations to complete correctly when viewed by the second CPU.
 
 I do not agree this smp_rmb() works here. Because when this smp_rmb() 
 executed and begin to read chan->common.cookie, you still cannot avoid the 
 order issue. Something like one is reading old value, but another CPU is 
 updating the new value. 
 
 My point is here the order is not important for the DMA decision.
 Completed DMA tx is decided as not complete is not a big deal, because next 
 time it will be OK.
 
 I believe there is no case that could cause uncompleted DMA tx is decided as 
 completed, because the fsl_tx_status is called after fsl_dma_tx_submit for a 
 specific cookie. If you can give me an example here, I will agree with you.
 

According to memory-barriers.txt, writes to main memory may be observed in any 
order if memory barriers are not used. This means that writes can appear to 
happen in a different order than they were issued by the CPU.

Citing from the text:

 There are certain things that the Linux kernel memory barriers do not 
 guarantee:

  (*) There is no guarantee that any of the memory accesses specified before a
  memory barrier will be _complete_ by the completion of a memory barrier
  instruction; the barrier can be considered to draw a line in that CPU's
  access queue that accesses of the appropriate type may not cross.

Also:

 Without intervention, CPU 2 may perceive the events on CPU 1 in some 
 effectively random order, despite the write barrier issued by CPU 1:

Also:

 When dealing with CPU-CPU interactions, certain types of memory 
 barrier should always be paired.  A lack of appropriate pairing is almost 
 certainly an error.

 A write barrier should always be paired with a data dependency barrier 
 or read barrier, though a general barrier would also be viable.

Therefore, in an SMP system, the following situation can happen.

descriptor->cookie = 2
chan->common.cookie = 1
chan->completed_cookie = 1

This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls
dma_async_is_complete() ***after*** CPU-B has observed the write to
descriptor->cookie, and ***before*** CPU-B has observed the write to
chan->common.cookie.

Remember, without barriers, CPU-B can observe CPU-A's memory accesses in *any 
possible order*. Memory accesses are not guaranteed to be *complete* by the 
time fsl_dma_tx_submit() returns!

With the above values, dma_async_is_complete() returns DMA_COMPLETE. This is 
incorrect: the DMA is still in progress. The required invariant
chan->common.cookie >= descriptor->cookie has not been met.

By adding an smp_rmb(), I force CPU-B to stall until *both* stores in
fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually hit 
main memory. This avoids the above situation: all CPUs observe
descriptor->cookie and chan->common.cookie update in sync with each
other.
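
To make the pairing concrete, here is a sketch of the smp_wmb()/smp_rmb()
pairing being described, written as if it sat in drivers/dma/fsldma.c so that
struct fsldma_chan and the dmaengine types are in scope (illustrative only;
the patch under discussion only adds a barrier on the status side and relies
on the submit-side spinlock for the write ordering):

/* CPU-A, submit side: publish the new cookie */
static void sketch_publish_cookie(struct fsldma_chan *chan,
				  struct dma_async_tx_descriptor *desc,
				  dma_cookie_t cookie)
{
	desc->cookie = cookie;		/* store A */
	smp_wmb();			/* order store A before store B */
	chan->common.cookie = cookie;	/* store B */
}

/* CPU-B, status side: read in the opposite order */
static bool sketch_cookie_visible(struct fsldma_chan *chan,
				  struct dma_async_tx_descriptor *desc)
{
	dma_cookie_t last_used = chan->common.cookie;	/* load B */

	smp_rmb();			/* pairs with the smp_wmb() above */
	return desc->cookie <= last_used;	/* load A */
}

If CPU-B sees the new value at load B, the smp_rmb() guarantees it also sees
the new value at load A; without the pairing, the two loads may effectively be
observed out of order.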

Is this unclear in any way?

Please run your test with the smp_rmb() and measure the performance impact.

Ira




RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

2011-12-01 Thread Shi Xuelin-B29237
Hi Ira,

I'm convinced that smp_rmb() is needed when removing the spinlock. As noted, 
Documentation/memory-barriers.txt says that stores on one CPU can be
observed by another CPU in a different order.
Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in 
fsl_tx_status). This provided a full barrier, forcing the operations to 
complete correctly when viewed by the second CPU. 

I do not agree that this smp_rmb() works here, because when this smp_rmb() is 
executed and we begin to read chan->common.cookie, you still cannot avoid the 
ordering issue: one CPU may be reading the old value while another CPU is 
updating it to the new value. 

My point is that here the order is not important for the DMA decision.
A completed DMA tx being reported as not complete is not a big deal, because next 
time it will be OK.

I believe there is no case in which an uncompleted DMA tx could be reported as 
completed, because fsl_tx_status is called after fsl_dma_tx_submit for a 
specific cookie. If you can give me an example here, I will agree with you.

Thanks,
Forrest 

-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: December 1, 2011 1:08
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
spinlock use.

On Wed, Nov 30, 2011 at 09:57:47AM +, Shi Xuelin-B29237 wrote:
 Hello Ira,
 
 In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion 
 in mainline Linux:
do {
 status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
 if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
 printk(KERN_ERR "dma_sync_wait_timeout!\n");
 return DMA_ERROR;
 }
 } while (status == DMA_IN_PROGRESS);
 

That is the body of dma_sync_wait(). It is mostly used in the raid code.
I understand that you don't want to change the raid code to use callbacks.
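
For reference, the callback-based alternative mentioned here would look
roughly like the following sketch (illustrative only; the sketch_* names are
made up and error handling around submission is omitted):

#include <linux/completion.h>
#include <linux/dmaengine.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

static void sketch_dma_done(void *arg)
{
	complete(arg);		/* runs from the driver's completion tasklet */
}

static int sketch_wait_for_dma(struct dma_async_tx_descriptor *tx)
{
	DECLARE_COMPLETION_ONSTACK(done);

	tx->callback = sketch_dma_done;
	tx->callback_param = &done;
	dmaengine_submit(tx);
	dma_async_issue_pending(tx->chan);

	/* sleep instead of spinning on dma_async_is_tx_complete() */
	if (!wait_for_completion_timeout(&done, msecs_to_jiffies(5000)))
		return -ETIMEDOUT;
	return 0;
}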

In any case, I think we've strayed from the topic under consideration, which 
is: can we remove this spinlock without introducing a bug.

I'm convinced that smp_rmb() is needed when removing the spinlock. As noted, 
Documentation/memory-barriers.txt says that stores on one CPU can be observed 
by another CPU in a different order.

Previously, there was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in 
fsl_tx_status). This provided a full barrier, forcing the operations to 
complete correctly when viewed by the second CPU. From the
text:

 Therefore, from (1), (2) and (4) an UNLOCK followed by an 
 unconditional LOCK is equivalent to a full barrier, but a LOCK followed by an 
 UNLOCK is not.
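
In the terms used in this thread, the pairing the original (locked) code
relied on looks roughly like the following sketch, as if it sat in
drivers/dma/fsldma.c and simplified from the real functions (illustrative
only):

static void sketch_submit_side(struct fsldma_chan *chan, dma_cookie_t cookie)
{
	unsigned long flags;

	spin_lock_irqsave(&chan->desc_lock, flags);
	chan->common.cookie = cookie;			/* store under the lock */
	spin_unlock_irqrestore(&chan->desc_lock, flags);	/* UNLOCK ... */
}

static dma_cookie_t sketch_status_side(struct fsldma_chan *chan)
{
	unsigned long flags;
	dma_cookie_t last_used;

	spin_lock_irqsave(&chan->desc_lock, flags);	/* ... followed by LOCK */
	last_used = chan->common.cookie;		/* load under the lock */
	spin_unlock_irqrestore(&chan->desc_lock, flags);

	return last_used;
}

The UNLOCK on the submit side followed by the LOCK on the status side is what
formed the full barrier; dropping the status-side lock removes that pairing.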

Also, please read EXAMPLES OF MEMORY BARRIER SEQUENCES and INTER-CPU LOCKING 
BARRIER EFFECTS. Particularly, in EXAMPLES OF MEMORY BARRIER SEQUENCES, the 
text notes:

 Without intervention, CPU 2 may perceive the events on CPU 1 in some 
 effectively random order, despite the write barrier issued by CPU 1:

 [snip diagram]

 And thirdly, a read barrier acts as a partial order on loads. Consider 
 the following sequence of events:

 [snip diagram]

 Without intervention, CPU 2 may then choose to perceive the events on 
 CPU 1 in some effectively random order, despite the write barrier issued by 
 CPU 1:

 [snip diagram]


And so on. Please read this entire section in the document.

I can't give you an ACK on the proposed patch. To the best of my understanding, 
I believe it introduces a bug. I've tried to provide as much evidence for this 
belief as I can, in the form of documentation in the kernel source tree. If you 
can cite some documentation that shows I am wrong, I will happily change my 
mind!

Ira

 -Original Message-
 From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
 Sent: November 30, 2011 1:26
 To: Li Yang-R58472
 Cc: Shi Xuelin-B29237; vinod.k...@intel.com; dan.j.willi...@intel.com; 
 linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
 Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
 spinlock use.
 
 On Tue, Nov 29, 2011 at 03:19:05AM +, Li Yang-R58472 wrote:
   Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 
   optimizing spinlock use.
   
   On Thu, Nov 24, 2011 at 08:12:25AM +, Shi Xuelin-B29237 wrote:
Hi Ira,
   
Thanks for your review.
   
After second thought, I think your scenario may not occur.
Because the cookie 20 we query must be returned by
fsl_dma_tx_submit(...) in
   practice.
We never query a cookie not returned by fsl_dma_tx_submit(...).
   
   
   I agree about this part.
   
 When we call fsl_tx_status(20), the chan->common.cookie is 
definitely wrote as
   20 and cpu2 could not read as 19.
   
   
   This is what I don't agree about. However, I'm not an expert on CPU cache 
   vs.
   memory accesses in an multi-processor system. The section

RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

2011-11-30 Thread Shi Xuelin-B29237
Hello Ira,

In drivers/dma/dmaengine.c, we have below tight loop to check DMA completion in 
mainline Linux:
   do {
status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
printk(KERN_ERR "dma_sync_wait_timeout!\n");
return DMA_ERROR;
}
} while (status == DMA_IN_PROGRESS);


Thanks,
Forrest

-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: November 30, 2011 1:26
To: Li Yang-R58472
Cc: Shi Xuelin-B29237; vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
spinlock use.

On Tue, Nov 29, 2011 at 03:19:05AM +, Li Yang-R58472 wrote:
  Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 
  optimizing spinlock use.
  
  On Thu, Nov 24, 2011 at 08:12:25AM +, Shi Xuelin-B29237 wrote:
   Hi Ira,
  
   Thanks for your review.
  
   After second thought, I think your scenario may not occur.
   Because the cookie 20 we query must be returned by 
   fsl_dma_tx_submit(...) in
  practice.
   We never query a cookie not returned by fsl_dma_tx_submit(...).
  
  
  I agree about this part.
  
    When we call fsl_tx_status(20), the chan->common.cookie is 
   definitely wrote as
  20 and cpu2 could not read as 19.
  
  
  This is what I don't agree about. However, I'm not an expert on CPU cache 
  vs.
  memory accesses in an multi-processor system. The section titled 
  CACHE COHERENCY in Documentation/memory-barriers.txt leads me to 
  believe that the scenario I described is possible.
 
 For Freescale PowerPC, the chip automatically takes care of cache coherency.  
 Even if this is a concern, spinlock can't address it.
 
  
  What happens if CPU1's write of chan->common.cookie only goes into 
  CPU1's cache. It never makes it to main memory before CPU2 fetches the old 
  value of 19.
  
  I don't think you should see any performance impact from the 
  smp_mb() operation.
 
 Smp_mb() do have impact on performance if it's in the hot path.  While it 
 might be safer having it, I doubt it is really necessary.  If the CPU1 
 doesn't have the updated last_used, it's shouldn't have known there is a 
 cookie 20 existed either.
 

I believe that you are correct, for powerpc. However, anything outside of 
arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be surprised 
to see fsldma running on an iMX someday (ARM processor).

My interpretation says that the change introduces the possibility that
fsl_tx_status() returns the wrong answer for an extremely small time window, on 
SMP only, based on Documentation/memory-barriers.txt. But I can't seem to 
convince you.

My real question is what code path is hitting this spinlock? Is it in mainline 
Linux? Why is it polling rather than using callbacks to determine DMA 
completion?

Thanks,
Ira

   -Original Message-
   From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
   Sent: November 23, 2011 2:59
   To: Shi Xuelin-B29237
   Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; 
   vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; 
   linux-ker...@vger.kernel.org
   Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by 
   optimizing
  spinlock use.
  
   On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote:
From: Forrest Shi b29...@freescale.com
   
dma status check function fsl_tx_status is heavily called in
a tight loop and the desc lock in fsl_tx_status contended by
the dma status update function. this caused the dma performance
degrades much.
   
this patch releases the lock in the fsl_tx_status function.
I believe it has no neglect impact on the following call of
dma_async_is_complete(...).
   
we can see below three conditions will be identified as success
 a)  x < complete < use
 b)  x < complete+N < use+N
 c)  x < complete < use+N
here complete is the completed_cookie, use is the last_used
cookie, x is the querying cookie, N is MAX cookie
   
 when chan->completed_cookie is being read, the last_used may
be incresed. Anyway it has no neglect impact on the dma status
decision.
   
Signed-off-by: Forrest Shi xuelin@freescale.com
---
 drivers/dma/fsldma.c |5 -
 1 files changed, 0 insertions(+), 5 deletions(-)
   
 diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
 index 8a78154..1dca56f 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
 @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
struct fsldma_chan *chan = to_fsl_chan(dchan);
dma_cookie_t last_complete;
dma_cookie_t last_used;
-   unsigned long flags

RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

2011-11-28 Thread Shi Xuelin-B29237
Hi Ira, 

see my comments below.

Thanks,
Forrest

-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: November 29, 2011 0:38
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; 
linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; linux-ker...@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
spinlock use.

On Thu, Nov 24, 2011 at 08:12:25AM +, Shi Xuelin-B29237 wrote:
 Hi Ira,
 
 Thanks for your review.
 
 After second thought, I think your scenario may not occur.
 Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in 
 practice. 
 We never query a cookie not returned by fsl_dma_tx_submit(...).
 

I agree about this part.

 When we call fsl_tx_status(20), the chan->common.cookie is definitely wrote 
 as 20 and cpu2 could not read as 19.
 

This is what I don't agree about. However, I'm not an expert on CPU cache vs. 
memory accesses in an multi-processor system. The section titled CACHE 
COHERENCY in Documentation/memory-barriers.txt leads me to believe that the 
scenario I described is possible.

 
[Shi Xuelin-B29237] If the query were used without rules, your scenario could happen. 
But in the DMA usage here, the query is used more or less sequentially: only 
after chan->common.cookie is updated in fsl_dma_tx_submit(...) and returned 
can it be queried. So your scenario will not happen.

What happens if CPU1's write of chan->common.cookie only goes into CPU1's 
cache. It never makes it to main memory before CPU2 fetches the old value of 19.
 
[Shi Xuelin-B29237] As you see, chan->common.cookie is updated in 
fsl_dma_tx_submit(...) and enclosed by the spinlock. The spinlock implementation on 
PPC will guarantee cache coherence among cores, something like what you called an 
implicit smp_mb.

I don't think you should see any performance impact from the smp_mb() operation.

[Shi Xuelin-B29237] Only adding smp_mb() doesn't work. It only syncs at this step, 
but the next step is the same as just getting into this function without the smp_mb 
operation.

Thanks,
Ira

 -Original Message-
 From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
 Sent: November 23, 2011 2:59
 To: Shi Xuelin-B29237
 Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; 
 vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; 
 linux-ker...@vger.kernel.org
 Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
 spinlock use.
 
 On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote:
  From: Forrest Shi b29...@freescale.com
  
  dma status check function fsl_tx_status is heavily called in
  a tight loop and the desc lock in fsl_tx_status contended by
  the dma status update function. this caused the dma performance
  degrades much.
  
  this patch releases the lock in the fsl_tx_status function.
  I believe it has no neglect impact on the following call of
  dma_async_is_complete(...).
  
  we can see below three conditions will be identified as success
  a)  x < complete < use
  b)  x < complete+N < use+N
  c)  x < complete < use+N
  here complete is the completed_cookie, use is the last_used
  cookie, x is the querying cookie, N is MAX cookie
  
  when chan->completed_cookie is being read, the last_used may
  be incresed. Anyway it has no neglect impact on the dma status
  decision.
  
  Signed-off-by: Forrest Shi xuelin@freescale.com
  ---
   drivers/dma/fsldma.c |5 -
   1 files changed, 0 insertions(+), 5 deletions(-)
  
  diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
  index 8a78154..1dca56f 100644
  --- a/drivers/dma/fsldma.c
  +++ b/drivers/dma/fsldma.c
  @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
  struct fsldma_chan *chan = to_fsl_chan(dchan);
  dma_cookie_t last_complete;
  dma_cookie_t last_used;
  -   unsigned long flags;
  -
  -   spin_lock_irqsave(&chan->desc_lock, flags);
   
 
 This will cause a bug. See below for a detailed explanation. You need this 
 instead:
 
   /*
* On an SMP system, we must ensure that this CPU has seen the
* memory accesses performed by another CPU under the
 * chan->desc_lock spinlock.
*/
   smp_mb();
  last_complete = chan->completed_cookie;
  last_used = dchan->cookie;
   
  -   spin_unlock_irqrestore(&chan->desc_lock, flags);
  -
  dma_set_tx_state(txstate, last_complete, last_used, 0);
  return dma_async_is_complete(cookie, last_complete, last_used);  }
 
 Facts:
 - dchan->cookie is the same member as chan->common.cookie (same memory 
 location)
 - chan->common.cookie is the last allocated cookie for a pending transaction
 - chan->completed_cookie is the last completed transaction
 
 I have replaced dchan->cookie with chan->common.cookie in the below 
 explanation, to keep everything referenced from the same structure.
 
 Variable usage before your change. Everything is used

RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

2011-11-24 Thread Shi Xuelin-B29237
Hi Ira,

Thanks for your review.

On second thought, I think your scenario may not occur, because the cookie 20 
we query must in practice have been returned by fsl_dma_tx_submit(...). 
We never query a cookie not returned by fsl_dma_tx_submit(...).

When we call fsl_tx_status(20), chan->common.cookie has definitely been written 
as 20 and CPU2 could not read it as 19.

What do you think?

Happy Thanksgiving Day,
Forrest

-Original Message-
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] 
Sent: November 23, 2011 2:59
To: Shi Xuelin-B29237
Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; 
vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; 
linux-ker...@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing 
spinlock use.

On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote:
 From: Forrest Shi b29...@freescale.com
 
 dma status check function fsl_tx_status is heavily called in
 a tight loop and the desc lock in fsl_tx_status contended by
 the dma status update function. this caused the dma performance
 degrades much.
 
 this patch releases the lock in the fsl_tx_status function.
 I believe it has no neglect impact on the following call of
 dma_async_is_complete(...).
 
 we can see below three conditions will be identified as success
 a)  x < complete < use
 b)  x < complete+N < use+N
 c)  x < complete < use+N
 here complete is the completed_cookie, use is the last_used
 cookie, x is the querying cookie, N is MAX cookie
 
 when chan->completed_cookie is being read, the last_used may
 be incresed. Anyway it has no neglect impact on the dma status
 decision.
 
 Signed-off-by: Forrest Shi xuelin@freescale.com
 ---
  drivers/dma/fsldma.c |5 -
  1 files changed, 0 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
 index 8a78154..1dca56f 100644
 --- a/drivers/dma/fsldma.c
 +++ b/drivers/dma/fsldma.c
 @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
   struct fsldma_chan *chan = to_fsl_chan(dchan);
   dma_cookie_t last_complete;
   dma_cookie_t last_used;
 - unsigned long flags;
 -
 - spin_lock_irqsave(&chan->desc_lock, flags);
  

This will cause a bug. See below for a detailed explanation. You need this 
instead:

/*
 * On an SMP system, we must ensure that this CPU has seen the
 * memory accesses performed by another CPU under the
 * chan->desc_lock spinlock.
 */
smp_mb();
   last_complete = chan->completed_cookie;
   last_used = dchan->cookie;
  
 - spin_unlock_irqrestore(&chan->desc_lock, flags);
 -
   dma_set_tx_state(txstate, last_complete, last_used, 0);
   return dma_async_is_complete(cookie, last_complete, last_used);  }

Facts:
- dchan->cookie is the same member as chan->common.cookie (same memory location)
- chan->common.cookie is the last allocated cookie for a pending transaction
- chan->completed_cookie is the last completed transaction

I have replaced dchan->cookie with chan->common.cookie in the below 
explanation, to keep everything referenced from the same structure.

Variable usage before your change. Everything is used locked.
- RW chan->common.cookie    (fsl_dma_tx_submit)
- R  chan->common.cookie    (fsl_tx_status)
- R  chan->completed_cookie (fsl_tx_status)
- W  chan->completed_cookie (dma_do_tasklet)

Variable usage after your change:
- RW chan->common.cookie    LOCKED
- R  chan->common.cookie    NO LOCK
- R  chan->completed_cookie NO LOCK
- W  chan->completed_cookie LOCKED

What if we assume that you have a 2 CPU system (such as a P2020). After your 
changes, one possible sequence is:

=== CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() === 
spin_lock_irqsave
descriptor->cookie = 20     (x in your example)
chan->common.cookie = 20    (used in your example)
spin_unlock_irqrestore

=== CPU2 - immediately calls fsl_tx_status() ===
chan->common.cookie == 19
chan->completed_cookie == 19
descriptor->cookie == 20

Since we don't have locks anymore, CPU2 may not have seen the write to
chan->common.cookie yet.

Also assume that the DMA hardware has not started processing the transaction 
yet. Therefore dma_do_tasklet() has not been called, and
chan->completed_cookie has not been updated.

In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even 
though the DMA operation has not succeeded. The DMA operation has not even 
started yet!
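
For reference, the decision being made here is dma_async_is_complete() from
include/linux/dmaengine.h; at the time it looked roughly like the following
(paraphrased, so check the exact code in the tree being patched):

static inline enum dma_status dma_async_is_complete(dma_cookie_t cookie,
			dma_cookie_t last_complete, dma_cookie_t last_used)
{
	if (last_complete <= last_used) {
		if ((cookie <= last_complete) || (cookie > last_used))
			return DMA_SUCCESS;
	} else {
		if ((cookie <= last_complete) && (cookie > last_used))
			return DMA_SUCCESS;
	}
	return DMA_IN_PROGRESS;
}

With the stale snapshot above (cookie 20, last_complete 19, last_used 19) the
cookie > last_used arm fires and wrongly reports success; once last_used is
seen as 20, the function falls through to DMA_IN_PROGRESS instead.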

The smp_mb() fixes this, since it forces CPU2 to have seen all memory 
operations that happened before CPU1 released the spinlock. Spinlocks are 
implicit SMP memory barriers.

Therefore, the above example becomes:
smp_mb();
chan->common.cookie == 20
chan->completed_cookie == 19
descriptor->cookie == 20

Then dma_async_is_complete() returns DMA_IN_PROGRESS, which