RE: [PATCH 2/2] powerpc/dma/raidengine: enable Freescale RaidEngine device
Hi Dan, Vinod,

Do you have any comments about this patch?

Thanks,
Forrest

-----Original Message-----
From: Shi Xuelin-B29237
Sent: November 21, 2012 17:01
To: dan.j.willi...@gmail.com; vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: i...@ovro.caltech.edu; Shi Xuelin-B29237; Rai Harninder-B01044; Burmi Naveen-B16502
Subject: [PATCH 2/2] powerpc/dma/raidengine: enable Freescale RaidEngine device

From: Xuelin Shi b29...@freescale.com

The RaidEngine is a new piece of Freescale hardware used for hardware
acceleration of RAID5/6. This patch enables the RaidEngine and provides
hardware offloading capability for memcpy, xor and RAID6 P/Q computation.
It works under dmaengine control with the async_tx layer interface.

Signed-off-by: Harninder Rai harninder@freescale.com
Signed-off-by: Naveen Burmi naveenbu...@freescale.com
Signed-off-by: Xuelin Shi b29...@freescale.com
---
 drivers/dma/Kconfig    |  14
 drivers/dma/Makefile   |   1
 drivers/dma/fsl_raid.c | 990
 drivers/dma/fsl_raid.h | 317
 4 files changed, 1322 insertions(+)
 create mode 100644 drivers/dma/fsl_raid.c
 create mode 100644 drivers/dma/fsl_raid.h

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index d4c1218..aa37279 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -320,6 +320,20 @@ config MMP_PDMA
 	help
 	  Support the MMP PDMA engine for PXA and MMP platform.
 
+config FSL_RAID
+	tristate "Freescale RAID Engine Device Driver"
+	depends on FSL_SOC && !FSL_DMA
+	select DMA_ENGINE
+	select ASYNC_TX_ENABLE_CHANNEL_SWITCH
+	select ASYNC_MEMCPY
+	select ASYNC_XOR
+	select ASYNC_PQ
+	---help---
+	  Enable support for the Freescale RAID Engine. The RAID Engine is
+	  available on some QorIQ SoCs (such as the P5020). It can offload
+	  RAID5/RAID6 operations from the CPU. RAID5 is XOR and memcpy;
+	  RAID6 is P/Q and memcpy.
+
 config DMA_ENGINE
 	bool

diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile
index 7428fea..29b65eb 100644
--- a/drivers/dma/Makefile
+++ b/drivers/dma/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_DMATEST) += dmatest.o
 obj-$(CONFIG_INTEL_IOATDMA) += ioat/
 obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o
 obj-$(CONFIG_FSL_DMA) += fsldma.o
+obj-$(CONFIG_FSL_RAID) += fsl_raid.o
 obj-$(CONFIG_MPC512X_DMA) += mpc512x_dma.o
 obj-$(CONFIG_MV_XOR) += mv_xor.o
 obj-$(CONFIG_DW_DMAC) += dw_dmac.o

diff --git a/drivers/dma/fsl_raid.c b/drivers/dma/fsl_raid.c
new file mode 100644
index 000..ec19817
--- /dev/null
+++ b/drivers/dma/fsl_raid.c
@@ -0,0 +1,990 @@
+/*
+ * drivers/dma/fsl_raid.c
+ *
+ * Freescale RAID Engine device driver
+ *
+ * Author:
+ *	Harninder Rai harninder@freescale.com
+ *	Naveen Burmi naveenbu...@freescale.com
+ *
+ * Copyright (c) 2010-2012 Freescale Semiconductor, Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in the
+ *       documentation and/or other materials provided with the distribution.
+ *     * Neither the name of Freescale Semiconductor nor the
+ *       names of its contributors may be used to endorse or promote products
+ *       derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License (GPL) as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ * Theory of operation:
+ *
+ * General capabilities:
+ *	The RAID Engine (RE) block is capable of offloading the XOR, memcpy
+ *	and P/Q calculations required in RAID5 and RAID6 operations. The RE
+ *	driver registers with Linux's ASYNC layer as dma
RE: [PATCH] DMA/RaidEngine: Enable FSL RaidEngine
Hi Dan,

Do you have any comments about this RaidEngine patch?

Thanks,
Forrest

-----Original Message-----
From: Linuxppc-dev [mailto:linuxppc-dev-bounces+qiang.liu=freescale@lists.ozlabs.org] On Behalf Of Shi Xuelin-B29237
Sent: August 22, 2012 14:24
To: dan.j.willi...@gmail.com; vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: Rai Harninder-B01044; i...@ovro.caltech.edu; Burmi Naveen-B16502; Shi Xuelin-B29237
Subject: [PATCH] DMA/RaidEngine: Enable FSL RaidEngine

From: Xuelin Shi b29...@freescale.com

The RaidEngine is a new piece of Freescale hardware used for hardware
acceleration of RAID5/6. This patch enables the RaidEngine and provides
hardware offloading capability for memcpy, xor and RAID6 P/Q computation.
It works under dmaengine control with the async_tx layer interface.

Signed-off-by: Harninder Rai harninder@freescale.com
Signed-off-by: Naveen Burmi naveenbu...@freescale.com
Signed-off-by: Xuelin Shi b29...@freescale.com
---
 arch/powerpc/boot/dts/fsl/p5020si-post.dtsi    |    1 +
 arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi     |    6 +
 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi |   85 ++
 drivers/dma/Kconfig                            |   14 +
 drivers/dma/Makefile                           |    1 +
 drivers/dma/fsl_raid.c                         | 1090
 drivers/dma/fsl_raid.h                         |  294 +++
 7 files changed, 1491 insertions(+), 0 deletions(-)
 create mode 100644 arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
 create mode 100644 drivers/dma/fsl_raid.c
 create mode 100644 drivers/dma/fsl_raid.h

diff --git a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
index 64b6abe..5d7205b 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-post.dtsi
@@ -354,4 +354,5 @@
 /include/ "qoriq-sata2-0.dtsi"
 /include/ "qoriq-sata2-1.dtsi"
 /include/ "qoriq-sec4.2-0.dtsi"
+/include/ "qoriq-raid1.0-0.dtsi"
 };

diff --git a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
index ae823a4..d54cd90 100644
--- a/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
+++ b/arch/powerpc/boot/dts/fsl/p5020si-pre.dtsi
@@ -70,6 +70,12 @@
 		rtic_c = &rtic_c;
 		rtic_d = &rtic_d;
 		sec_mon = &sec_mon;
+
+		raideng = &raideng;
+		raideng_jr0 = &raideng_jr0;
+		raideng_jr1 = &raideng_jr1;
+		raideng_jr2 = &raideng_jr2;
+		raideng_jr3 = &raideng_jr3;
 	};

 	cpus {

diff --git a/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
new file mode 100644
index 000..8d2e8aa
--- /dev/null
+++ b/arch/powerpc/boot/dts/fsl/qoriq-raid1.0-0.dtsi
@@ -0,0 +1,85 @@
+/*
+ * QorIQ RAID 1.0 device tree stub [ controller @ offset 0x32 ]
+ *
+ * Copyright 2012 Freescale Semiconductor Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in the
+ *       documentation and/or other materials provided with the distribution.
+ *     * Neither the name of Freescale Semiconductor nor the
+ *       names of its contributors may be used to endorse or promote products
+ *       derived from this software without specific prior written permission.
+ *
+ * ALTERNATIVELY, this software may be distributed under the terms of the
+ * GNU General Public License (GPL) as published by the Free Software
+ * Foundation, either version 2 of that License or (at your option) any
+ * later version.
+ *
+ * THIS SOFTWARE IS PROVIDED BY Freescale Semiconductor ``AS IS'' AND ANY
+ * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+ * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL Freescale Semiconductor BE LIABLE FOR ANY
+ * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+ * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+raideng: raideng@32 {
+	compatible = "fsl,raideng-v1.0";
+	#address-cells = <1>;
+	#size-cells = <1>;
+	reg = <0x32 0x1>;
+	ranges = <0 0x32 0x1
RE: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.
Hi Dan Williams,

Do you have any comment about this patch?

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
Sent: January 12, 2012 0:14
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

On Wed, Jan 11, 2012 at 07:54:55AM +0000, Shi Xuelin-B29237 wrote:
> Hello Ira,
>
> As we discussed in the previous patch, I added one smp_mb() in
> fsl_tx_status. In my testing with iozone, this smp_mb() causes a 1%~2%
> performance degradation. Anyway, that is acceptable to me. Do you have
> any other comments?

This patch looks fine to me.

Ira

-----Original Message-----
From: Shi Xuelin-B29237
Sent: December 26, 2011 14:01
To: i...@ovro.caltech.edu; vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: Shi Xuelin-B29237
Subject: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

From: Forrest Shi b29...@freescale.com

The DMA status check function fsl_tx_status is called heavily in a tight
loop, and the desc lock taken in fsl_tx_status is contended by the DMA
status update function. This degrades DMA performance significantly.

This patch drops the lock in the fsl_tx_status function and introduces an
smp_mb() to avoid possible memory inconsistency.

Signed-off-by: Forrest Shi xuelin@freescale.com
---
 drivers/dma/fsldma.c | 6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index 8a78154..008fb5e 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -986,15 +986,11 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
 	struct fsldma_chan *chan = to_fsl_chan(dchan);
 	dma_cookie_t last_complete;
 	dma_cookie_t last_used;
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->desc_lock, flags);
 
 	last_complete = chan->completed_cookie;
+	smp_mb();
 	last_used = dchan->cookie;
 
-	spin_unlock_irqrestore(&chan->desc_lock, flags);
-
 	dma_set_tx_state(txstate, last_complete, last_used, 0);
 	return dma_async_is_complete(cookie, last_complete, last_used);
 }
--
1.7.0.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
RE: [PATCH] dmaengine: async_xor, fix zero address issue when xor highmem page
Hello Dan Williams,

Do you have any comment about this patch?

Thanks,
Forrest

-----Original Message-----
From: Shi Xuelin-B29237
Sent: December 27, 2011 14:31
To: vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Cc: Shi Xuelin-B29237
Subject: [PATCH] dmaengine: async_xor, fix zero address issue when xor highmem page

From: Forrest Shi b29...@freescale.com

We may do_sync_xor() on highmem pages; in that case page_address() returns
a zero address, which causes a failure. This patch uses kmap_atomic()
before xor'ing the pages and kunmap_atomic() after it.

Signed-off-by: b29...@freescale.com xuelin@freescale.com
---
 crypto/async_tx/async_xor.c | 16
 1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index bc28337..5b416d1 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -26,6 +26,7 @@
 #include <linux/kernel.h>
 #include <linux/interrupt.h>
 #include <linux/mm.h>
+#include <linux/highmem.h>
 #include <linux/dma-mapping.h>
 #include <linux/raid/xor.h>
 #include <linux/async_tx.h>
@@ -126,7 +127,7 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	    int src_cnt, size_t len, struct async_submit_ctl *submit)
 {
 	int i;
-	int xor_src_cnt = 0;
+	int xor_src_cnt = 0, kmap_cnt = 0;
 	int src_off = 0;
 	void *dest_buf;
 	void **srcs;
@@ -138,11 +139,13 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 
 	/* convert to buffer pointers */
 	for (i = 0; i < src_cnt; i++)
-		if (src_list[i])
-			srcs[xor_src_cnt++] = page_address(src_list[i]) + offset;
+		if (src_list[i]) {
+			srcs[xor_src_cnt++] = kmap_atomic(src_list[i], KM_USER1) + offset;
+		}
+	kmap_cnt = xor_src_cnt;
 	src_cnt = xor_src_cnt;
 
 	/* set destination address */
-	dest_buf = page_address(dest) + offset;
+	dest_buf = kmap_atomic(dest, KM_USER0) + offset;
 
 	if (submit->flags & ASYNC_TX_XOR_ZERO_DST)
 		memset(dest_buf, 0, len);
@@ -157,6 +160,11 @@ do_sync_xor(struct page *dest, struct page **src_list, unsigned int offset,
 		src_off += xor_src_cnt;
 	}
 
+	kunmap_atomic(dest_buf, KM_USER0);
+	for (i = 0; i < kmap_cnt; i++)
+		if (src_list[i])
+			kunmap_atomic(srcs[i], KM_USER1);
+
 	async_tx_sync_epilog(submit);
 }
--
1.7.0.4
RE: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.
Hello Ira,

As we discussed in the previous patch, I added one smp_mb() in
fsl_tx_status. In my testing with iozone, this smp_mb() causes a 1%~2%
performance degradation. Anyway, that is acceptable to me. Do you have any
other comments?

Thanks,
Forrest

-----Original Message-----
From: Shi Xuelin-B29237
Sent: December 26, 2011 14:01
To: i...@ovro.caltech.edu; vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Cc: Shi Xuelin-B29237
Subject: [PATCH] fsldma: fix performance degradation by optimizing spinlock use.

From: Forrest Shi b29...@freescale.com

The DMA status check function fsl_tx_status is called heavily in a tight
loop, and the desc lock taken in fsl_tx_status is contended by the DMA
status update function. This degrades DMA performance significantly.

This patch drops the lock in the fsl_tx_status function and introduces an
smp_mb() to avoid possible memory inconsistency.

Signed-off-by: Forrest Shi xuelin@freescale.com
---
 drivers/dma/fsldma.c | 6 +-
 1 files changed, 1 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
index 8a78154..008fb5e 100644
--- a/drivers/dma/fsldma.c
+++ b/drivers/dma/fsldma.c
@@ -986,15 +986,11 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
 	struct fsldma_chan *chan = to_fsl_chan(dchan);
 	dma_cookie_t last_complete;
 	dma_cookie_t last_used;
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->desc_lock, flags);
 
 	last_complete = chan->completed_cookie;
+	smp_mb();
 	last_used = dchan->cookie;
 
-	spin_unlock_irqrestore(&chan->desc_lock, flags);
-
 	dma_set_tx_state(txstate, last_complete, last_used, 0);
 	return dma_async_is_complete(cookie, last_complete, last_used);
 }
--
1.7.0.4
RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
Hi Ira,

> Remember, without barriers, CPU-B can observe CPU-A's memory accesses in
> *any possible order*. Memory accesses are not guaranteed to be *complete*
> by the time fsl_dma_tx_submit() returns!

fsl_dma_tx_submit is enclosed by spin_lock_irqsave/spin_unlock_irqrestore;
when this function returns, I believe the memory accesses are complete.
spin_unlock_irqrestore is an implicit memory barrier and guarantees this.

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
Sent: December 3, 2011 1:14
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Fri, Dec 02, 2011 at 03:47:27AM +0000, Shi Xuelin-B29237 wrote:
> Hi Ira,
>
> > I'm convinced that smp_rmb() is needed when removing the spinlock. As
> > noted, Documentation/memory-barriers.txt says that stores on one CPU
> > can be observed by another CPU in a different order. Previously, there
> > was an UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in
> > fsl_tx_status). This provided a full barrier, forcing the operations
> > to complete correctly when viewed by the second CPU.
>
> I do not agree that this smp_rmb() works here. Even when the smp_rmb()
> has executed and we begin to read chan->common.cookie, you still cannot
> avoid the ordering issue: one CPU may be reading the old value while
> another CPU is updating the new value.
>
> My point is that the order is not important for the DMA decision here. A
> completed DMA tx being decided as "not complete" is not a big deal,
> because next time it will be OK. I believe there is no case that could
> cause an uncompleted DMA tx to be decided as completed, because
> fsl_tx_status is called after fsl_dma_tx_submit for a specific cookie.
> If you can give me an example, I will agree with you.

According to memory-barriers.txt, writes to main memory may be observed in
any order if memory barriers are not used. This means that writes can
appear to happen in a different order than they were issued by the CPU.
Citing from the text:

  There are certain things that the Linux kernel memory barriers do not
  guarantee:

  (*) There is no guarantee that any of the memory accesses specified
      before a memory barrier will be _complete_ by the completion of a
      memory barrier instruction; the barrier can be considered to draw a
      line in that CPU's access queue that accesses of the appropriate
      type may not cross.

Also:

  Without intervention, CPU 2 may perceive the events on CPU 1 in some
  effectively random order, despite the write barrier issued by CPU 1.

Also:

  When dealing with CPU-CPU interactions, certain types of memory barrier
  should always be paired. A lack of appropriate pairing is almost
  certainly an error. ... A write barrier should always be paired with a
  data dependency barrier or read barrier, though a general barrier would
  also be viable.

Therefore, in an SMP system, the following situation can happen:

  descriptor->cookie = 2
  chan->common.cookie = 1
  chan->completed_cookie = 1

This occurs when CPU-A calls fsl_dma_tx_submit() and then CPU-B calls
dma_async_is_complete() ***after*** CPU-B has observed the write to
descriptor->cookie, and ***before*** CPU-B has observed the write to
chan->common.cookie.

Remember, without barriers, CPU-B can observe CPU-A's memory accesses in
*any possible order*. Memory accesses are not guaranteed to be *complete*
by the time fsl_dma_tx_submit() returns!

With the above values, dma_async_is_complete() returns DMA_COMPLETE. This
is incorrect: the DMA is still in progress. The required invariant
chan->common.cookie >= descriptor->cookie has not been met.

By adding an smp_rmb(), I force CPU-B to stall until *both* stores in
fsl_dma_tx_submit() (descriptor->cookie and chan->common.cookie) actually
hit main memory. This avoids the above situation: all CPUs observe
descriptor->cookie and chan->common.cookie update in sync with each other.

Is this unclear in any way?

Please run your test with the smp_rmb() and measure the performance impact.

Ira
RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
Hi Ira,

> I'm convinced that smp_rmb() is needed when removing the spinlock. As
> noted, Documentation/memory-barriers.txt says that stores on one CPU can
> be observed by another CPU in a different order. Previously, there was an
> UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status).
> This provided a full barrier, forcing the operations to complete
> correctly when viewed by the second CPU.

I do not agree that this smp_rmb() works here. Even when the smp_rmb() has
executed and we begin to read chan->common.cookie, you still cannot avoid
the ordering issue: one CPU may be reading the old value while another CPU
is updating the new value.

My point is that the order is not important for the DMA decision here. A
completed DMA tx being decided as "not complete" is not a big deal, because
next time it will be OK. I believe there is no case that could cause an
uncompleted DMA tx to be decided as completed, because fsl_tx_status is
called after fsl_dma_tx_submit for a specific cookie. If you can give me an
example, I will agree with you.

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
Sent: December 1, 2011 1:08
To: Shi Xuelin-B29237
Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org; Li Yang-R58472
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Wed, Nov 30, 2011 at 09:57:47AM +0000, Shi Xuelin-B29237 wrote:
> Hello Ira,
>
> In drivers/dma/dmaengine.c, we have the below tight loop to check DMA
> completion in mainline Linux:
>
>	do {
>		status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
>		if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
>			printk(KERN_ERR "dma_sync_wait_timeout!\n");
>			return DMA_ERROR;
>		}
>	} while (status == DMA_IN_PROGRESS);

That is the body of dma_sync_wait(). It is mostly used in the raid code. I
understand that you don't want to change the raid code to use callbacks.

In any case, I think we've strayed from the topic under consideration,
which is: can we remove this spinlock without introducing a bug?

I'm convinced that smp_rmb() is needed when removing the spinlock. As
noted, Documentation/memory-barriers.txt says that stores on one CPU can be
observed by another CPU in a different order. Previously, there was an
UNLOCK (in fsl_dma_tx_submit) followed by a LOCK (in fsl_tx_status). This
provided a full barrier, forcing the operations to complete correctly when
viewed by the second CPU. From the text:

  Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional
  LOCK is equivalent to a full barrier, but a LOCK followed by an UNLOCK
  is not.

Also, please read "EXAMPLES OF MEMORY BARRIER SEQUENCES" and "INTER-CPU
LOCKING BARRIER EFFECTS". Particularly, in "EXAMPLES OF MEMORY BARRIER
SEQUENCES", the text notes:

  Without intervention, CPU 2 may perceive the events on CPU 1 in some
  effectively random order, despite the write barrier issued by CPU 1:

  [snip diagram]

  And thirdly, a read barrier acts as a partial order on loads. Consider
  the following sequence of events:

  [snip diagram]

  Without intervention, CPU 2 may then choose to perceive the events on
  CPU 1 in some effectively random order, despite the write barrier issued
  by CPU 1:

  [snip diagram]

And so on. Please read this entire section in the document.

I can't give you an ACK on the proposed patch. To the best of my
understanding, I believe it introduces a bug. I've tried to provide as much
evidence for this belief as I can, in the form of documentation in the
kernel source tree. If you can cite some documentation that shows I am
wrong, I will happily change my mind!

Ira

> -----Original Message-----
> From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
> Sent: November 30, 2011 1:26
> To: Li Yang-R58472
> Cc: Shi Xuelin-B29237; vinod.k...@intel.com; dan.j.willi...@intel.com;
> linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by
> optimizing spinlock use.
>
> On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:
> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by
> > > optimizing spinlock use.
> > >
> > > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> > > > Hi Ira,
> > > >
> > > > Thanks for your review. After second thought, I think your
> > > > scenario may not occur. The cookie 20 we query must have been
> > > > returned by fsl_dma_tx_submit(...) in practice. We never query a
> > > > cookie not returned by fsl_dma_tx_submit(...).
> > >
> > > I agree about this part.
> > >
> > > > When we call fsl_tx_status(20), chan->common.cookie has definitely
> > > > been written as 20, and CPU2 could not read it as 19.
> > >
> > > This is what I don't agree about. However, I'm not an expert on CPU
> > > cache vs. memory accesses in a multi-processor system. The section
RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
Hello Ira,

In drivers/dma/dmaengine.c, we have the below tight loop to check DMA
completion in mainline Linux:

	do {
		status = dma_async_is_tx_complete(chan, cookie, NULL, NULL);
		if (time_after_eq(jiffies, dma_sync_wait_timeout)) {
			printk(KERN_ERR "dma_sync_wait_timeout!\n");
			return DMA_ERROR;
		}
	} while (status == DMA_IN_PROGRESS);

Thanks,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
Sent: November 30, 2011 1:26
To: Li Yang-R58472
Cc: Shi Xuelin-B29237; vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Tue, Nov 29, 2011 at 03:19:05AM +0000, Li Yang-R58472 wrote:
> > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by
> > optimizing spinlock use.
> >
> > On Thu, Nov 24, 2011 at 08:12:25AM +0000, Shi Xuelin-B29237 wrote:
> > > Hi Ira,
> > >
> > > Thanks for your review. After second thought, I think your scenario
> > > may not occur. The cookie 20 we query must have been returned by
> > > fsl_dma_tx_submit(...) in practice. We never query a cookie not
> > > returned by fsl_dma_tx_submit(...).
> >
> > I agree about this part.
> >
> > > When we call fsl_tx_status(20), chan->common.cookie has definitely
> > > been written as 20, and CPU2 could not read it as 19.
> >
> > This is what I don't agree about. However, I'm not an expert on CPU
> > cache vs. memory accesses in a multi-processor system. The section
> > titled "CACHE COHERENCY" in Documentation/memory-barriers.txt leads me
> > to believe that the scenario I described is possible.
>
> For Freescale PowerPC, the chip automatically takes care of cache
> coherency. Even if this were a concern, a spinlock couldn't address it.

What happens if CPU1's write of chan->common.cookie only goes into CPU1's
cache, and never makes it to main memory before CPU2 fetches the old value
of 19?

> > I don't think you should see any performance impact from the smp_mb()
> > operation.
>
> smp_mb() does have an impact on performance if it is in the hot path.
> While it might be safer to have it, I doubt it is really necessary. If
> CPU1 doesn't have the updated last_used, it shouldn't have known that a
> cookie 20 existed either.

I believe that you are correct, for powerpc. However, anything outside of
arch/powerpc shouldn't assume it only runs on powerpc. I wouldn't be
surprised to see fsldma running on an iMX someday (ARM processor).

My interpretation says that the change introduces the possibility that
fsl_tx_status() returns the wrong answer for an extremely small time
window, on SMP only, based on Documentation/memory-barriers.txt. But I
can't seem to convince you.

My real question is: what code path is hitting this spinlock? Is it in
mainline Linux? Why is it polling rather than using callbacks to determine
DMA completion?

Thanks,
Ira

> > > -----Original Message-----
> > > From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
> > > Sent: November 23, 2011 2:59
> > > To: Shi Xuelin-B29237
> > > Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org;
> > > vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org;
> > > linux-ker...@vger.kernel.org
> > > Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by
> > > optimizing spinlock use.
> > >
> > > On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote:
> > > > From: Forrest Shi b29...@freescale.com
> > > >
> > > > The DMA status check function fsl_tx_status is called heavily in a
> > > > tight loop, and the desc lock in fsl_tx_status is contended by the
> > > > DMA status update function. This degrades DMA performance
> > > > significantly.
> > > >
> > > > This patch releases the lock in the fsl_tx_status function. I
> > > > believe it has negligible impact on the following call of
> > > > dma_async_is_complete(...). We can see the three conditions below
> > > > will be identified as success:
> > > >
> > > >	a) x <= complete <= use
> > > >	b) x <= complete+N <= use+N
> > > >	c) x <= complete <= use+N
> > > >
> > > > Here "complete" is the completed_cookie, "use" is the last_used
> > > > cookie, x is the cookie being queried, and N is the MAX cookie.
> > > >
> > > > When chan->completed_cookie is being read, last_used may be
> > > > increased. Anyway, it has negligible impact on the DMA status
> > > > decision.
> > > >
> > > > Signed-off-by: Forrest Shi xuelin@freescale.com
> > > > ---
> > > >  drivers/dma/fsldma.c | 5 -
> > > >  1 files changed, 0 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
> > > > index 8a78154..1dca56f 100644
> > > > --- a/drivers/dma/fsldma.c
> > > > +++ b/drivers/dma/fsldma.c
> > > > @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
> > > >  	struct fsldma_chan *chan = to_fsl_chan(dchan);
> > > >  	dma_cookie_t last_complete;
> > > >  	dma_cookie_t last_used;
> > > > -	unsigned long flags
RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
Hi Ira, see my comments below. Thanks, Forrest -Original Message- From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] Sent: 2011年11月29日 0:38 To: Shi Xuelin-B29237 Cc: vinod.k...@intel.com; dan.j.willi...@intel.com; linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; linux-ker...@vger.kernel.org Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Thu, Nov 24, 2011 at 08:12:25AM +, Shi Xuelin-B29237 wrote: Hi Ira, Thanks for your review. After second thought, I think your scenario may not occur. Because the cookie 20 we query must be returned by fsl_dma_tx_submit(...) in practice. We never query a cookie not returned by fsl_dma_tx_submit(...). I agree about this part. When we call fsl_tx_status(20), the chan-common.cookie is definitely wrote as 20 and cpu2 could not read as 19. This is what I don't agree about. However, I'm not an expert on CPU cache vs. memory accesses in an multi-processor system. The section titled CACHE COHERENCY in Documentation/memory-barriers.txt leads me to believe that the scenario I described is possible. [Shi Xuelin-B29237] If query is used without rules, your scenario may happen. But in DMA usage here, the query is used something like sequentially. Only after chan-common.cookie is updated in fsl_dma_tx_submit(...) and returned, then it could be queried. So you scenario will not happen. What happens if CPU1's write of chan-common.cookie only goes into CPU1's cache. It never makes it to main memory before CPU2 fetches the old value of 19. [Shi Xuelin-B29237] As you see, chan-common.cookie is updated in fsl_dma_tx_submit(...) and enclosed by spinlock. The spinlock implementation in PPC will guarantee the cache coherence among cores, something like you called implicit smp_mb. I don't think you should see any performance impact from the smp_mb() operation. [Shi Xuelin-B29237] Only add smp_mb() doesn't work. 
It only sync on this step, but next step is the same as just getting into this function without smp_mb operation. Thanks, Ira -Original Message- From: Ira W. Snyder [mailto:i...@ovro.caltech.edu] Sent: 2011年11月23日 2:59 To: Shi Xuelin-B29237 Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use. On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote: From: Forrest Shi b29...@freescale.com dma status check function fsl_tx_status is heavily called in a tight loop and the desc lock in fsl_tx_status contended by the dma status update function. this caused the dma performance degrades much. this patch releases the lock in the fsl_tx_status function. I believe it has no neglect impact on the following call of dma_async_is_complete(...). we can see below three conditions will be identified as success a) x complete use b) x complete+N use+N c) x complete use+N here complete is the completed_cookie, use is the last_used cookie, x is the querying cookie, N is MAX cookie when chan-completed_cookie is being read, the last_used may be incresed. Anyway it has no neglect impact on the dma status decision. Signed-off-by: Forrest Shi xuelin@freescale.com --- drivers/dma/fsldma.c |5 - 1 files changed, 0 insertions(+), 5 deletions(-) diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c index 8a78154..1dca56f 100644 --- a/drivers/dma/fsldma.c +++ b/drivers/dma/fsldma.c @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan, struct fsldma_chan *chan = to_fsl_chan(dchan); dma_cookie_t last_complete; dma_cookie_t last_used; - unsigned long flags; - - spin_lock_irqsave(chan-desc_lock, flags); This will cause a bug. See below for a detailed explanation. 
You need this instead:

	/*
	 * On an SMP system, we must ensure that this CPU has seen the
	 * memory accesses performed by another CPU under the
	 * chan->desc_lock spinlock.
	 */
	smp_mb();

>  	last_complete = chan->completed_cookie;
>  	last_used = dchan->cookie;
> -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> -
>  	dma_set_tx_state(txstate, last_complete, last_used, 0);
>  	return dma_async_is_complete(cookie, last_complete, last_used);
>  }

Facts:
- dchan->cookie is the same member as chan->common.cookie (same memory location)
- chan->common.cookie is the last allocated cookie for a pending transaction
- chan->completed_cookie is the last completed transaction

I have replaced dchan->cookie with chan->common.cookie in the explanation below, to keep everything referenced from the same structure.

Variable usage before your change. Everything is used locked:
- RW chan->common.cookie    (fsl_dma_tx_submit)
- R  chan->common.cookie    (fsl_tx_status)
- R  chan->completed_cookie (fsl_tx_status)
- W  chan->completed_cookie (dma_do_tasklet)
RE: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.
Hi Ira,

Thanks for your review. After second thought, I think your scenario may not occur, because the cookie 20 we query must have been returned by fsl_dma_tx_submit(...) in practice. We never query a cookie that was not returned by fsl_dma_tx_submit(...).

When we call fsl_tx_status(20), chan->common.cookie has definitely been written as 20, and CPU2 could not read it as 19.

What do you think?

Happy Thanksgiving,
Forrest

-----Original Message-----
From: Ira W. Snyder [mailto:i...@ovro.caltech.edu]
Sent: November 23, 2011 2:59
To: Shi Xuelin-B29237
Cc: dan.j.willi...@intel.com; Li Yang-R58472; z...@zh-kernel.org; vinod.k...@intel.com; linuxppc-dev@lists.ozlabs.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH][RFC] fsldma: fix performance degradation by optimizing spinlock use.

On Tue, Nov 22, 2011 at 12:55:05PM +0800, b29...@freescale.com wrote:
> From: Forrest Shi b29...@freescale.com
>
> The dma status check function fsl_tx_status is called heavily in a tight loop, and the desc_lock taken in fsl_tx_status is contended by the dma status update function. This causes dma performance to degrade badly.
>
> This patch removes the lock from the fsl_tx_status function. I believe this has negligible impact on the subsequent call of dma_async_is_complete(...). The three conditions below will all be identified as success:
>
>   a) x <= complete   <= use
>   b) x <= complete+N <= use+N
>   c) x <= complete   <= use+N
>
> Here "complete" is the completed_cookie, "use" is the last_used cookie, x is the cookie being queried, and N is the MAX cookie.
>
> While chan->completed_cookie is being read, last_used may be increased. Either way, this has negligible impact on the dma status decision.
> Signed-off-by: Forrest Shi xuelin@freescale.com
> ---
>  drivers/dma/fsldma.c | 5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
> index 8a78154..1dca56f 100644
> --- a/drivers/dma/fsldma.c
> +++ b/drivers/dma/fsldma.c
> @@ -986,15 +986,10 @@ static enum dma_status fsl_tx_status(struct dma_chan *dchan,
>  	struct fsldma_chan *chan = to_fsl_chan(dchan);
>  	dma_cookie_t last_complete;
>  	dma_cookie_t last_used;
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&chan->desc_lock, flags);

This will cause a bug. See below for a detailed explanation. You need this instead:

	/*
	 * On an SMP system, we must ensure that this CPU has seen the
	 * memory accesses performed by another CPU under the
	 * chan->desc_lock spinlock.
	 */
	smp_mb();

>  	last_complete = chan->completed_cookie;
>  	last_used = dchan->cookie;
> -	spin_unlock_irqrestore(&chan->desc_lock, flags);
> -
>  	dma_set_tx_state(txstate, last_complete, last_used, 0);
>  	return dma_async_is_complete(cookie, last_complete, last_used);
>  }

Facts:
- dchan->cookie is the same member as chan->common.cookie (same memory location)
- chan->common.cookie is the last allocated cookie for a pending transaction
- chan->completed_cookie is the last completed transaction

I have replaced dchan->cookie with chan->common.cookie in the explanation below, to keep everything referenced from the same structure.

Variable usage before your change. Everything is used locked:
- RW chan->common.cookie    (fsl_dma_tx_submit)
- R  chan->common.cookie    (fsl_tx_status)
- R  chan->completed_cookie (fsl_tx_status)
- W  chan->completed_cookie (dma_do_tasklet)

Variable usage after your change:
- RW chan->common.cookie    LOCKED
- R  chan->common.cookie    NO LOCK
- R  chan->completed_cookie NO LOCK
- W  chan->completed_cookie LOCKED

What if we assume that you have a 2-CPU system (such as a P2020)?
After your changes, one possible sequence is:

=== CPU1 - allocate + submit descriptor: fsl_dma_tx_submit() ===
spin_lock_irqsave
descriptor->cookie = 20     (x in your example)
chan->common.cookie = 20    ("used" in your example)
spin_unlock_irqrestore

=== CPU2 - immediately calls fsl_tx_status() ===
chan->common.cookie == 19
chan->completed_cookie == 19
descriptor->cookie == 20

Since we don't have locks anymore, CPU2 may not have seen the write to chan->common.cookie yet. Also assume that the DMA hardware has not started processing the transaction yet. Therefore dma_do_tasklet() has not been called, and chan->completed_cookie has not been updated.

In this case, dma_async_is_complete() (on CPU2) returns DMA_SUCCESS, even though the DMA operation has not succeeded. The DMA operation has not even started yet!

The smp_mb() fixes this, since it forces CPU2 to have seen all memory operations that happened before CPU1 released the spinlock. Spinlocks are implicit SMP memory barriers. Therefore, the above example becomes:

smp_mb();
chan->common.cookie == 20
chan->completed_cookie == 19
descriptor->cookie == 20

Then dma_async_is_complete() returns DMA_IN_PROGRESS, which is the correct result.