Re: [RFC PATCH 0/8] x86: Support Intel Key Locker

2020-12-17 Thread Dan Williams
On Thu, Dec 17, 2020 at 11:11 AM Eric Biggers  wrote:
>
> On Wed, Dec 16, 2020 at 09:41:38AM -0800, Chang S. Bae wrote:
> > [1] Intel Architecture Instruction Set Extensions Programming Reference:
> > 
> > https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

> > [2] Intel Key Locker Specification:
> > 
> > https://software.intel.com/content/dam/develop/external/us/en/documents/343965-intel-key-locker-specification.pdf


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-17 Thread Dan Williams
On Sat, Oct 17, 2020 at 9:10 AM  wrote:
>
> From: Tom Rix 
>
> This is an upcoming change to clean up a new warning treewide.
> I am wondering if the change should be one mega patch (see below), a
> normal patch per file (about 100 patches), or somewhere half way by
> collecting early acks.
>
> clang has a number of useful, new warnings see
> https://clang.llvm.org/docs/DiagnosticsReference.html
>
> This change cleans up -Wunreachable-code-break
> https://clang.llvm.org/docs/DiagnosticsReference.html#wunreachable-code-break
> for 266 of 485 warnings in this week's linux-next, allyesconfig on x86_64.
>
> The method of fixing was to look for warnings where the preceding statement
> was a simple statement that, by inspection, made the subsequent break unneeded.
> In order of frequency these look like
>
> return and break
>
> switch (c->x86_vendor) {
> case X86_VENDOR_INTEL:
> intel_p5_mcheck_init(c);
> return 1;
> -   break;
>
> goto and break
>
> default:
> operation = 0; /* make gcc happy */
> goto fail_response;
> -   break;
>
> break and break
> case COLOR_SPACE_SRGB:
> /* by pass */
> REG_SET(OUTPUT_CSC_CONTROL, 0,
> OUTPUT_CSC_GRPH_MODE, 0);
> break;
> -   break;
>
> The exception to the simple statement, is a switch case with a block
> and the end of block is a return
>
> struct obj_buffer *buff = r->ptr;
> return scnprintf(str, PRIV_STR_SIZE,
> "size=%u\naddr=0x%X\n", buff->size,
> buff->addr);
> }
> -   break;
>
> Not considered obvious and excluded, breaks after
> multi level switches
> complicated if-else if-else blocks
> panic() or similar calls
>
> And there is an odd addition of a 'fallthrough' in drivers/tty/nozomi.c
[..]
> diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
> index 5a7c80053c62..2f250874b1a4 100644
> --- a/drivers/nvdimm/claim.c
> +++ b/drivers/nvdimm/claim.c
> @@ -200,11 +200,10 @@ ssize_t nd_namespace_store(struct device *dev,
> }
> break;
> default:
> len = -EBUSY;
> goto out_attach;
> -   break;
> }

Acked-by: Dan Williams 


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-27 Thread Dan Williams
On Wed, Sep 27, 2017 at 9:31 AM, Casey Leedom  wrote:
> | From: Dan Williams 
> | Sent: Tuesday, September 26, 2017 9:10 AM
> |
> | On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom  wrote:
> | > | From: Robin Murphy 
> | > | Sent: Tuesday, September 26, 2017 7:22 AM
> | > |...
> | > ...
> | >   Regardless, it seems that you agree that there's an issue with the Intel
> | > I/O MMU support code with regard to the legal values which a (struct
> | > scatterlist) can take on?  I still can't find any documentation for this
> | > and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
> | > representation where [Offset, Offset+Length) can reside outside the Page.
> |
> | Consider the case where the page represents a huge page, then an
> | offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.
>
>   Okay, but whatever the underlaying Page Size is, should [Offset,
> Offset+Length) completely reside within the referenced Page?  I'm just
> trying to understand the Invariance Conditions which are assumed by all of
> the code which processes Scatter/gather Lists ...

As far as I can see "Offset can be greater than PAGE_SIZE" is the only
safe assumption for core code.
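
To make that concrete, here is a minimal, untested sketch (not from the
thread) of the decomposition that mapping code has to be prepared for when
sg->offset may exceed PAGE_SIZE, e.g. when sg_page() points at a huge
(compound) page; "sg" is an assumed struct scatterlist pointer:

    /* fold the page-aligned part of the offset into the physical address */
    phys_addr_t phys = page_to_phys(sg_page(sg)) + (sg->offset & PAGE_MASK);
    /* keep only the sub-page remainder, which is always < PAGE_SIZE */
    unsigned int off = sg->offset & ~PAGE_MASK;

    /* the buffer is [phys + off, phys + off + sg->length), no matter how
     * far past the first PAGE_SIZE bytes sg->offset happened to point */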


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-26 Thread Dan Williams
On Tue, Sep 26, 2017 at 9:06 AM, Casey Leedom  wrote:
> | From: Robin Murphy 
> | Sent: Tuesday, September 26, 2017 7:22 AM
>
> |
> | On 26/09/17 13:21, Harsh Jain wrote:
> | > Find attached new set of log. After repeated tries it panics.
> |
> | Thanks, that makes things a bit clearer - looks like fixing the physical
> | address/pteval calculation to not be off by a page in one direction wasn't
> | helping much because the returned DMA address is actually also off by a
> | page in the other direction, and thus overflowing past the allocated IOVA
> | into whoever else's mapping happened to be there; complete carnage ensues.
> |
> | After another look through the intel_map_sg() path, here's my second
> (still
> | completely untested) guess at a possible fix.
> |
> | Robin.
> |
> | ->8-
> | diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> | index 6784a05dd6b2..d7f7def81613 100644
> | --- a/drivers/iommu/intel-iommu.c
> | +++ b/drivers/iommu/intel-iommu.c
> | @@ -2254,10 +2254,12 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> |  uint64_t tmp;
> |
> |  if (!sg_res) {
> | +   size_t off = sg->offset & ~PAGE_MASK;
> | +
> |  sg_res = aligned_nrpages(sg->offset, sg->length);
> | -   sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> | +   sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
> |  sg->dma_length = sg->length;
> | -   pteval = page_to_phys(sg_page(sg)) | prot;
> | +   pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
> |  phys_pfn = pteval >> VTD_PAGE_SHIFT;
> |  }
>
>   Thanks Robin.  And thanks Harsh for sending the detailed trace logs.  I'll
> see if I can get this tested today.  Harsh is probably headed towards bed,
> but there may be sufficiently good instructions in our internal bug system
> to reproduce the issue.
>
>   Regardless, it seems that you agree that there's an issue with the Intel
> I/O MMU support code with regard to the legal values which a (struct
> scatterlist) can take on?  I still can't find any documentation for this
> and, personally, I'm a bit baffled by a Page-oriented Scatter/Gather List
> representation where [Offset, Offset+Length) can reside outside the Page.

Consider the case where the page represents a huge page, then an
offset greater than PAGE_SIZE (up to HPAGE_SIZE) makes sense.


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-25 Thread Dan Williams
On Mon, Sep 25, 2017 at 1:05 PM, Casey Leedom  wrote:
> | From: Dan Williams 
> | Sent: Monday, September 25, 2017 12:31 PM
> | ...
> | IIUC it looks like this has been broken ever since commit e1605495c716
> | "intel-iommu: Introduce domain_sg_mapping() to speed up
> | intel_map_sg()". I.e. it looks like the calculation for pte_val should
> | be:
> |
> | pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;
>
> Hhmmm, shouldn't that be:
>
> pteval = (page_to_phys(sg_page(sg)) + (sg->offset>>PAGE_SHIFT)) | prot;

Yes, I think you're right. We do want to mask off the page-unaligned
portion of sg->offset.
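
To make "mask off the page-unaligned portion" concrete: the split the fix
ends up needing looks roughly like Robin's second attempt quoted above in
this digest (untested, shown only for illustration; iov_pfn, prot and
phys_pfn are the surrounding __domain_mapping() locals):

    size_t off = sg->offset & ~PAGE_MASK;   /* page-unaligned part only */

    sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + off;
    sg->dma_length = sg->length;
    /* fold the page-aligned part of sg->offset into the physical address */
    pteval = (page_to_phys(sg_page(sg)) + sg->offset - off) | prot;
    phys_pfn = pteval >> VTD_PAGE_SHIFT;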


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-25 Thread Dan Williams
On Mon, Sep 25, 2017 at 10:46 AM, Casey Leedom  wrote:
> | From: Robin Murphy 
> | Sent: Wednesday, September 20, 2017 3:12 AM
> |
> | On 20/09/17 09:01, Herbert Xu wrote:
> | >
> | > Harsh Jain  wrote:
> | >>
> | >> While debugging DMA mapping error in chelsio crypto driver we
> | >> observed that when scatter/gather list received by driver has
> | >> some entry with page->offset > 4096 (PAGE_SIZE). It starts
> | >> giving DMA error.  Without IOMMU it works fine.
> | >
> | > This is not a bug.  The network stack can and will feed us such
> | > SG lists.
> | >
> | >> 2) It cannot be driver's responsibilty to update received sg
> | >> entries to adjust offset and page because we are not the only
> | >> one who directly uses received sg list.
> | >
> | > No the driver must deal with this.  Having said that, if we can
> | > improve our driver helper interface to make this easier then we
> | > should do that too.  What we certainly shouldn't do is to take a
> | > whack-a-mole approach like this patch does.
> |
> | AFAICS this is entirely on intel-iommu - from a brief look it appears
> | that all the IOVA calculations would handle the offset correctly, but
> | then __domain_mapping() blindly uses sg_page() for the physical address,
> | so if offset is larger than a page it would end up with the DMA mapping
> | covering the wrong part of the buffer.
> |
> | Does the diff below help?
> |
> | Robin.
> |
> | ->8-
> | diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> | index b3914fce8254..2ed43d928135 100644
> | --- a/drivers/iommu/intel-iommu.c
> | +++ b/drivers/iommu/intel-iommu.c
> | @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> |  sg_res = aligned_nrpages(sg->offset, sg->length);
> |  sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> |  sg->dma_length = sg->length;
> | -   pteval = page_to_phys(sg_page(sg)) | prot;
> | +   pteval = (sg_phys(sg) & PAGE_MASK) | prot;

This breaks on platforms where sizeof(phys_addr_t) > sizeof(unsigned
long). I.e. it's not always safe to assume that PAGE_MASK is the
correct width.
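
Spelled out as a hedged example (illustrative only): PAGE_MASK is derived
from unsigned long, so on a 32-bit kernel with a 64-bit phys_addr_t the mask
may not cover the upper address bits, depending on how the architecture
defines it:

    phys_addr_t phys = sg_phys(sg);

    /* if PAGE_MASK zero-extends from 32 bits, this silently clears the
     * high bits of physical addresses above 4G */
    phys_addr_t masked_bad = phys & PAGE_MASK;

    /* width-safe: build the mask in the physical-address type itself */
    phys_addr_t masked_ok = phys & ~(phys_addr_t)(PAGE_SIZE - 1);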

> |  phys_pfn = pteval >> VTD_PAGE_SHIFT;
> |  }
>
>   Adding some likely people to the Cc list so they can comment on this.
> Dan Williams submitted that specific piece of code in kernel.org:3e6110fd54
> ... but there are lots of similar bits in that function.  Hopefully one of
> the Intel I/O MMU Gurus will have a better idea of what may be going wrong
> here.  In the mean time I've asked our team to gather far more detailed
> debug traces showing the exact Scatter/Gather Lists we're getting, what they
> get translated to in the DMA Mappings, and what DMA Addresses were seeing in
> error.

IIUC it looks like this has been broken ever since commit e1605495c716
"intel-iommu: Introduce domain_sg_mapping() to speed up
intel_map_sg()". I.e. it looks like the calculation for pte_val should
be:

pteval = (page_to_phys(sg_page(sg)) + sg->offset) | prot;


Re: [PATCH v6 0/4] Broadcom SBA RAID support

2017-03-14 Thread Dan Williams
On Mon, Mar 6, 2017 at 1:43 AM, Anup Patel  wrote:
> The Broadcom SBA RAID is a stream-based device which provides
> RAID5/6 offload.
>
> It requires a SoC specific ring manager (such as Broadcom FlexRM
> ring manager) to provide ring-based programming interface. Due to
> this, the Broadcom SBA RAID driver (mailbox client) implements
> DMA device having one DMA channel using a set of mailbox channels
> provided by Broadcom SoC specific ring manager driver (mailbox
> controller).
>
> The Broadcom SBA RAID hardware requires PQ disk position instead
> of PQ disk coefficient. To address this, we have added raid_gflog
> table which will help driver to convert PQ disk coefficient to PQ
> disk position.
>
> This patchset is based on Linux-4.11-rc1 and depends on patchset
> "[PATCH v5 0/2] Broadcom FlexRM ring manager support"
>
> It is also available at sba-raid-v6 branch of
> https://github.com/Broadcom/arm64-linux.git
>
[..]
>
> Anup Patel (4):
>   lib/raid6: Add log-of-2 table for RAID6 HW requiring disk position
>   async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()
>   dmaengine: Add Broadcom SBA RAID driver
>   dt-bindings: Add DT bindings document for Broadcom SBA RAID driver

For the dmaengine and async_tx changes:

Acked-by: Dan Williams 

The raid change should get an ack from Shaohua.
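
As an aside on the raid_gflog table mentioned in the cover letter above: it
is the discrete-log inverse of the existing raid6_gfexp[] table, so a driver
that needs disk positions rather than Q coefficients can recover them along
these lines (illustrative sketch only; "pos" is an assumed disk index and
raid6_gflog is the table added by patch 1 of the series):

    #include <linux/raid/pq.h>

    /* RAID6 assigns disk 'pos' the Q coefficient g^pos in GF(2^8) */
    u8 coef = raid6_gfexp[pos];   /* position -> coefficient (existing table) */
    u8 disk = raid6_gflog[coef];  /* coefficient -> position, i.e. disk == pos */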


Re: [PATCH v4 3/4] dmaengine: Add Broadcom SBA RAID driver

2017-02-14 Thread Dan Williams
On Tue, Feb 14, 2017 at 11:03 PM, Anup Patel  wrote:
> On Wed, Feb 15, 2017 at 12:13 PM, Dan Williams  
> wrote:
>> On Tue, Feb 14, 2017 at 10:25 PM, Anup Patel  wrote:
>>> On Tue, Feb 14, 2017 at 10:04 PM, Dan Williams  
>>> wrote:
>>>> On Mon, Feb 13, 2017 at 10:51 PM, Anup Patel  
>>>> wrote:
>>>>> The Broadcom stream buffer accelerator (SBA) provides offloading
>>>>> capabilities for RAID operations. This SBA offload engine is
>>>>> accessible via Broadcom SoC specific ring manager.
>>>>>
>>>>> This patch adds Broadcom SBA RAID driver which provides one
>>>>> DMA device with RAID capabilities using one or more Broadcom
>>>>> SoC specific ring manager channels. The SBA RAID driver in its
>>>>> current shape implements memcpy, xor, and pq operations.
>>>>>
>>>>> Signed-off-by: Anup Patel 
>>>>> Reviewed-by: Ray Jui 
>>>>> ---
>>>>>  drivers/dma/Kconfig|   13 +
>>>>>  drivers/dma/Makefile   |1 +
>>>>>  drivers/dma/bcm-sba-raid.c | 1694 
>>>>> 
>>>>>  3 files changed, 1708 insertions(+)
>>>>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>>>>
>>>>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>>>>> index 263495d..bf8fb84 100644
>>>>> --- a/drivers/dma/Kconfig
>>>>> +++ b/drivers/dma/Kconfig
>>>>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>>>>   controller is often used in Analog Device's reference designs 
>>>>> for FPGA
>>>>>   platforms.
>>>>>
>>>>> +config BCM_SBA_RAID
>>>>> +   tristate "Broadcom SBA RAID engine support"
>>>>> +   depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>>>>> +   select DMA_ENGINE
>>>>> +   select DMA_ENGINE_RAID
>>>>> +   select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>>>
>>>> I thought you agreed to drop this. Its usage is broken.
>>>
>>> If ASYNC_TX_ENABLE_CHANNEL_SWITCH is not selected
>>> then async_dma_find_channel() will only try to find channel
>>> with DMA_ASYNC_TX capability.
>>>
>>> The DMA_ASYNC_TX capability is set by
>>> dma_async_device_register() when all Async Tx
>>> capabilities are supported by a DMA devices namely
>>> DMA_INTERRUPT, DMA_MEMCPY, DMA_XOR,
>>> DMA_XOR_VAL, DMA_PQ, and DMA_PQ_VAL.
>>>
>>> We only support DMA_MEMCPY, DMA_XOR, and
>>> DMA_PQ capabilities in BCM-SBA-RAID driver so
>>> DMA_ASYNC_TX capability is never set for the
>>> DMA device registered by BCM-SBA-RAID driver.
>>>
>>> Due to above, if ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>> is not selected then Async Tx APIs fail to find DMA
>>> channel provided by BCM-SBA-RAID hence the
>>> option ASYNC_TX_ENABLE_CHANNEL_SWITCH is
>>> required for BCM-SBA-RAID.
>>>
>>> The DMA mappings are violated by channel switching
> >>> only if we switch from DMA channel A to DMA channel
>>> B and both these DMA channels have different underlying
>>> "struct device". In most of the cases DMA mappings
>>> are not violated because DMA channels having
>>> Async Tx capabilities are provided using same
>>> underlying "struct device".
>>
>> No, fix the infrastructure. Do not put local hack in your driver for
>> this global problem [1].
>
> There is no hack in the driver. We need
> ASYNC_TX_ENABLE_CHANNEL_SWITCH
> based on current state of dmaengine framework.
>
> The framework should be fixed as separate patchset.
>
> We have other RAID drivers such as xgene-dma and
> mv_xor_v2 who also require
> ASYNC_TX_ENABLE_CHANNEL_SWITCH due
> to same reason.
>
> Fixing the framework and improving framework is
> a ongoing process. I don't see why that should
> stop this patchset.
>

Because this driver is turning on a dangerous compile time option and
is not using the functionality. If this silicon IP block appears in
another product in the future paired with another DMA engine then the
assumptions about a safe/single dma-device is violated.

The realization of how async_tx was breaking DMA mapping api
assumptions came after some of these dma-drivers were added to the
kernel. We should stop making the problem worse.

I should have submitted a patch like the below at the time we
discovered this problem, but unfortunately it languished when I
stopped maintaining the iop-adma and ioat drivers.

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 263495d0adbd..6b30eb9ad125 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -35,6 +35,7 @@ comment "DMA Devices"

 #core
 config ASYNC_TX_ENABLE_CHANNEL_SWITCH
+   depends on BROKEN
bool

 config ARCH_HAS_ASYNC_TX_FIND_CHANNEL


Re: [PATCH v4 3/4] dmaengine: Add Broadcom SBA RAID driver

2017-02-14 Thread Dan Williams
On Tue, Feb 14, 2017 at 10:25 PM, Anup Patel  wrote:
> On Tue, Feb 14, 2017 at 10:04 PM, Dan Williams  
> wrote:
>> On Mon, Feb 13, 2017 at 10:51 PM, Anup Patel  wrote:
>>> The Broadcom stream buffer accelerator (SBA) provides offloading
>>> capabilities for RAID operations. This SBA offload engine is
>>> accessible via Broadcom SoC specific ring manager.
>>>
>>> This patch adds Broadcom SBA RAID driver which provides one
>>> DMA device with RAID capabilities using one or more Broadcom
>>> SoC specific ring manager channels. The SBA RAID driver in its
>>> current shape implements memcpy, xor, and pq operations.
>>>
>>> Signed-off-by: Anup Patel 
>>> Reviewed-by: Ray Jui 
>>> ---
>>>  drivers/dma/Kconfig|   13 +
>>>  drivers/dma/Makefile   |1 +
>>>  drivers/dma/bcm-sba-raid.c | 1694 
>>> 
>>>  3 files changed, 1708 insertions(+)
>>>  create mode 100644 drivers/dma/bcm-sba-raid.c
>>>
>>> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
>>> index 263495d..bf8fb84 100644
>>> --- a/drivers/dma/Kconfig
>>> +++ b/drivers/dma/Kconfig
>>> @@ -99,6 +99,19 @@ config AXI_DMAC
>>>   controller is often used in Analog Device's reference designs for 
>>> FPGA
>>>   platforms.
>>>
>>> +config BCM_SBA_RAID
>>> +   tristate "Broadcom SBA RAID engine support"
>>> +   depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
>>> +   select DMA_ENGINE
>>> +   select DMA_ENGINE_RAID
>>> +   select ASYNC_TX_ENABLE_CHANNEL_SWITCH
>>
>> I thought you agreed to drop this. Its usage is broken.
>
> If ASYNC_TX_ENABLE_CHANNEL_SWITCH is not selected
> then async_dma_find_channel() will only try to find channel
> with DMA_ASYNC_TX capability.
>
> The DMA_ASYNC_TX capability is set by
> dma_async_device_register() when all Async Tx
> capabilities are supported by a DMA devices namely
> DMA_INTERRUPT, DMA_MEMCPY, DMA_XOR,
> DMA_XOR_VAL, DMA_PQ, and DMA_PQ_VAL.
>
> We only support DMA_MEMCPY, DMA_XOR, and
> DMA_PQ capabilities in BCM-SBA-RAID driver so
> DMA_ASYNC_TX capability is never set for the
> DMA device registered by BCM-SBA-RAID driver.
>
> Due to above, if ASYNC_TX_ENABLE_CHANNEL_SWITCH
> is not selected then Async Tx APIs fail to find DMA
> channel provided by BCM-SBA-RAID hence the
> option ASYNC_TX_ENABLE_CHANNEL_SWITCH is
> required for BCM-SBA-RAID.
>
> The DMA mappings are violated by channel switching
> only if we switch from DMA channel A to DMA channel
> B and both these DMA channels have different underlying
> "struct device". In most of the cases DMA mappings
> are not violated because DMA channels having
> Async Tx capabilities are provided using same
> underlying "struct device".

No, fix the infrastructure. Do not put local hack in your driver for
this global problem [1].

[1]: https://lwn.net/Articles/443531/


Re: [PATCH v4 3/4] dmaengine: Add Broadcom SBA RAID driver

2017-02-14 Thread Dan Williams
On Mon, Feb 13, 2017 at 10:51 PM, Anup Patel  wrote:
> The Broadcom stream buffer accelerator (SBA) provides offloading
> capabilities for RAID operations. This SBA offload engine is
> accessible via Broadcom SoC specific ring manager.
>
> This patch adds Broadcom SBA RAID driver which provides one
> DMA device with RAID capabilities using one or more Broadcom
> SoC specific ring manager channels. The SBA RAID driver in its
> current shape implements memcpy, xor, and pq operations.
>
> Signed-off-by: Anup Patel 
> Reviewed-by: Ray Jui 
> ---
>  drivers/dma/Kconfig|   13 +
>  drivers/dma/Makefile   |1 +
>  drivers/dma/bcm-sba-raid.c | 1694 
> 
>  3 files changed, 1708 insertions(+)
>  create mode 100644 drivers/dma/bcm-sba-raid.c
>
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index 263495d..bf8fb84 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -99,6 +99,19 @@ config AXI_DMAC
>   controller is often used in Analog Device's reference designs for 
> FPGA
>   platforms.
>
> +config BCM_SBA_RAID
> +   tristate "Broadcom SBA RAID engine support"
> +   depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
> +   select DMA_ENGINE
> +   select DMA_ENGINE_RAID
> +   select ASYNC_TX_ENABLE_CHANNEL_SWITCH

I thought you agreed to drop this. Its usage is broken.


Re: [PATCH v3 3/4] dmaengine: Add Broadcom SBA RAID driver

2017-02-10 Thread Dan Williams
On Fri, Feb 10, 2017 at 1:07 AM, Anup Patel  wrote:
> The Broadcom stream buffer accelerator (SBA) provides offloading
> capabilities for RAID operations. This SBA offload engine is
> accessible via Broadcom SoC specific ring manager.
>
> This patch adds Broadcom SBA RAID driver which provides one
> DMA device with RAID capabilities using one or more Broadcom
> SoC specific ring manager channels. The SBA RAID driver in its
> current shape implements memcpy, xor, and pq operations.
>
> Signed-off-by: Anup Patel 
> Reviewed-by: Ray Jui 
> ---
>  drivers/dma/Kconfig|   13 +
>  drivers/dma/Makefile   |1 +
>  drivers/dma/bcm-sba-raid.c | 1711 
> 
>  3 files changed, 1725 insertions(+)
>  create mode 100644 drivers/dma/bcm-sba-raid.c
>
> diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
> index 263495d..bf8fb84 100644
> --- a/drivers/dma/Kconfig
> +++ b/drivers/dma/Kconfig
> @@ -99,6 +99,19 @@ config AXI_DMAC
>   controller is often used in Analog Device's reference designs for 
> FPGA
>   platforms.
>
> +config BCM_SBA_RAID
> +   tristate "Broadcom SBA RAID engine support"
> +   depends on (ARM64 && MAILBOX && RAID6_PQ) || COMPILE_TEST
> +   select DMA_ENGINE
> +   select DMA_ENGINE_RAID
> +   select ASYNC_TX_ENABLE_CHANNEL_SWITCH

ASYNC_TX_ENABLE_CHANNEL_SWITCH violates the DMA mapping API and
Russell has warned it's especially problematic on ARM [1].  If you
need channel switching for this offload engine to be useful then you
need to move DMA mapping and channel switching responsibilities to MD
itself.

[1]: 
http://lists.infradead.org/pipermail/linux-arm-kernel/2011-January/036753.html


[..]
> diff --git a/drivers/dma/bcm-sba-raid.c b/drivers/dma/bcm-sba-raid.c
> new file mode 100644
> index 000..bab9918
> --- /dev/null
> +++ b/drivers/dma/bcm-sba-raid.c
> @@ -0,0 +1,1711 @@
> +/*
> + * Copyright (C) 2017 Broadcom
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +/*
> + * Broadcom SBA RAID Driver
> + *
> + * The Broadcom stream buffer accelerator (SBA) provides offloading
> + * capabilities for RAID operations. The SBA offload engine is accessible
> + * via Broadcom SoC specific ring manager. Two or more offload engines
> + * can share the same Broadcom SoC specific ring manager; because of this,
> + * the Broadcom SoC specific ring manager driver is implemented as a mailbox
> + * controller driver and offload engine drivers are implemented as mailbox clients.
> + *
> + * Typically, Broadcom SoC specific ring manager will implement larger
> + * number of hardware rings over one or more SBA hardware devices. By
> + * design, the internal buffer size of SBA hardware device is limited
> + * but all offload operations supported by SBA can be broken down into
> + * multiple small-size requests and executed in parallel on multiple SBA
> + * hardware devices to achieve high throughput.
> + *
> + * The Broadcom SBA RAID driver does not require any register programming
> + * except submitting request to SBA hardware device via mailbox channels.
> + * This driver implements a DMA device with one DMA channel using a set
> + * of mailbox channels provided by Broadcom SoC specific ring manager
> + * driver. To exploit parallelism (as described above), all DMA request
> + * coming to SBA RAID DMA channel are broken down to smaller requests
> + * and submitted to multiple mailbox channels in round-robin fashion.
> + * For having more SBA DMA channels, we can create more SBA device nodes
> + * in Broadcom SoC specific DTS based on number of hardware rings supported
> + * by Broadcom SoC ring manager.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "dmaengine.h"
> +
> +/* SBA command helper macros */
> +#define SBA_DEC(_d, _s, _m)(((_d) >> (_s)) & (_m))
> +#define SBA_ENC(_d, _v, _s, _m)\
> +   do {\
> +   (_d) &= ~((u64)(_m) << (_s));   \
> +   (_d) |= (((u64)(_v) & (_m)) << (_s));   \
> +   } while (0)

Reusing a macro argument multiple times is problematic (consider
SBA_ENC(..., arg++, ...)), and hiding assignments in a macro makes this
hard to read. The compiler should inline it properly if you just make
this a function that returns a value. You could also mark it __pure.
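
A sketch of the function form being suggested, which evaluates each argument
exactly once and makes the assignment explicit at the call site (this is only
an illustration of the idea, not the driver's code; sba_cmd_enc and the
SBA_TYPE_* names are assumptions):

    static inline __pure u64 sba_cmd_enc(u64 cmd, u32 val, u32 shift, u32 mask)
    {
            cmd &= ~((u64)mask << shift);
            cmd |= ((u64)val & mask) << shift;
            return cmd;
    }

    /* caller makes the assignment visible: */
    cmd = sba_cmd_enc(cmd, type, SBA_TYPE_SHIFT, SBA_TYPE_MASK);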

[..]
> +
> +static struct sba_request *sba_alloc_request(struct sba_device *sba)
> +{
> +   unsigned long flags;
> +   struct sba_request *req = NULL;
> +
> +   spin_lock_irqsave(&sba->reqs_lock, flags);
> +
> +   if (!list_empty(&sba->reqs_free_list)) {
> +   req = list_first_entry(&sba->reqs_free_l

Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-09 Thread Dan Williams
On Thu, Feb 9, 2017 at 1:29 AM, Anup Patel  wrote:
> On Wed, Feb 8, 2017 at 9:54 PM, Dan Williams  wrote:
>> On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel  wrote:
>>> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams  
>>> wrote:
>>>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel  wrote:
>>>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams  
>>>>> wrote:
>>>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel  
>>>>>> wrote:
>>>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>>>
>>>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>>>> for fewer PQ coefficients.
>>>>>>>
>>>>>>> Signed-off-by: Anup Patel 
>>>>>>> Reviewed-by: Scott Branden 
>>>>>>
>>>>>> I don't like this approach. Define an interface for md to query the
>>>>>> offload engine once at the beginning of time. We should not be adding
>>>>>> any new extensions to async_tx.
>>>>>
>>>>> Even if we do capability checks in Linux MD, we still need a way
>>>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>>>> handled by the HW.
>>>>>
>>>>> I agree capability checks should be done once in Linux MD but I don't
>>>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>>>> separate patchsets to address limitations of async_tx framework.
>>>>
>>>> Right, separate enabling before we pile on new hardware support to a
>>>> known broken framework.
>>>
> >>> Linux Async Tx is not a broken framework. The issue is:
> >>> 1. It's not complete enough
> >>> 2. It's not optimized for very high-throughput offload engines
>>
>> I'm not understanding your point. I'm nak'ing this change to add yet
>> more per-transaction capability checking to async_tx. I don't like the
>> DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
>> DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
>> be fixed before this new hardware support, I'm simply saying we should
>> start the process of moving offload-engine capability checking to the
>> raid code.
>
> The DMA_HAS_FEWER_PQ_COEF is not equal to
> DMA_HAS_PQ_CONTINUE.

#define DMA_HAS_PQ_CONTINUE (1 << 15)
#define DMA_HAS_FEWER_PQ_COEF (1 << 15)

> I will try to drop this patch and take care of unsupported PQ
> coefficients in BCM-SBA-RAID driver itself even if this means
> doing some computations in BCM-SBA-RAID driver itself.

That should be nak'd as well, please do capability detection in a
routine that is common to all raid engines.


Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-08 Thread Dan Williams
On Wed, Feb 8, 2017 at 12:57 AM, Anup Patel  wrote:
> On Tue, Feb 7, 2017 at 11:46 PM, Dan Williams  
> wrote:
>> On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel  wrote:
>>> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams  
>>> wrote:
>>>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel  
>>>> wrote:
>>>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>>>> which supports PQ offload with limited number of PQ coefficients.
>>>>>
>>>>> This patch extends async_tx APIs to handle DMA devices with support
>>>>> for fewer PQ coefficients.
>>>>>
>>>>> Signed-off-by: Anup Patel 
>>>>> Reviewed-by: Scott Branden 
>>>>
>>>> I don't like this approach. Define an interface for md to query the
>>>> offload engine once at the beginning of time. We should not be adding
>>>> any new extensions to async_tx.
>>>
>>> Even if we do capability checks in Linux MD, we still need a way
>>> for DMAENGINE drivers to advertise number of PQ coefficients
>>> handled by the HW.
>>>
>>> I agree capability checks should be done once in Linux MD but I don't
>>> see why this has to be part of BCM-SBA-RAID driver patches. We need
>>> separate patchsets to address limitations of async_tx framework.
>>
>> Right, separate enabling before we pile on new hardware support to a
>> known broken framework.
>
> Linux Async Tx is not a broken framework. The issue is:
> 1. It's not complete enough
> 2. It's not optimized for very high-throughput offload engines

I'm not understanding your point. I'm nak'ing this change to add yet
more per-transaction capability checking to async_tx. I don't like the
DMA_HAS_FEWER_PQ_COEF flag, especially since it is equal to
DMA_HAS_PQ_CONTINUE. I'm not asking for all of async_tx's problems to
be fixed before this new hardware support, I'm simply saying we should
start the process of moving offload-engine capability checking to the
raid code.


Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-07 Thread Dan Williams
On Tue, Feb 7, 2017 at 1:02 AM, Anup Patel  wrote:
> On Tue, Feb 7, 2017 at 1:57 PM, Dan Williams  wrote:
>> On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel  wrote:
>>> The DMAENGINE framework assumes that if PQ offload is supported by a
>>> DMA device then all 256 PQ coefficients are supported. This assumption
>>> does not hold anymore because we now have BCM-SBA-RAID offload engine
>>> which supports PQ offload with limited number of PQ coefficients.
>>>
>>> This patch extends async_tx APIs to handle DMA devices with support
>>> for fewer PQ coefficients.
>>>
>>> Signed-off-by: Anup Patel 
>>> Reviewed-by: Scott Branden 
>>
>> I don't like this approach. Define an interface for md to query the
>> offload engine once at the beginning of time. We should not be adding
>> any new extensions to async_tx.
>
> Even if we do capability checks in Linux MD, we still need a way
> for DMAENGINE drivers to advertise number of PQ coefficients
> handled by the HW.
>
> I agree capability checks should be done once in Linux MD but I don't
> see why this has to be part of BCM-SBA-RAID driver patches. We need
> separate patchsets to address limitations of async_tx framework.

Right, separate enabling before we pile on new hardware support to a
known broken framework.


Re: [PATCH v2 2/5] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-07 Thread Dan Williams
On Tue, Feb 7, 2017 at 12:16 AM, Anup Patel  wrote:
> The DMAENGINE framework assumes that if PQ offload is supported by a
> DMA device then all 256 PQ coefficients are supported. This assumption
> does not hold anymore because we now have BCM-SBA-RAID offload engine
> which supports PQ offload with limited number of PQ coefficients.
>
> This patch extends async_tx APIs to handle DMA devices with support
> for fewer PQ coefficients.
>
> Signed-off-by: Anup Patel 
> Reviewed-by: Scott Branden 

I don't like this approach. Define an interface for md to query the
offload engine once at the beginning of time. We should not be adding
any new extensions to async_tx.


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-03 Thread Dan Williams
On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel  wrote:
>
>
> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams 
> wrote:
>>
>> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel 
>> wrote:
>> > The DMAENGINE framework assumes that if PQ offload is supported by a
>> > DMA device then all 256 PQ coefficients are supported. This assumption
>> > does not hold anymore because we now have BCM-SBA-RAID offload engine
>> > which supports PQ offload with limited number of PQ coefficients.
>> >
>> > This patch extends async_tx APIs to handle DMA devices with support
>> > for fewer PQ coefficients.
>> >
>> > Signed-off-by: Anup Patel 
>> > Reviewed-by: Scott Branden 
>> > ---
>> >  crypto/async_tx/async_pq.c  |  3 +++
>> >  crypto/async_tx/async_raid6_recov.c | 12 ++--
>> >  include/linux/dmaengine.h   | 19 +++
>> >  include/linux/raid/pq.h |  3 +++
>> >  4 files changed, 35 insertions(+), 2 deletions(-)
>>
>> So, I hate the way async_tx does these checks on each operation, and
>> it's ok for me to say that because it's my fault. Really it's md that
>> should be validating engine offload capabilities once at the beginning
>> of time. I'd rather we move in that direction than continue to pile
>> onto a bad design.
>
>
> Yes, indeed. All async_tx APIs have a lot of checks, and for a high-throughput
> RAID offload engine these checks can add some overhead.
>
> I think doing the checks in Linux md would certainly be better, but this would
> mean a lot of changes in Linux md as well as removing checks from async_tx.
>
> Also, the async_tx APIs should not find a DMA channel on their own; instead
> they should rely on Linux md to provide a DMA channel pointer as a parameter.
>
> It's better to do the checks cleanup in async_tx as a separate patchset and
> keep this patchset simple.

That's been the problem with async_tx being broken like this for
years. Once you get this "small / simple" patch upstream, that
arguably makes async_tx a little bit worse, there is no longer any
motivation to fix the underlying issues. If you care about the long
term health of raid offload and are enabling new hardware support you
should first tackle the known problems with it before adding new
features.


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-01 Thread Dan Williams
On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel  wrote:
> The DMAENGINE framework assumes that if PQ offload is supported by a
> DMA device then all 256 PQ coefficients are supported. This assumption
> does not hold anymore because we now have BCM-SBA-RAID offload engine
> which supports PQ offload with limited number of PQ coefficients.
>
> This patch extends async_tx APIs to handle DMA devices with support
> for fewer PQ coefficients.
>
> Signed-off-by: Anup Patel 
> Reviewed-by: Scott Branden 
> ---
>  crypto/async_tx/async_pq.c  |  3 +++
>  crypto/async_tx/async_raid6_recov.c | 12 ++--
>  include/linux/dmaengine.h   | 19 +++
>  include/linux/raid/pq.h |  3 +++
>  4 files changed, 35 insertions(+), 2 deletions(-)

So, I hate the way async_tx does these checks on each operation, and
it's ok for me to say that because it's my fault. Really it's md that
should be validating engine offload capabilities once at the beginning
of time. I'd rather we move in that direction than continue to pile
onto a bad design.
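
To make "validate once at the beginning of time" concrete, a rough,
hypothetical direction (none of these md/conf fields exist today, and the
probe point is an assumption) would be a single dmaengine query when the
array is assembled:

    /* hypothetical one-time probe at array assembly, not existing md code */
    struct dma_chan *chan = dma_find_channel(DMA_PQ);

    if (chan) {
            struct dma_device *dev = chan->device;

            conf->pq_chan = chan;                     /* hypothetical field */
            conf->max_hw_pq_srcs = dma_maxpq(dev, 0); /* engine's PQ source limit */
    } else {
            conf->pq_chan = NULL;                     /* fall back to CPU paths */
    }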


Re: [PATCH] crypto/async_pq: use __free_page() instead of put_page()

2016-02-29 Thread Dan Williams
On Mon, Feb 29, 2016 at 1:33 AM, Arnd Bergmann  wrote:
> The addition of tracepoints to the page reference tracking had an
> unfortunate side-effect in at least one driver that calls put_page
> from its exit function, resulting in a link error:
>
> `.exit.text' referenced in section `__jump_table' of crypto/built-in.o: defined in discarded section `.exit.text' of crypto/built-in.o
>
> From a cursory look at that this driver, it seems that it may be
> doing the wrong thing here anyway, as the page gets allocated
> using 'alloc_page()', and should be freed using '__free_page()'
> rather than 'put_page()'.
>
> With this patch, I no longer get any other build errors from the
> page_ref patch, so hopefully we can assume that it's always wrong
> to call any of those functions from __exit code, and that no other
> driver does it.
>
> Fixes: 0f80830dd044 ("mm/page_ref: add tracepoint to track down page reference manipulation")
> Signed-off-by: Arnd Bergmann 

Acked-by: Dan Williams 

Vinod, will you take this one?


Re: [PATCH 6/6] scatterlist: use sg_phys()

2015-08-07 Thread Dan Williams
On Fri, Aug 7, 2015 at 9:15 AM, Christoph Hellwig  wrote:
> From: Dan Williams 
>
> Coccinelle cleanup to replace open coded sg to physical address
> translations.  This is in preparation for introducing scatterlists that
> reference __pfn_t.
>
> // sg_phys.cocci: convert usage page_to_phys(sg_page(sg)) to sg_phys(sg)
> // usage: make coccicheck COCCI=sg_phys.cocci MODE=patch
>
> virtual patch
> virtual report
> virtual org
>
> @@
> struct scatterlist *sg;
> @@
>
> - page_to_phys(sg_page(sg)) + sg->offset
> + sg_phys(sg)
>
> @@
> struct scatterlist *sg;
> @@
>
> - page_to_phys(sg_page(sg))
> + sg_phys(sg) - sg->offset

So this ends up being an awkward conversion, a better one is:

- page_to_phys(sg_page(sg))
+ sg_phys(sg) & PAGE_MASK

[..]
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 5852df8..e0faecc 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2098,7 +2098,7 @@ static int __domain_mapping(struct dmar_domain *domain, unsigned long iov_pfn,
> sg_res = aligned_nrpages(sg->offset, sg->length);
> sg->dma_address = ((dma_addr_t)iov_pfn << VTD_PAGE_SHIFT) + sg->offset;
> sg->dma_length = sg->length;
> -   pteval = page_to_phys(sg_page(sg)) | prot;
> +   pteval = (sg_phys(sg) - sg->offset) | prot;

For example here, as Joerg pointed out.

[1]: https://lkml.org/lkml/2015/6/10/230


Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading

2015-05-26 Thread Dan Williams
[ adding Boaz as this discussion has implications for ore_raid ]

On Tue, May 26, 2015 at 2:45 AM, Maxime Ripard
 wrote:
> On Mon, May 18, 2015 at 10:06:55AM -0700, Dan Williams wrote:
>> On Mon, May 18, 2015 at 2:14 AM, Maxime Ripard
>>  wrote:
>> > Hi Dan,
>> >
>> > On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote:
>> >> On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard
>> >>  wrote:
>> >> > Hi Dan,
>> >> >
>> >> > On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote:
>> >> >> On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard
>> >> >>  wrote:
>> >> >> > Hi,
>> >> >> >
>> >> >> > This serie refactors the mv_xor in order to support the latest Armada
>> >> >> > 38x features, including the PQ support in order to offload the RAID6
>> >> >> > PQ operations.
>> >> >> >
>> >> >> > Not all the PQ operations are supported by the XOR engine, so we had
>> >> >> > to introduce new async_tx flags in the process to identify
>> >> >> > un-supported operations.
>> >> >> >
>> >> >> > Please note that this is currently not usable because of a possible
>> >> >> > regression in the RAID stack in 4.1 that is being discussed at the
>> >> >> > moment here: https://lkml.org/lkml/2015/5/7/527
>> >> >>
>> >> >> This is problematic as async_tx is a wart on the dmaengine subsystem
>> >> >> and needs to be deprecated, I just have yet to find the time to do
>> >> >> that work.  It turns out it was a mistake to hide the device details
>> >> >> from md, it should be explicitly managing the dma channels, not
>> >> >> relying on a abstraction api.  The async_tx api usage of the
>> >> >> dma-mapping api is broken in that it relies on overlapping mappings of
>> >> >> the same address.  This happens to work on x86, but on arm it needs
>> >> >> explicit non-overlapping mappings.  I started the work to reference
>> >> >> count dma-mappings in 3.13, and we need to teach md to use
>> >> >> dmaengine_unmap_data explicitly.  Yielding dma channel management to
>> >> >> md also results in a more efficient implementation as we can dma_map()
>> >> >> the stripe cache once rather than per-io.  The  "async_tx_ack()"
>> >> >> disaster can also go away when md is explicitly handling channel
>> >> >> switching.
>> >> >
>> >> > Even though I'd be very much in favor of deprecating / removing
>> >> > async_tx, is it something likely to happen soon?
>> >>
>> >> Not unless someone else takes it on, I'm actively asking for help.
>> >>
>> >> > I remember discussing this with Vinod at Plumbers back in October, but
>> >> > haven't seen anything since then.
>> >>
>> >> Right, "help!" :)
>> >>
>> >> > If not, I think that we shouldn't really hold back patches to
>> >> > async_tx, even though we know than in a year from now, it's going to
>> >> > be gone.
>> >>
>> >> We definitely should block new usages, because they make a bad
>> >> situation worse.  Russell already warned that the dma_mapping api
>> >> abuse could lead to data corruption on ARM (speculative pre-fetching).
>> >> We need to mark ASYNC_TX_DMA as "depends on !ARM" or even "depends on
>> >> BROKEN" until we can get this resolved.
>> >
>> > I'm not sure what the issues exactly are with async_tx and ARM, but
>> > these patches have been tested on ARM and are working quite well.
>>
>> https://lkml.org/lkml/2011/7/8/363
>>
>> > What I'm doing here is merely using the existing API, I'm not making
>> > it worse, just using the API that is used by numerous drivers
>> > already. So I'm not sure this is really reasonable to ask for such a
>> > huge rework (with a huge potential of regressions) before merging my
>> > patches.
>>
>> It happens.
>>
>> https://lwn.net/Articles/641443/
>
> It really depends on what you mean by "help". If you mean "undertake
> all by yourself the removal of async tx", then no, sorry, I won't,
> espec

Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading

2015-05-18 Thread Dan Williams
On Mon, May 18, 2015 at 2:14 AM, Maxime Ripard
 wrote:
> Hi Dan,
>
> On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote:
>> On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard
>>  wrote:
>> > Hi Dan,
>> >
>> > On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote:
>> >> On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard
>> >>  wrote:
>> >> > Hi,
>> >> >
>> >> > This serie refactors the mv_xor in order to support the latest Armada
>> >> > 38x features, including the PQ support in order to offload the RAID6
>> >> > PQ operations.
>> >> >
>> >> > Not all the PQ operations are supported by the XOR engine, so we had
>> >> > to introduce new async_tx flags in the process to identify
>> >> > un-supported operations.
>> >> >
>> >> > Please note that this is currently not usable because of a possible
>> >> > regression in the RAID stack in 4.1 that is being discussed at the
>> >> > moment here: https://lkml.org/lkml/2015/5/7/527
>> >>
>> >> This is problematic as async_tx is a wart on the dmaengine subsystem
>> >> and needs to be deprecated, I just have yet to find the time to do
>> >> that work.  It turns out it was a mistake to hide the device details
>> >> from md, it should be explicitly managing the dma channels, not
>> >> relying on a abstraction api.  The async_tx api usage of the
>> >> dma-mapping api is broken in that it relies on overlapping mappings of
>> >> the same address.  This happens to work on x86, but on arm it needs
>> >> explicit non-overlapping mappings.  I started the work to reference
>> >> count dma-mappings in 3.13, and we need to teach md to use
>> >> dmaengine_unmap_data explicitly.  Yielding dma channel management to
>> >> md also results in a more efficient implementation as we can dma_map()
>> >> the stripe cache once rather than per-io.  The  "async_tx_ack()"
>> >> disaster can also go away when md is explicitly handling channel
>> >> switching.
>> >
>> > Even though I'd be very much in favor of deprecating / removing
>> > async_tx, is it something likely to happen soon?
>>
>> Not unless someone else takes it on, I'm actively asking for help.
>>
>> > I remember discussing this with Vinod at Plumbers back in October, but
>> > haven't seen anything since then.
>>
>> Right, "help!" :)
>>
>> > If not, I think that we shouldn't really hold back patches to
>> > async_tx, even though we know than in a year from now, it's going to
>> > be gone.
>>
>> We definitely should block new usages, because they make a bad
>> situation worse.  Russell already warned that the dma_mapping api
>> abuse could lead to data corruption on ARM (speculative pre-fetching).
>> We need to mark ASYNC_TX_DMA as "depends on !ARM" or even "depends on
>> BROKEN" until we can get this resolved.
>
> I'm not sure what the issues exactly are with async_tx and ARM, but
> these patches have been tested on ARM and are working quite well.

https://lkml.org/lkml/2011/7/8/363

> What I'm doing here is merely using the existing API, I'm not making
> it worse, just using the API that is used by numerous drivers
> already. So I'm not sure this is really reasonable to ask for such a
> huge rework (with a huge potential of regressions) before merging my
> patches.

It happens.

https://lwn.net/Articles/641443/

I'm not happy about not having had the time to do this rework myself.
Linux is better off with this api deprecated.
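
For readers who have not seen it, the reference-counted unmap bookkeeping
referred to above (struct dmaengine_unmap_data, merged in 3.13) is used
roughly as below; this is a simplified memcpy sketch, not md code, and
chan/src_page/dst_page/len are assumed to exist:

    struct dmaengine_unmap_data *unmap;
    struct dma_async_tx_descriptor *tx;
    struct device *dev = chan->device->dev;

    unmap = dmaengine_get_unmap_data(dev, 2, GFP_NOWAIT);
    /* (NULL check omitted for brevity) */
    unmap->addr[0] = dma_map_page(dev, src_page, 0, len, DMA_TO_DEVICE);
    unmap->to_cnt = 1;
    unmap->addr[1] = dma_map_page(dev, dst_page, 0, len, DMA_FROM_DEVICE);
    unmap->from_cnt = 1;
    unmap->len = len;

    tx = chan->device->device_prep_dma_memcpy(chan, unmap->addr[1],
                                               unmap->addr[0], len, 0);
    if (tx) {
            dma_set_unmap(tx, unmap);  /* the descriptor takes a reference */
            dmaengine_submit(tx);
    }
    dmaengine_unmap_put(unmap);        /* drop the submitter's reference; the
                                          pages stay mapped until completion */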


Re: [PATCH v7 1/8] Talitos: Support for async_tx XOR offload

2012-09-04 Thread Dan Williams
On Tue, Sep 4, 2012 at 5:28 AM, Liu Qiang-B32616  wrote:
>> Will this engine be coordinating with another to handle memory copies?
>>  The dma mapping code for async_tx/raid is broken when dma mapping
>> requests overlap or cross dma device boundaries [1].
>>
>> [1]: http://marc.info/?l=linux-arm-kernel&m=129407269402930&w=2
> Yes, it needs fsl-dma to handle memcpy copies.
> I read your link, the unmap address is stored in talitos hwdesc, the address 
> will be unmapped when async_tx ack this descriptor, I know fsl-dma won't wait 
> this ack flag in current kernel, so I fix it in fsl-dma patch 5/8. Do you 
> mean that?

Unfortunately no.  I'm open to other suggestions. but as far as I can
see it requires deeper changes to rip out the dma mapping that happens
in async_tx and the automatic unmapping done by drivers.  It should
all be pushed to the client (md).

Currently async_tx hides hardware details from md such that it doesn't
even care if the operation is offloaded to hardware at all, but that
takes things too far.  In the worst case an copy->xor chain handled by
multiple channels results in :

1/ dma_map(copy_chan...)
2/ dma_map(xor_chan...)
3/ 
4/ dma_unmap(copy_chan...)
5/  <---initiated by the copy_chan
6/ dma_unmap(xor_chan...)

Step 2 violates the dma api since the buffers belong to the xor_chan
until unmap.  Step 5 also causes the random completion context of the
copy channel to bleed into submission context of the xor channel which
is problematic.  So the order needs to be:

1/ dma_map(copy_chan...)
2/ 
3/ dma_unmap(copy_chan...)
4/ dma_map(xor_chan...)
5/  <--initiated by md in a static context
6/ dma_unmap(xor_chan...)

Also, if xor_chan and copy_chan lie with the same dma mapping domain
(iommu or parent device) then we can map the stripe once and skip the
extra maintenance for the duration of the chain of operations.  This
dumps a lot of hardware details on md, but I think it is the only way
to get consistent semantics when arbitrary offload devices are
involved.
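
A sketch of that ordering with the client owning the mappings (names such as
copy_chan, src_page and tmp_page are illustrative, and error handling is
omitted):

    struct device *copy_dev = copy_chan->device->dev;
    struct dma_async_tx_descriptor *tx;
    dma_addr_t src, dst;
    dma_cookie_t cookie;

    /* 1/ map the buffers for the copy channel's device only */
    src = dma_map_page(copy_dev, src_page, 0, len, DMA_TO_DEVICE);
    dst = dma_map_page(copy_dev, tmp_page, 0, len, DMA_BIDIRECTIONAL);

    /* 2/ run the copy and wait for it in the client's own context */
    tx = copy_chan->device->device_prep_dma_memcpy(copy_chan, dst, src, len, 0);
    cookie = dmaengine_submit(tx);
    dma_async_issue_pending(copy_chan);
    dma_sync_wait(copy_chan, cookie);

    /* 3/ unmap for the copy device before any other device touches it */
    dma_unmap_page(copy_dev, src, len, DMA_TO_DEVICE);
    dma_unmap_page(copy_dev, dst, len, DMA_BIDIRECTIONAL);

    /* 4-6/ only now map the same buffers for the xor channel's device,
     *      submit the xor, and unmap again when it completes */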

--
Dan


Re: [PATCH v7 6/8] fsl-dma: use spin_lock_bh to instead of spin_lock_irqsave

2012-09-02 Thread Dan Williams
On Thu, Aug 9, 2012 at 1:23 AM,   wrote:
> From: Qiang Liu 
>
> The use of spin_lock_irqsave() is a stronger locking mechanism than is
> required throughout the driver. The minimum locking required should be
> used instead. Interrupts are turned off and context is saved needlessly;
> there is no need to use irqsave.
>
> Change all instances of spin_lock_irqsave() to spin_lock_bh().
> All manipulation of protected fields is done using tasklet context or
> weaker, which makes spin_lock_bh() the correct choice.

It seems you are coordinating fsl-dma copy and talitos xor operations.
 It looks like fsl-dma will be called through
talitos_process_pending()->dma_run_dependencies(), which is
potentially called in hard irq context.

This all comes back to the need to fix raid offload to manage the
channels explicitly rather than the current dependency chains.
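
In other words, once any path can take the lock from hard interrupt context,
the irq-disabling form is required on the paths that share it; a minimal
sketch (lock and field names are illustrative):

    /* spin_lock_bh() only blocks softirqs; if talitos_error() ->
     * talitos_process_pending() -> dma_run_dependencies() can reach this
     * lock from hardirq context, the irqsave form is needed here */
    unsigned long flags;

    spin_lock_irqsave(&chan->desc_lock, flags);
    /* ... manipulate the pending / in-progress descriptor lists ... */
    spin_unlock_irqrestore(&chan->desc_lock, flags);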

--
Dan


Re: [PATCH v7 1/8] Talitos: Support for async_tx XOR offload

2012-09-02 Thread Dan Williams
On Thu, Aug 9, 2012 at 1:20 AM,   wrote:
> From: Qiang Liu 
>
> Expose Talitos's XOR functionality to be used for RAID parity
> calculation via the Async_tx layer.
>
> Cc: Herbert Xu 
> Cc: David S. Miller 
> Signed-off-by: Dipen Dudhat 
> Signed-off-by: Maneesh Gupta 
> Signed-off-by: Kim Phillips 
> Signed-off-by: Vishnu Suresh 
> Signed-off-by: Qiang Liu 
> ---
>  drivers/crypto/Kconfig   |9 +
>  drivers/crypto/talitos.c |  413 
> ++
>  drivers/crypto/talitos.h |   53 ++
>  3 files changed, 475 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
> index be6b2ba..f0a7c29 100644
> --- a/drivers/crypto/Kconfig
> +++ b/drivers/crypto/Kconfig
> @@ -222,6 +222,15 @@ config CRYPTO_DEV_TALITOS
>   To compile this driver as a module, choose M here: the module
>   will be called talitos.
>
> +config CRYPTO_DEV_TALITOS_RAIDXOR
> +   bool "Talitos RAID5 XOR Calculation Offload"
> +   default y

No "default y". The user should explicitly enable this.

> +   select DMA_ENGINE
> +   depends on CRYPTO_DEV_TALITOS
> +   help
> + Say 'Y' here to use the Freescale Security Engine (SEC) to
> + offload RAID XOR parity Calculation
> +

Is it faster than cpu xor?

Will this engine be coordinating with another to handle memory copies?
 The dma mapping code for async_tx/raid is broken when dma mapping
requests overlap or cross dma device boundaries [1].

[1]: http://marc.info/?l=linux-arm-kernel&m=129407269402930&w=2

> +static void talitos_process_pending(struct talitos_xor_chan *xor_chan)
> +{
> +   struct talitos_xor_desc *desc, *_desc;
> +   unsigned long flags;
> +   int status;
> +   struct talitos_private *priv;
> +   int ch;
> +
> +   priv = dev_get_drvdata(xor_chan->dev);
> +   ch = atomic_inc_return(&priv->last_chan) &
> + (priv->num_channels - 1);

Maybe a comment about why this this is duplicated from
talitos_cra_init()?  It sticks out here and I had to go looking to
find out why this channel number increments on submit.


> +   spin_lock_irqsave(&xor_chan->desc_lock, flags);
> +
> +   list_for_each_entry_safe(desc, _desc, &xor_chan->pending_q, node) {
> +   status = talitos_submit(xor_chan->dev, ch, &desc->hwdesc,
> +   talitos_release_xor, desc);
> +   if (status != -EINPROGRESS)
> +   break;
> +
> +   list_del(&desc->node);
> +   list_add_tail(&desc->node, &xor_chan->in_progress_q);
> +   }
> +
> +   spin_unlock_irqrestore(&xor_chan->desc_lock, flags);
> +}
> +
> +static void talitos_xor_run_tx_complete_actions(struct talitos_xor_desc 
> *desc,
> +   struct talitos_xor_chan *xor_chan)
> +{
> +   struct device *dev = xor_chan->dev;
> +   dma_addr_t dest, addr;
> +   unsigned int src_cnt = desc->unmap_src_cnt;
> +   unsigned int len = desc->unmap_len;
> +   enum dma_ctrl_flags flags = desc->async_tx.flags;
> +   struct dma_async_tx_descriptor *tx = &desc->async_tx;
> +
> +   /* unmap dma addresses */
> +   dest = desc->hwdesc.ptr[6].ptr;
> +   if (likely(!(flags & DMA_COMPL_SKIP_DEST_UNMAP)))
> +   dma_unmap_page(dev, dest, len, DMA_BIDIRECTIONAL);
> +
> +   desc->idx = 6 - src_cnt;
> +   if (likely(!(flags & DMA_COMPL_SKIP_SRC_UNMAP))) {
> +   while(desc->idx < 6) {

Checkpatch says:
ERROR: space required before the open parenthesis '('


> +   addr = desc->hwdesc.ptr[desc->idx++].ptr;
> +   if (addr == dest)
> +   continue;
> +   dma_unmap_page(dev, addr, len, DMA_TO_DEVICE);
> +   }
> +   }
> +
> +   /* run dependent operations */
> +   dma_run_dependencies(tx);

Here is where we run into problems if another engine accesses these
same buffers, especially on ARM v6+.

> +}
> +
> +static void talitos_release_xor(struct device *dev, struct talitos_desc 
> *hwdesc,
> +   void *context, int error)
> +{
> +   struct talitos_xor_desc *desc = context;
> +   struct talitos_xor_chan *xor_chan;
> +   dma_async_tx_callback callback;
> +   void *callback_param;
> +
> +   if (unlikely(error))
> +   dev_err(dev, "xor operation: talitos error %d\n", error);
> +
> +   xor_chan = container_of(desc->async_tx.chan, struct talitos_xor_chan,
> +   common);
> +   spin_lock_bh(&xor_chan->desc_lock);
> +   if (xor_chan->completed_cookie < desc->async_tx.cookie)
> +   xor_chan->completed_cookie = desc->async_tx.cookie;
> +

Use dma_cookie_complete().

> +   callback = desc->async_tx.callback;
> +   callback_param = desc->async_tx.callback_param;
> +
> +   if (callback) {
> +   spin_unlock_bh(&xo

Re: [PATCH v7 1/8] Talitos: Support for async_tx XOR offload

2012-09-01 Thread Dan Williams
On Thu, Aug 30, 2012 at 7:23 AM, Geanta Neag Horia Ioan-B05471
 wrote:
>
> Besides these:
> 1. As Dan Williams mentioned, you should explain why you are using
> both spin_lock_bh and spin_lock_irqsave on the same lock.

It looks like talitos_process_pending() can be called from hardirq
context via talitos_error().  So spin_lock_bh is not sufficient.


Re: [PATCH v7 0/8] Raid: enable talitos xor offload for improving performance

2012-08-29 Thread Dan Williams
On Wed, 2012-08-29 at 11:15 +, Liu Qiang-B32616 wrote:
> Hi Dan,
> 
> Ping?
> Can you apply these patches? Thanks.
> 

I'm working my way through them.

The first thing I notice is that xor_chan->desc_lock is taken
inconsistently.  I.e. spin_lock_irqsave() in talitos_process_pending()
and spin_lock_bh() everywhere else.  Have you run these patches with
lockdep?

--
Dan




Re: [PATCH v7 0/8] Raid: enable talitos xor offload for improving performance

2012-08-14 Thread Dan Williams
On Tue, Aug 14, 2012 at 2:04 AM, Liu Qiang-B32616  wrote:
> Hi Vinod,
>
> Would you like to apply this series from patch 2/8 to 7/8) in your tree?
> The link as below,
> http://patchwork.ozlabs.org/patch/176023/
> http://patchwork.ozlabs.org/patch/176024/
> http://patchwork.ozlabs.org/patch/176025/
> http://patchwork.ozlabs.org/patch/176026/
> http://patchwork.ozlabs.org/patch/176027/
> http://patchwork.ozlabs.org/patch/176028/
>

Hi, sorry for the recent silence I've been transitioning and am now
just catching up.  I'll take a look and then it's fine for these to go
through Vinod's tree.

--
Dan


Re: [PATCH 3/4] fsl-dma: support attribute of DMA_MEMORY when async_tx enabled

2012-07-10 Thread Dan Williams
On Mon, Jul 9, 2012 at 10:59 PM, Qiang Liu  wrote:
> - delete attribute of DMA_INTERRUPT because fsl-dma doesn't support
> this function, exception will be thrown if talitos is used to compute xor
> at the same time;
> - change the release process of dma descriptor for avoiding exception when
> enable config NET_DMA, release dma descriptor from 1st to last second, the
> last descriptor which is reserved in current descriptor register may not be
> completed, race condition will be raised if free current descriptor;
> - use spin_lock_bh to instead of spin_lock_irqsave for improving performance;
>
> A race condition which is raised when use both of talitos and dmaengine to
> offload xor is because napi scheduler will sync all pending requests in dma
> channels, it will affect the process of raid operations. The descriptor is
> freed which is submitted just now, but async_tx must check whether this depend
> tx descriptor is acked, there are poison contents in the invalid address,
> then BUG_ON() is thrown, so this descriptor will be freed in the next time.
>
> Cc: Dan Williams 
> Cc: Vinod Koul 
> Cc: Li Yang 
> Signed-off-by: Qiang Liu 
> ---

From the description this sounds like 3 or 4 patches.  Can you split it up?


Re: RAID5 XOR speed vs RAID6 Q speed (was Re: AVX RAID5 xor checksumming)

2012-04-06 Thread Dan Williams
[adding Boaz since he also made an attempt at fixing this]

http://marc.info/?l=linux-crypto-vger&m=13182924450&w=2

...I had meant to follow up on this, but was buried in 'isci' issues.


On Tue, Apr 3, 2012 at 4:56 PM, Jim Kukunas
 wrote:
> On Tue, Apr 03, 2012 at 11:23:16AM +0100, John Robinson wrote:
>> On 02/04/2012 23:48, Jim Kukunas wrote:
>> > On Sat, Mar 31, 2012 at 12:38:56PM +0100, John Robinson wrote:
>> [...]
>> >> I just noticed in my logs the other day (recent el5 kernel on a Core 2):
>> >>
>> >> raid5: automatically using best checksumming function: generic_sse
>> >>      generic_sse:  7805.000 MB/sec
>> >> raid5: using function: generic_sse (7805.000 MB/sec)
>> [...]
>> >> raid6: using algorithm sse2x4 (8237 MB/s)
>> >>
>> >> I was just wondering how it's possible to do the RAID6 Q calculation
>> >> faster than the RAID5 XOR calculation - or am I reading this log excerpt
>> >> wrongly?
>> >
>> > Out of curiosity, are you running with CONFIG_PREEMPT=y?
>>
>> No. Here's an excerpt from my .config:
>>
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>> CONFIG_PREEMPT_BKL=y
>> CONFIG_PREEMPT_NOTIFIERS=y
>>
>> But this is a Xen dom0 kernel, 2.6.18-308.1.1.el5.centos.plusxen. Now, a
>> non-Xen kernel (2.6.18-308.1.1.el5) says:
>> raid5: automatically using best checksumming function: generic_sse
>>     generic_sse: 11892.000 MB/sec
>> raid5: using function: generic_sse (11892.000 MB/sec)
>> raid6: int64x1   2644 MB/s
>> raid6: int64x2   3238 MB/s
>> raid6: int64x4   3011 MB/s
>> raid6: int64x8   2503 MB/s
>> raid6: sse2x1    5375 MB/s
>> raid6: sse2x2    5851 MB/s
>> raid6: sse2x4    9136 MB/s
>> raid6: using algorithm sse2x4 (9136 MB/s)
>>
>> Looks like it loses a chunk of performance running as a Xen dom0.
>>
>> Even still, 11892 MB/s for XOR vs 9136 MB/s for XOR+Q - it still seems
>> remarkable that the XOR can't be done several times faster than the Q.
>
> Taking a look at do_xor_speed, I see two issues which might be the cause
> of the disparity you reported.
>
> 0) In the RAID5 xor benchmark, we get the current jiffy, then run do_2() until
> the jiffy increments. This means we could potentially be testing for less
> than a full jiffy. The RAID6 benchmark handles this by obtaining the current
> jiffy, then calling cpu_relax() until the jiffy increments, and then running
> the test. This is addressed by my first patch.
>
> 1) The only way I could reproduce your findings of a higher throughput for
> RAID6 than for RAID5 xor checksumming was with CONFIG_PREEMPT=y. It seems
> that you encountered this while running as XEN dom0. Currently, we disable
> preemption during the RAID6 benchmark, but don't in the RAID5 benchmark.
> This is addressed by my second patch.
>
> I've added linux-crypto to the discussion as both of these patches affect
> code in crypto/
>
> Thanks.
>
>
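
For reference, a sketch of the measurement-loop shape the two quoted fixes
describe (sync to a jiffy boundary before timing, and keep preemption disabled
while timing), in the style of crypto/xor.c's do_xor_speed(); BENCH_SIZE and
struct xor_block_template are as defined there, and the real patches may
differ in detail:

static void do_xor_speed_sketch(struct xor_block_template *tmpl,
				void *b1, void *b2)
{
	unsigned long now, count = 0;
	int speed;

	/* issue 1: don't let preemption eat part of the measured interval */
	preempt_disable();

	/* issue 0: spin until a jiffy boundary so a full tick is timed */
	now = jiffies;
	while (jiffies == now)
		cpu_relax();

	now = jiffies;
	while (jiffies == now) {
		mb();
		tmpl->do_2(BENCH_SIZE, b1, b2);
		mb();
		count++;
	}

	preempt_enable();

	speed = count * HZ * (BENCH_SIZE / 1024);	/* KiB per second */
	tmpl->speed = speed;
}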


Re: [PATCH] PPC4xx: ADMA separating SoC specific functions

2010-09-30 Thread Dan Williams
[ adding Greg ]

On Thu, Sep 30, 2010 at 5:16 PM, Tirumala Marri  wrote:
>> Where  iop_adma_alloc_slots() is implemented differently between
>> iop13xx and iop3xx.  In this case why does ppc440spe-adma.h exist?  If
>> it has code specific to ppe440spe it should just live in the ppe440spe
>> C file.  If it is truly generic it should move to the base adma.c
>> implementation.  If you want to reuse a ppe440spe routine just link to
>> it.
> [Marri] That is how I started changing the code, and I see tons of warnings
> saying "Used but not defined" or "Defined but not used". How should I
> suppress these? Some functions from adma.c are used in ppc440spe-adma.c,
> and some from ppc440spe-adma.c are used in adma.c.

This is part of defining a common interface.  Maybe look at the
linkages of how the common ioat_probe() routine is used to support all
three versions of its dma hardware.

> So I created an intermediate file, ppc440spe-adma.h, with inlined
> functions. In future this will be converted into ppc4xx_adma.h, and the
> existing SoC-specific stuff will move to the ppc440spe-adma.c file.

You definitely need to be able to resolve "used but not defined" and
"defined but not used" warnings before tackling a driver conversion
like this.  In light of this comment I wonder if it would be
appropriate to submit your original driver, that just duplicated
routines from the ppc440spe driver, to the -staging tree.  Then it
would be available for someone familiar with driver conversions to
take a shot at unifying.

Greg, is this an appropriate use of -staging?

--
Dan


Re: [PATCH] PPC4xx: ADMA separating SoC specific functions

2010-09-30 Thread Dan Williams
On Thu, Sep 30, 2010 at 12:08 PM, Wolfgang Denk  wrote:
[snip other valid review comments]
>
> This is a header file, yet you add here literally thousands of lines of
> code.

Yes, these functions are entirely too large to be inlined.  It looks
like you are trying to borrow too heavily from the iop-adma model.
The differences between iop13xx and iop33x from a adma perspective are
just in descriptor format and channel capabilities.  If you look at
the routines implemented in:
arch/arm/include/asm/hardware/iop3xx-adma.h
arch/arm/mach-iop13xx/include/mach/adma.h
...they are just simple helpers that abstract the descriptor details.
For example:

iop_adma_prep_dma_xor()
{
[snip]
spin_lock_bh(&iop_chan->lock);
slot_cnt = iop_chan_xor_slot_count(len, src_cnt, &slots_per_op);
sw_desc = iop_adma_alloc_slots(iop_chan, slot_cnt, slots_per_op);
if (sw_desc) {
grp_start = sw_desc->group_head;
iop_desc_init_xor(grp_start, src_cnt, flags);
iop_desc_set_byte_count(grp_start, iop_chan, len);
iop_desc_set_dest_addr(grp_start, iop_chan, dma_dest);
sw_desc->unmap_src_cnt = src_cnt;
sw_desc->unmap_len = len;
sw_desc->async_tx.flags = flags;
while (src_cnt--)
iop_desc_set_xor_src_addr(grp_start, src_cnt,
  dma_src[src_cnt]);
}
spin_unlock_bh(&iop_chan->lock);

return sw_desc ? &sw_desc->async_tx : NULL;
}

Where  iop_adma_alloc_slots() is implemented differently between
iop13xx and iop3xx.  In this case why does ppc440spe-adma.h exist?  If
it has code specific to ppe440spe it should just live in the ppe440spe
C file.  If it is truly generic it should move to the base adma.c
implementation.  If you want to reuse a ppe440spe routine just link to
it.

> Selecting the architecture at build time is bad as it prevents using a
> sinlge kernel image across a wide range of boards.  You only replied
> "We select the architecture at build time." without any explanation if
> there is a pressing technical reason to do it this way, or if this was
> just a arbitrary decision.

I agree I have yet to see any indication that this driver needs to
have an architecture selected at build time.

A potential compromise a first step would be to have a common C file
that is shared between two driver modules until such point that they
can be unified into a common module.

--
Dan


Re: [PATCH v1 1/4] PPC4xx: Generalizing ADMA driver modifications

2010-09-23 Thread Dan Williams

On 9/23/2010 3:44 PM, Tirumala Marri wrote:


Did you look at this changelog before sending?  It just deletes 4000
lines of code??

[Marri] The reason I have to send it in a different file is the size of the
patch.
There seems to be an issue with patches of 200k or more.


Read the rest of what I wrote:


Moving and renaming code in one patch makes it very difficult to verify
the result.  When generalizing code the first thing I want to see with a
very quick glance at the patch(es) is that the existing implementation
is not harmed.  One way to go about this is to first identify the
portions of existing code that you want to reuse in your driver and the
pieces that are truly ppc440spe specific.  Move the ppc440spe pieces to
their own file (get this reviewed and approved by the ppc440spe
authors).  The remaining code in adma.c will be assumed generic.  You
can then have another patch to do a simple s/ppc440spe/ppc4xx/ in adma.c
(no other logic changes or code movement).  Then you can introduce your
ppc460ex unique implementation that calls into adma.c.


The patch would not be so large if you leave the existing code where it 
is and move the implementation specific pieces to their own file.


--
Dan


Re: [PATCH v1 3/4] PPC4xx: New file with SoC specific functions

2010-09-23 Thread Dan Williams

On 9/23/2010 3:11 PM, tma...@apm.com wrote:

From: Tirumala Marri

This patch creates new file with SoC dependent functions.

Signed-off-by: Tirumala R Marri
---
V1:
   * Remove all 440SPe specific references.


Maybe it renames ppc440spe to ppc4xx but it adds things like...


+#if defined(CONFIG_440SPe) || defined(CONFIG_440SP)
+   np = of_find_compatible_node(NULL, NULL, "ibm,i2o-440spe");
+#endif


...in the code.  Which is 1) not generic and 2) I suspect causes a 
compile warning for using an uninitialized variable.



+   if (!np) {
+   pr_err("%s: can't find I2O device tree node\n",
+  __func__);
+   ret = -ENODEV;
+   goto err_req2;
+   }


It looks to me like the common code will need to have a few build 
dependent helper routines as it appears one instance of the driver 
cannot simultaneously support 440sp, 440spe, and 460ex.


--
Dan


Re: [PATCH v1 1/4] PPC4xx: Generalizing ADMA driver modifications

2010-09-23 Thread Dan Williams

On 9/23/2010 3:10 PM, tma...@apm.com wrote:

From: Tirumala Marri

This patch generalizes the existing drver/dma/ppc4xx/adma.c, so that
common code can be shared between different similar DMA engine
drivers in other SoCs. Also Makefile and Kconfig changed to accommodate
PPC4XX.

Signed-off-by: Tirumala R Marri
---
V1:
   * No change.
---
  arch/powerpc/include/asm/async_tx.h |4 +-
  drivers/dma/Kconfig |6 +-
  drivers/dma/Makefile|2 +-
  drivers/dma/ppc4xx/Makefile |2 +-
  drivers/dma/ppc4xx/adma.c   | 4437 +++
  drivers/dma/ppc4xx/adma.h   |   92 +-
  6 files changed, 354 insertions(+), 4189 deletions(-)


Did you look at this changelog before sending?  It just deletes 4000 
lines of code??


Moving and renaming code in one patch makes it very difficult to verify 
the result.  When generalizing code the first thing I want to see with a 
very quick glance at the patch(es) is that the existing implementation 
is not harmed.  One way to go about this is to first identify the 
portions of existing code that you want to reuse in your driver and the 
pieces that are truly ppc440spe specific.  Move the ppc440spe pieces to 
their own file (get this reviewed and approved by the ppc440spe 
authors).  The remaining code in adma.c will be assumed generic.  You 
can then have another patch to do a simple s/ppc440spe/ppc4xx/ in adma.c 
(no other logic changes or code movement).  Then you can introduce your 
ppc460ex unique implementation that calls into adma.c.


I don't want to see patches along the lines of "rename 
drivers/dma/ppc4xx/adma.c to drivers/dma/ppc4xx/ppc4xx-adma.c" because 
that is just redundant.  Assume that the existing generic file names are 
where the common code will lie and then add hw-implementation specific 
files to call into that base.


Another rule is that the conversion should be bisectable at every step, 
I should be able to apply each patch in the series and still have a 
functional/runnable result.


--
Dan


Re: [PATCH 2/2] PPC4xx: Merge xor.h and dma.h into onefile ppc440spe-dma.h

2010-09-23 Thread Dan Williams
On Fri, Sep 17, 2010 at 6:42 PM,   wrote:
> From: Tirumala Marri 
> This patch combines drivers/dma/ppc4xx/xor.h and drivers/dma/ppc4xx/dma.h
> into drivers/dma/ppc4xx/ppc440spe-dma.h.
>

Is this just code churn, or do we gain anything by combining these
header files?  Don't add "ppc440spe-" back to the prefix, we're
already in the ppc4xx directory, unless the file will only contain
definitions that are relevant to ppc440spe.

--
Dan


Re: [PATCH 1/2] PPC4xx: Generelizing drivers/dma/ppc4xx/adma.c

2010-09-23 Thread Dan Williams

On 9/17/2010 6:42 PM, tma...@apm.com wrote:

From: Tirumala Marri

This patch generalizes the existing drivers/dma/ppc4xx/adma.c, so that
common code can be shared between different similar DMA engine
drivers in other SoCs.

Signed-off-by: Tirumala R Marri
---
  drivers/dma/ppc4xx/adma.c| 4370 +++---
  drivers/dma/ppc4xx/adma.h|  116 +-
  drivers/dma/ppc4xx/ppc4xx-adma.h | 4020 +++
  3 files changed, 4357 insertions(+), 4149 deletions(-)
  create mode 100644 drivers/dma/ppc4xx/ppc4xx-adma.h



Will both versions of this driver exist in the same kernel build?  For 
example the iop-adma driver supports iop13xx and iop3xx, but we select 
the architecture at build time?  Or, as I assume in this case, will the 
two (maybe more?) ppc4xx adma drivers all be built in the same image, 
more like ioatdma?


In the latter case I would recommend a file structure like:

drivers/dma/ppc4xx/adma.c
drivers/dma/ppc4xx/adma_440spe.c
drivers/dma/ppc4xx/adma_460ex.c

With patches to move the chipset specific pieces to their own file. 
Minimizing the code churn in adma.c, or at least showing a progression 
of what is unique and needs to be moved.


This would be similar to how ioatdma is structured and compiles a single 
driver to cover the three major hardware revisions.


--
Dan


Re: [PATCH] Adding ADMA support for PPC460EX DMA engine.

2010-07-23 Thread Dan Williams
On Thu, Jul 22, 2010 at 11:15 PM, Stefan Roese  wrote:
> Hi Marri,
>
> On Friday 23 July 2010 02:57:18 tma...@amcc.com wrote:
>> From: Tirumala Marri 
>>
>>   This patch will add ADMA support for DMA engine and HW offload for
>>   XOR/ADG (RAID-5/6) functionalities.
>>   1. It supports memcpy, xor, GF(2) based RAID-6.
>>   2. It supports interrupt based DMA completions.
>>   3. Also supports memcpy in RAID-1 case.
>>
>>   Kernel version: 2.6.35-rc5
>>
>>   Testing:
>>     Created RAID-5/6 arrays using mdadm.
>>     And ran raw IO and filesystem IO to the RAID array.
>>     Chunk size 4k,64k was tested.
>>     RAID rebuild , disk fail, resync tested.
>>
>>   File names:
>>     This code is similar to ppc440spe . So I named the files as
>>     drivers/dma/ppc4xx/adma1.c and drivers/dma/ppc4xx/adma1.h
>
> As you describe above, a lot of the code seems to be copied from
> drivers/dma/ppc4xx/adma.c/h. Wouldn't it make more sense to factor out the
> common code instead of duplicating it?
>

Yes, and you might look to drivers/dma/iop-adma.c as an example of a
way to support similar hardware with a single code base.

--
Dan


Re: [PATCH v2 2/2] Crypto: Talitos: Support for Async_tx XOR offload

2009-12-18 Thread Dan Williams
On Fri, Dec 18, 2009 at 8:02 AM, Li Yang-R58472  wrote:
>
>>Subject: Re: [PATCH v2 2/2] Crypto: Talitos: Support for
>>Async_tx XOR offload
>>
>>Ira W. Snyder wrote:
>>> Yes, I have used the device_prep_dma_interrupt()
>>functionality quite a
>>> while back. However, I found it to be pretty much useless.
>>
>>The specific case it is needed for Talitos/raid is a channel
>>switch interrupt.  The interrupt causes the cleanup operation
>>to be run which will kick off any pending dependent operations
>>on the xor channel.  In the raid case we only have callbacks
>>at the end of a chain, so we need the interrupt to kick the
>>engine in an operation chain like
>>xor->copy->xor->callback.
>
> I am wondering if we can use more callbacks to kick off pending dependent
> operations?
> Like xor->callback->copy->callback->xor->callback?
>

No, the callback field is reserved for clients of the api.  What you want is:

xor->cleanupT->copy->cleanupF->xor->cleanupT->callback

Where cleanupT is the Talitos descriptor cleanup routine and cleanupF
is from fsldma.  The assumption is that the interrupt kicks the
cleanup routine and that calls dma_run_dependencies().

--
Dan


Re: [PATCH v2 2/2] Crypto: Talitos: Support for Async_tx XOR offload

2009-12-17 Thread Dan Williams
On Thu, Dec 17, 2009 at 11:45 AM, Kumar Gala  wrote:
>> The specific case it is needed for Talitos/raid is a channel switch 
>> interrupt.  The interrupt causes the cleanup operation to be run which will 
>> kick off any pending dependent operations on the xor channel.  In the raid 
>> case we only have callbacks at the end of a chain, so we need the interrupt 
>> to kick the engine in an operation chain like xor->copy->xor->callback.
>
> Ok, I'm still confused as to how the DMA interrupt interacts with the 
> Talitos/raid side of things.  The should be completely independent (separate 
> interrupts, separate IP blocks).
>

To keep hardware implementation details out of md/raid the async_tx
api provides support for managing cross-channel dependency chains.
When the raid5 code submits a xor->copy->xor chain the api prepares
all the descriptors across all the involved channels but then delays
submission as needed to maintain ordering.  So at a minimum we need
two interrupts in this scenario one from Talitos to kick the
submission of the copy-descriptor to fsldma when the first xor
completes, and another one to kick the submission of the second
xor-descriptor on Talitos when fsldma completes the copy.  Needless to
say it is more efficient when a channel has all the capabilities, but
this channel switch mechanism has proven effective on iop3xx and
ppc4xx.
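
For reference, a sketch of how a client expresses such a chain through the
async_tx API (written against the async_submit_ctl interface of
include/linux/async_tx.h; the flags and setup are illustrative only, not a
drop-in for the raid5 code):

static void chain_sketch(struct page *dest, struct page **srcs, int src_cnt,
			 struct page *copy_dest, size_t len,
			 dma_async_tx_callback done, void *ctx)
{
	struct async_submit_ctl submit;
	struct dma_async_tx_descriptor *tx;

	/* first xor, no dependency */
	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, NULL, NULL, NULL, NULL);
	tx = async_xor(dest, srcs, 0, src_cnt, len, &submit);

	/* the copy depends on the xor; the fence says it consumes its result */
	init_async_submit(&submit, ASYNC_TX_FENCE, tx, NULL, NULL, NULL);
	tx = async_memcpy(copy_dest, dest, 0, 0, len, &submit);

	/* the second xor depends on the copy; only the end of the chain
	 * carries the client callback */
	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST | ASYNC_TX_FENCE,
			  tx, done, ctx, NULL);
	tx = async_xor(dest, srcs, 0, src_cnt, len, &submit);

	async_tx_issue_pending_all();
}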

--
Dan


Re: [PATCH v2 2/2] Crypto: Talitos: Support for Async_tx XOR offload

2009-12-17 Thread Dan Williams

Ira W. Snyder wrote:

Yes, I have used the device_prep_dma_interrupt() functionality quite a
while back. However, I found it to be pretty much useless.


The specific case it is needed for Talitos/raid is a channel switch 
interrupt.  The interrupt causes the cleanup operation to be run which 
will kick off any pending dependent operations on the xor channel.  In 
the raid case we only have callbacks at the end of a chain, so we need 
the interrupt to kick the engine in an operation chain like 
xor->copy->xor->callback.


--
Dan


Re: [PATCH v2 2/2] Crypto: Talitos: Support for Async_tx XOR offload

2009-12-16 Thread Dan Williams

Kumar Gala wrote:

Changes with respect to v1 as per comments received
o. Rebased to linux-next as of 20091216
o. The selection is based exclusive of fsldma
o. Introduced a new Kernel Configuration variable
  *. This enables selecting the Cryptographic functionality
 of Talitos along with fsldma.
  *. Disables the XOR parity calculation offload, if fsldma enabled
 either as kernel in-built or as a module
  *. Once the inter-operability with fsldma is resolved, this option
 can be removed

wait, why can't the interoperability bug be fixed in the first place?


I agree w/Kim.  We need to better understand what the bug is and how to 
reproduce it so we can get to the root cause.

Paper taping over it by disabling fsldma is not the right solution.


Hopefully this prompts fsldma authors to get involved because the 
interoperability issue has been out there without comment*, just 
band-aids, since October.


--
Dan

* well one comment from Ira saying the interrupt functionality worked 
for him.



Re: [PATCH v0] Crypto: Talitos: re-initialize async_tx descriptors

2009-12-15 Thread Dan Williams

Suresh Vishnu-B05022 wrote:
 > On Mon, Dec 14, 2009 at 6:33 AM, Vishnu Suresh  
wrote:

 > > The async_tx descriptors contain dangling pointers.
 > > Hence, re-initialize them to NULL before use.
 > >
 > > Signed-off-by: Vishnu Suresh 
 > > ---
 > > o. Rebased to linux-next as of 20091214
 > >
 > >  drivers/crypto/talitos.c |3 +++
 > >  1 files changed, 3 insertions(+), 0 deletions(-)
 > >
 > > diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
 > > index 87f06be..9e261c6 100644
 > > --- a/drivers/crypto/talitos.c
 > > +++ b/drivers/crypto/talitos.c
 > > @@ -952,6 +952,9 @@ static struct dma_async_tx_descriptor * 
talitos_prep_dma_xor(

 > >return NULL;
 > >}
 > >dma_async_tx_descriptor_init(&new->async_tx, &xor_chan->common);
 > > +   new->async_tx.parent = NULL;
 > > +   new->async_tx.next = NULL;
 > > +
 > >
 > >desc = &new->hwdesc;
 > >/* Set destination: Last pointer pair */

 > These two values are owned by the async_tx api, drivers are not
 > supposed to touch them.
I have sent this patch and the similar one for fsldma separately,
so that if the changes are needed and can be done in 
dma_async_tx_descriptor_init(),
these patches can be ignored. 
 > Both iop_adma and the new ppc4xx driver

 > (which use the async_tx channel switching capability) get away without
 > touching these fields which makes me suspect there is a
 > misunderstanding/bug somewhere else in the talitos implementation.

This bug does not occur on all platforms. The occurrence is random.
It occurs when a channel switch between two different devices is
present.
The same initialization is required in the case of fsldma as well.
In the case of fsldma/talitos XOR, there are two DMA channels (on the same
device) and a single XOR channel (on another device).


When used without fsldma, this driver works fine.
Do iop_adma and ppc4xx work in a similar environment?


Yes, iop_adma when running on iop3xx hardware has one xor channel and 
two memcpy channels.


The bug may very well be in the fsldma driver, but the workarounds are 
just band aids.  I would be more comfortable with dropping the three 
workaround patches and simply adding "depends on !FSL_DMA" to the 
CRYPTO_DEV_TALITOS option until the true fix can be developed.


--
Dan



Re: [PATCH v0] Crypto: Talitos: re-initialize async_tx descriptors

2009-12-14 Thread Dan Williams
On Mon, Dec 14, 2009 at 6:33 AM, Vishnu Suresh  wrote:
> The async_tx descriptors contain dangling pointers.
> Hence, re-initialize them to NULL before use.
>
> Signed-off-by: Vishnu Suresh 
> ---
> o. Rebased to linux-next as of 20091214
>
>  drivers/crypto/talitos.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
> index 87f06be..9e261c6 100644
> --- a/drivers/crypto/talitos.c
> +++ b/drivers/crypto/talitos.c
> @@ -952,6 +952,9 @@ static struct dma_async_tx_descriptor * 
> talitos_prep_dma_xor(
>                return NULL;
>        }
>        dma_async_tx_descriptor_init(&new->async_tx, &xor_chan->common);
> +       new->async_tx.parent = NULL;
> +       new->async_tx.next = NULL;
> +
>
>        desc = &new->hwdesc;
>        /* Set destination: Last pointer pair */

These two values are owned by the async_tx api, drivers are not
supposed to touch them.  Both iop_adma and the new ppc4xx driver
(which use the async_tx channel switching capability) get away without
touching these fields which makes me suspect there is a
misunderstanding/bug somewhere else in the talitos implementation.
Also that dma_async_tx_descriptor_init() is unexpected in the hot
path, it's only needed at initial descriptor allocation.  End result I
think this driver needs some more time to brew.
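
For illustration, the expected pattern is to initialize each descriptor once,
when channel resources are allocated, and only recycle it from the free list
in the prep path; a rough sketch using the structures from the patch (the
tx_submit hook name here is assumed, not taken from the patch):

static int talitos_xor_alloc_chan_resources_sketch(struct talitos_xor_chan *xor_chan,
						   int ndesc)
{
	int i;

	for (i = 0; i < ndesc; i++) {
		struct talitos_xor_desc *desc = kzalloc(sizeof(*desc), GFP_KERNEL);

		if (!desc)
			return -ENOMEM;

		/* one-time init, outside the prep (hot) path */
		dma_async_tx_descriptor_init(&desc->async_tx, &xor_chan->common);
		desc->async_tx.tx_submit = talitos_xor_tx_submit; /* assumed name */
		list_add_tail(&desc->node, &xor_chan->free_desc);
	}

	return i;
}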

--
Dan


Re: [PATCH v0 2/2] Crypto: Talitos: Support for Async_tx XOR offload

2009-10-29 Thread Dan Williams
On Wed, Oct 14, 2009 at 11:41 PM, Vishnu Suresh  wrote:
[..]
> diff --git a/drivers/crypto/Kconfig b/drivers/crypto/Kconfig
> index b08403d..343e578 100644
> --- a/drivers/crypto/Kconfig
> +++ b/drivers/crypto/Kconfig
> @@ -192,6 +192,8 @@ config CRYPTO_DEV_TALITOS
>        select CRYPTO_ALGAPI
>        select CRYPTO_AUTHENC
>        select HW_RANDOM
> +       select DMA_ENGINE
> +       select ASYNC_XOR

You need only select DMA_ENGINE.  ASYNC_XOR is selected by its users.

>        depends on FSL_SOC
>        help
>          Say 'Y' here to use the Freescale Security Engine (SEC)
> diff --git a/drivers/crypto/talitos.c b/drivers/crypto/talitos.c
> index c47ffe8..84819d4 100644
> --- a/drivers/crypto/talitos.c
> +++ b/drivers/crypto/talitos.c
[..]
> +static void talitos_process_pending(struct talitos_xor_chan *xor_chan)
> +{
> +       struct talitos_xor_desc *desc, *_desc;
> +       unsigned long flags;
> +       int status;
> +
> +       spin_lock_irqsave(&xor_chan->desc_lock, flags);
> +
> +       list_for_each_entry_safe(desc, _desc, &xor_chan->pending_q, node) {
> +               status = talitos_submit(xor_chan->dev, &desc->hwdesc,
> +                                       talitos_release_xor, desc);
> +               if (status != -EINPROGRESS)
> +                       break;
> +
> +               list_del(&desc->node);
> +               list_add_tail(&desc->node, &xor_chan->in_progress_q);
> +       }
> +
> +       spin_unlock_irqrestore(&xor_chan->desc_lock, flags);
> +}

The driver uses spin_lock_bh everywhere else which is either a bug, or
this code is being overly protective.  In any event lockdep will
rightly complain about this.  The API and its users do not submit
operations in hard-irq context.

> +
> +static void talitos_release_xor(struct device *dev, struct talitos_desc 
> *hwdesc,
> +                               void *context, int error)
> +{
> +       struct talitos_xor_desc *desc = context;
> +       struct talitos_xor_chan *xor_chan;
> +       dma_async_tx_callback callback;
> +       void *callback_param;
> +
> +       if (unlikely(error)) {
> +               dev_err(dev, "xor operation: talitos error %d\n", error);
> +               BUG();
> +       }
> +
> +       xor_chan = container_of(desc->async_tx.chan, struct talitos_xor_chan,
> +                               common);
> +       spin_lock_bh(&xor_chan->desc_lock);
> +       if (xor_chan->completed_cookie < desc->async_tx.cookie)
> +               xor_chan->completed_cookie = desc->async_tx.cookie;
> +
> +       callback = desc->async_tx.callback;
> +       callback_param = desc->async_tx.callback_param;
> +
> +       if (callback) {
> +               spin_unlock_bh(&xor_chan->desc_lock);
> +               callback(callback_param);
> +               spin_lock_bh(&xor_chan->desc_lock);
> +       }

You do not need to unlock to call the callback.  Users of the API
assume that they are called in bh context and that they are not
allowed to submit new operations directly from the callback (this
simplifies the descriptor cleanup routines in other drivers).

> +
> +       list_del(&desc->node);
> +       list_add_tail(&desc->node, &xor_chan->free_desc);
> +       spin_unlock_bh(&xor_chan->desc_lock);
> +       if (!list_empty(&xor_chan->pending_q))
> +               talitos_process_pending(xor_chan);
> +}

It appears this routine is missing a call to dma_run_dependencies()?
This is needed if another channel is handling memory copy offload.  I
assume this is the case due to your other patch to fsldma.  See
iop_adma_run_tx_complete_actions().  If this xor resource can also
handle copy operations that would be more efficient as it eliminates
channel switching.  See the ioatdma driver and its use of
ASYNC_TX_DISABLE_CHANNEL_SWITCH.

> +static struct dma_async_tx_descriptor * talitos_prep_dma_xor(
> +                       struct dma_chan *chan, dma_addr_t dest, dma_addr_t 
> *src,
> +                       unsigned int src_cnt, size_t len, unsigned long flags)
> +{
> +       struct talitos_xor_chan *xor_chan;
> +       struct talitos_xor_desc *new;
> +       struct talitos_desc *desc;
> +       int i, j;
> +
> +       BUG_ON(unlikely(len > TALITOS_MAX_DATA_LEN));
> +
> +       xor_chan = container_of(chan, struct talitos_xor_chan, common);
> +
> +       spin_lock_bh(&xor_chan->desc_lock);
> +       if (!list_empty(&xor_chan->free_desc)) {
> +               new = container_of(xor_chan->free_desc.next,
> +                                  struct talitos_xor_desc, node);
> +               list_del(&new->node);
> +       } else {
> +               new = talitos_xor_alloc_descriptor(xor_chan, GFP_KERNEL);
> +       }
> +       spin_unlock_bh(&xor_chan->desc_lock);
> +
> +       if (!new) {
> +               dev_err(xor_chan->common.device->dev,
> +                       "No free memory for XOR DMA descriptor\n");
> +               return NULL;
> +       }
> +       dma_async_tx_descriptor_init(&new->async_tx, &xor_chan->common);
> +
> +  

Re: [PATCH v0 1/2] DMA: fsldma: Disable DMA_INTERRUPT when Async_tx enabled

2009-10-15 Thread Dan Williams
[ added Leo and Timur to the Cc ]

On Wed, Oct 14, 2009 at 11:41 PM, Vishnu Suresh  wrote:
> This patch disables the use of DMA_INTERRUPT capability with Async_tx
>
> The fsldma produces a null transfer with DMA_INTERRUPT
> capability when used with Async_tx. When RAID devices queue
> a transaction via Async_tx, this  results in a hang.
>
> Signed-off-by: Vishnu Suresh 
> ---
>  drivers/dma/fsldma.c |    6 ++
>  1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/dma/fsldma.c b/drivers/dma/fsldma.c
> index 296f9e7..66d9b39 100644
> --- a/drivers/dma/fsldma.c
> +++ b/drivers/dma/fsldma.c
> @@ -1200,7 +1200,13 @@ static int __devinit of_fsl_dma_probe(struct of_device 
> *dev,
>                                                - fdev->reg.start + 1);
>
>        dma_cap_set(DMA_MEMCPY, fdev->common.cap_mask);
> +#ifndef CONFIG_ASYNC_CORE
> +       /*
> +        * The DMA_INTERRUPT async_tx is a NULL transfer, which will
> +        * triger a PE interrupt.
> +        * trigger a PE interrupt.
>        dma_cap_set(DMA_INTERRUPT, fdev->common.cap_mask);
> +#endif
>        dma_cap_set(DMA_SLAVE, fdev->common.cap_mask);
>        fdev->common.device_alloc_chan_resources = 
> fsl_dma_alloc_chan_resources;
>        fdev->common.device_free_chan_resources = fsl_dma_free_chan_resources;

You are basically saying that fsl_dma_prep_interrupt() is buggy.  Can
that routine be fixed rather than this piecemeal solution?  If it
cannot be fixed (i.e. hardware issue) then fsl_dma_prep_interrupt()
should just be disabled/deleted altogether.
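
For illustration, deleting it altogether would mean simply not advertising the
capability in the probe path, unconditionally, rather than keying it off an
async_tx config symbol; a sketch against the quoted hunk:

	dma_cap_set(DMA_MEMCPY, fdev->common.cap_mask);
	/*
	 * DMA_INTERRUPT deliberately not set: the NULL transfer used to
	 * implement it is unreliable here and hangs RAID dependency
	 * chains, so the capability is not exposed at all.
	 */
	dma_cap_set(DMA_SLAVE, fdev->common.cap_mask);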

--
Dan