Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
On Tue, May 26, 2015 at 09:31:03AM -0700, Dan Williams wrote: If you mean, give me a hand, you can start there, then yeah, I can do that. I'm not happy about not having had the time to do this rework myself. Linux is better off with this api deprecated. You're not talking about deprecating it, you're talking about removing it entirely. True, and adding more users makes that removal more difficult. I'm willing to help out on the design and review for this work, I just can't commit to doing the implementation and testing. I think it looks something like this: At init time the raid456 driver probes for offload resources It can discover several scenarios: 1/ the ioatdma case: raid channels that have all the necessary operations (copy, xor/pq, xor/pq check). In this case we'll never need to perform a channel switch. Potentially the cpu never touches the stripe cache in this case and we can maintain a static dma mapping for the entire lifespan of a struct stripe_head. 2/ the channel switch case: All the necessary offload resources are available but span multiple devices. In this case we need to wait for channel1 to complete an operation before channel2 can start. This case is complicated by the fact that different channels may need their own dma mappings. In the simplest case channels can share the same mapping and raid456 needs to wait for channel completions. I think we can do a better job than the async_tx api here as raid456 should probably poll for completions after each stripe processing batch. Taking an interrupt per channel-switch event seems like excessive overhead. 3/ the co-op case: We have a xor/pq offload resource, but copy and check operations require the cpu to touch the stripe cache. In this case we need to use the dma_sync_*_for_cpu()/dma_sync_*_for_device() to pass buffers back and forth between device and cpu ownership. This shares some of the complexity of waiting for completions with scenario 2. Which scenario does your implementation fall into? Maybe we can focus on that one and leave the other scenarios for other dmaengine maintainers to jump in an implement? From my limited understanding of RAID and PQ computations, it would be 3 with a twist. Our hardware controller supports xor and PQ, but the checks and recovering data is not supported (we're not able to offload async_mult and async_sum_product). Maxime -- Maxime Ripard, Free Electrons Embedded Linux, Kernel and Android engineering http://free-electrons.com signature.asc Description: Digital signature
Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
On 05/26/2015 07:31 PM, Dan Williams wrote: [ adding Boaz as this discussion has implications for ore_raid ] You're not talking about deprecating it, you're talking about removing it entirely. True, and adding more users makes that removal more difficult. I'm willing to help out on the design and review for this work, I just can't commit to doing the implementation and testing. Hi So for ore_raid, Yes it uses both xor and pq functions, and I expect that to work also after the API changes. That said, I never really cared for the HW offload engines of these APIs. Actually I never met any. On a modern machine I always got the DCE/MMX kick in or one of the other CPU variants. With preliminary testing of XOR I got an almost memory speed for xor (read n pages + write one) So with multy-core CPUs I fail to see how an HW do better, memory caching and all. The PQ was not that far behind. All I need is an abstract API that gives me the best implementation on any ARCH / configuration. Actually the async_tx API is a pain and a sync API would make things simple. I do not use the concurrent async submit, wait later at all. I submit then wait. So anything you change this to, as long as you keep the wonderful dce implementation is good with me, just that the code keeps running after the new API is fine with me. (And the common structures between XOR and PQ was also nice, but I can also use a union, its always either or in ore_raid) Once you make API changes and modify code, CC me I'll run tests good luck Boaz -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
On Mon, May 18, 2015 at 10:06:55AM -0700, Dan Williams wrote: On Mon, May 18, 2015 at 2:14 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote: On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote: On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi, This serie refactors the mv_xor in order to support the latest Armada 38x features, including the PQ support in order to offload the RAID6 PQ operations. Not all the PQ operations are supported by the XOR engine, so we had to introduce new async_tx flags in the process to identify un-supported operations. Please note that this is currently not usable because of a possible regression in the RAID stack in 4.1 that is being discussed at the moment here: https://lkml.org/lkml/2015/5/7/527 This is problematic as async_tx is a wart on the dmaengine subsystem and needs to be deprecated, I just have yet to find the time to do that work. It turns out it was a mistake to hide the device details from md, it should be explicitly managing the dma channels, not relying on a abstraction api. The async_tx api usage of the dma-mapping api is broken in that it relies on overlapping mappings of the same address. This happens to work on x86, but on arm it needs explicit non-overlapping mappings. I started the work to reference count dma-mappings in 3.13, and we need to teach md to use dmaengine_unmap_data explicitly. Yielding dma channel management to md also results in a more efficient implementation as we can dma_map() the stripe cache once rather than per-io. The async_tx_ack() disaster can also go away when md is explicitly handling channel switching. Even though I'd be very much in favor of deprecating / removing async_tx, is it something likely to happen soon? Not unless someone else takes it on, I'm actively asking for help. I remember discussing this with Vinod at Plumbers back in October, but haven't seen anything since then. Right, help! :) If not, I think that we shouldn't really hold back patches to async_tx, even though we know than in a year from now, it's going to be gone. We definitely should block new usages, because they make a bad situation worse. Russell already warned that the dma_mapping api abuse could lead to data corruption on ARM (speculative pre-fetching). We need to mark ASYNC_TX_DMA as depends on !ARM or even depends on BROKEN until we can get this resolved. I'm not sure what the issues exactly are with async_tx and ARM, but these patches have been tested on ARM and are working quite well. https://lkml.org/lkml/2011/7/8/363 What I'm doing here is merely using the existing API, I'm not making it worse, just using the API that is used by numerous drivers already. So I'm not sure this is really reasonable to ask for such a huge rework (with a huge potential of regressions) before merging my patches. It happens. https://lwn.net/Articles/641443/ It really depends on what you mean by help. If you mean undertake all by yourself the removal of async tx, then no, sorry, I won't, especially when you ask to do that for a patch that just enables a feature of an API already used on that platform. If you mean, give me a hand, you can start there, then yeah, I can do that. I'm not happy about not having had the time to do this rework myself. Linux is better off with this api deprecated. You're not talking about deprecating it, you're talking about removing it entirely. Maxime -- Maxime Ripard, Free Electrons Embedded Linux, Kernel and Android engineering http://free-electrons.com signature.asc Description: Digital signature
Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
[ adding Boaz as this discussion has implications for ore_raid ] On Tue, May 26, 2015 at 2:45 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: On Mon, May 18, 2015 at 10:06:55AM -0700, Dan Williams wrote: On Mon, May 18, 2015 at 2:14 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote: On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote: On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi, This serie refactors the mv_xor in order to support the latest Armada 38x features, including the PQ support in order to offload the RAID6 PQ operations. Not all the PQ operations are supported by the XOR engine, so we had to introduce new async_tx flags in the process to identify un-supported operations. Please note that this is currently not usable because of a possible regression in the RAID stack in 4.1 that is being discussed at the moment here: https://lkml.org/lkml/2015/5/7/527 This is problematic as async_tx is a wart on the dmaengine subsystem and needs to be deprecated, I just have yet to find the time to do that work. It turns out it was a mistake to hide the device details from md, it should be explicitly managing the dma channels, not relying on a abstraction api. The async_tx api usage of the dma-mapping api is broken in that it relies on overlapping mappings of the same address. This happens to work on x86, but on arm it needs explicit non-overlapping mappings. I started the work to reference count dma-mappings in 3.13, and we need to teach md to use dmaengine_unmap_data explicitly. Yielding dma channel management to md also results in a more efficient implementation as we can dma_map() the stripe cache once rather than per-io. The async_tx_ack() disaster can also go away when md is explicitly handling channel switching. Even though I'd be very much in favor of deprecating / removing async_tx, is it something likely to happen soon? Not unless someone else takes it on, I'm actively asking for help. I remember discussing this with Vinod at Plumbers back in October, but haven't seen anything since then. Right, help! :) If not, I think that we shouldn't really hold back patches to async_tx, even though we know than in a year from now, it's going to be gone. We definitely should block new usages, because they make a bad situation worse. Russell already warned that the dma_mapping api abuse could lead to data corruption on ARM (speculative pre-fetching). We need to mark ASYNC_TX_DMA as depends on !ARM or even depends on BROKEN until we can get this resolved. I'm not sure what the issues exactly are with async_tx and ARM, but these patches have been tested on ARM and are working quite well. https://lkml.org/lkml/2011/7/8/363 What I'm doing here is merely using the existing API, I'm not making it worse, just using the API that is used by numerous drivers already. So I'm not sure this is really reasonable to ask for such a huge rework (with a huge potential of regressions) before merging my patches. It happens. https://lwn.net/Articles/641443/ It really depends on what you mean by help. If you mean undertake all by yourself the removal of async tx, then no, sorry, I won't, especially when you ask to do that for a patch that just enables a feature of an API already used on that platform. ...a potentially broken feature. Are you sure that this speculative prefetch problem does not affect your implementation? If you mean, give me a hand, you can start there, then yeah, I can do that. I'm not happy about not having had the time to do this rework myself. Linux is better off with this api deprecated. You're not talking about deprecating it, you're talking about removing it entirely. True, and adding more users makes that removal more difficult. I'm willing to help out on the design and review for this work, I just can't commit to doing the implementation and testing. I think it looks something like this: At init time the raid456 driver probes for offload resources It can discover several scenarios: 1/ the ioatdma case: raid channels that have all the necessary operations (copy, xor/pq, xor/pq check). In this case we'll never need to perform a channel switch. Potentially the cpu never touches the stripe cache in this case and we can maintain a static dma mapping for the entire lifespan of a struct stripe_head. 2/ the channel switch case: All the necessary offload resources are available but span multiple devices. In this case we need to wait for channel1 to complete an operation before channel2 can start. This case is complicated by the fact that different channels may
Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
On Mon, May 18, 2015 at 2:14 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote: On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote: On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi, This serie refactors the mv_xor in order to support the latest Armada 38x features, including the PQ support in order to offload the RAID6 PQ operations. Not all the PQ operations are supported by the XOR engine, so we had to introduce new async_tx flags in the process to identify un-supported operations. Please note that this is currently not usable because of a possible regression in the RAID stack in 4.1 that is being discussed at the moment here: https://lkml.org/lkml/2015/5/7/527 This is problematic as async_tx is a wart on the dmaengine subsystem and needs to be deprecated, I just have yet to find the time to do that work. It turns out it was a mistake to hide the device details from md, it should be explicitly managing the dma channels, not relying on a abstraction api. The async_tx api usage of the dma-mapping api is broken in that it relies on overlapping mappings of the same address. This happens to work on x86, but on arm it needs explicit non-overlapping mappings. I started the work to reference count dma-mappings in 3.13, and we need to teach md to use dmaengine_unmap_data explicitly. Yielding dma channel management to md also results in a more efficient implementation as we can dma_map() the stripe cache once rather than per-io. The async_tx_ack() disaster can also go away when md is explicitly handling channel switching. Even though I'd be very much in favor of deprecating / removing async_tx, is it something likely to happen soon? Not unless someone else takes it on, I'm actively asking for help. I remember discussing this with Vinod at Plumbers back in October, but haven't seen anything since then. Right, help! :) If not, I think that we shouldn't really hold back patches to async_tx, even though we know than in a year from now, it's going to be gone. We definitely should block new usages, because they make a bad situation worse. Russell already warned that the dma_mapping api abuse could lead to data corruption on ARM (speculative pre-fetching). We need to mark ASYNC_TX_DMA as depends on !ARM or even depends on BROKEN until we can get this resolved. I'm not sure what the issues exactly are with async_tx and ARM, but these patches have been tested on ARM and are working quite well. https://lkml.org/lkml/2011/7/8/363 What I'm doing here is merely using the existing API, I'm not making it worse, just using the API that is used by numerous drivers already. So I'm not sure this is really reasonable to ask for such a huge rework (with a huge potential of regressions) before merging my patches. It happens. https://lwn.net/Articles/641443/ I'm not happy about not having had the time to do this rework myself. Linux is better off with this api deprecated. -- To unsubscribe from this list: send the line unsubscribe linux-crypto in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/8] ARM: mvebu: Add support for RAID6 PQ offloading
Hi Dan, On Wed, May 13, 2015 at 09:00:46AM -0700, Dan Williams wrote: On Wed, May 13, 2015 at 2:17 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi Dan, On Tue, May 12, 2015 at 09:05:41AM -0700, Dan Williams wrote: On Tue, May 12, 2015 at 8:37 AM, Maxime Ripard maxime.rip...@free-electrons.com wrote: Hi, This serie refactors the mv_xor in order to support the latest Armada 38x features, including the PQ support in order to offload the RAID6 PQ operations. Not all the PQ operations are supported by the XOR engine, so we had to introduce new async_tx flags in the process to identify un-supported operations. Please note that this is currently not usable because of a possible regression in the RAID stack in 4.1 that is being discussed at the moment here: https://lkml.org/lkml/2015/5/7/527 This is problematic as async_tx is a wart on the dmaengine subsystem and needs to be deprecated, I just have yet to find the time to do that work. It turns out it was a mistake to hide the device details from md, it should be explicitly managing the dma channels, not relying on a abstraction api. The async_tx api usage of the dma-mapping api is broken in that it relies on overlapping mappings of the same address. This happens to work on x86, but on arm it needs explicit non-overlapping mappings. I started the work to reference count dma-mappings in 3.13, and we need to teach md to use dmaengine_unmap_data explicitly. Yielding dma channel management to md also results in a more efficient implementation as we can dma_map() the stripe cache once rather than per-io. The async_tx_ack() disaster can also go away when md is explicitly handling channel switching. Even though I'd be very much in favor of deprecating / removing async_tx, is it something likely to happen soon? Not unless someone else takes it on, I'm actively asking for help. I remember discussing this with Vinod at Plumbers back in October, but haven't seen anything since then. Right, help! :) If not, I think that we shouldn't really hold back patches to async_tx, even though we know than in a year from now, it's going to be gone. We definitely should block new usages, because they make a bad situation worse. Russell already warned that the dma_mapping api abuse could lead to data corruption on ARM (speculative pre-fetching). We need to mark ASYNC_TX_DMA as depends on !ARM or even depends on BROKEN until we can get this resolved. I'm not sure what the issues exactly are with async_tx and ARM, but these patches have been tested on ARM and are working quite well. What I'm doing here is merely using the existing API, I'm not making it worse, just using the API that is used by numerous drivers already. So I'm not sure this is really reasonable to ask for such a huge rework (with a huge potential of regressions) before merging my patches. Maxime -- Maxime Ripard, Free Electrons Embedded Linux, Kernel and Android engineering http://free-electrons.com signature.asc Description: Digital signature