Re: [PATCH v1 00/24] New fast registration API
On 10/07/2015 02:20 AM, Christoph Hellwig wrote:
> On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:
>> The issue is that the device requires the MR page array to have
>> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
>> page array allocation to be non-coherent I didn't take care of
>> alignment.
>
> Just curious: why did you switch away from the coherent dma allocations
> anyway? Seems like the page lists are mapped as long as they are
> allocated so the coherent allocator would seem like a nice fit.

Hello Christoph,

My concern is that caching and/or write combining might be disabled for
DMA coherent memory regions. This is why I assume that calling
dma_map_single() and dma_unmap_single() will be faster for registering
multiple pages as a single memory region instead of using DMA coherent
memory.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 00/24] New fast registration API
On 10/7/2015 6:46 PM, Bart Van Assche wrote:
> On 10/06/2015 11:42 PM, Sagi Grimberg wrote:
>> On 10/6/2015 9:49 PM, Bart Van Assche wrote:
>>> On 10/06/2015 01:37 AM, Sagi Grimberg wrote:
>>>> I see now the error you are referring to.
>>>>
>>>> The issue is that the device requires the MR page array to have
>>>> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
>>>> page array allocation to be non-coherent I didn't take care of
>>>> alignment.
>>>>
>>>> Taking care of this alignment may result in a higher order allocation
>>>> as we'd need to add (alignment - 1) to the allocation size.
>>>>
>>>> e.g. a 512 pages on mlx4 will become:
>>>> 512 * 8 + 0x40 - 1 = 4159
>>>>
>>>> I'm leaning towards this approach. Any preference?
>>>>
>>>> I think this patch should take care of mlx4:
>>>> [ ... ]
>>>
>>> Hello Sagi,
>>>
>>> Thanks for the patch. But since the patch included in the previous
>>> e-mail mapped a memory range that could be outside the bounds of the
>>> allocated memory I have been testing the patch below:
>>
>> Thanks! I'll correct the patches.
>>
>> Can I take it as your Tested-by on srp?
>
> Sure :-) But please keep in mind that I currently only have access to
> ConnectX-3 HCAs for testing RDMA software and not to any other RDMA HCA
> model.

Thanks Bart.

For what it's worth, I've tested srp (and iser + nfs) on both CX3 and CX4
with your config file.

Cheers,
Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 10/06/2015 11:42 PM, Sagi Grimberg wrote:
> On 10/6/2015 9:49 PM, Bart Van Assche wrote:
>> On 10/06/2015 01:37 AM, Sagi Grimberg wrote:
>>> I see now the error you are referring to.
>>>
>>> The issue is that the device requires the MR page array to have
>>> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
>>> page array allocation to be non-coherent I didn't take care of
>>> alignment.
>>>
>>> Taking care of this alignment may result in a higher order allocation
>>> as we'd need to add (alignment - 1) to the allocation size.
>>>
>>> e.g. a 512 pages on mlx4 will become:
>>> 512 * 8 + 0x40 - 1 = 4159
>>>
>>> I'm leaning towards this approach. Any preference?
>>>
>>> I think this patch should take care of mlx4:
>>> [ ... ]
>>
>> Hello Sagi,
>>
>> Thanks for the patch. But since the patch included in the previous
>> e-mail mapped a memory range that could be outside the bounds of the
>> allocated memory I have been testing the patch below:
>
> Thanks! I'll correct the patches.
>
> Can I take it as your Tested-by on srp?

Sure :-) But please keep in mind that I currently only have access to
ConnectX-3 HCAs for testing RDMA software and not to any other RDMA HCA
model.

Bart.
Re: [PATCH v1 00/24] New fast registration API
> I don't really care either way, it just seemed like an odd change hiding
> in here that I missed when reviewing earlier.

OK, so I'm sticking with it until someone suggests otherwise.

Sagi.
Re: [PATCH v1 00/24] New fast registration API
On Wed, Oct 07, 2015 at 12:25:25PM +0300, Sagi Grimberg wrote:
> Bart suggested that having to sync once for the entire page list might
> perform better than coherent memory. I'll settle either way since using
> non-coherent memory might cause higher-order allocations due to
> alignment, so it's not free-of-charge.

I don't really care either way, it just seemed like an odd change hiding
in here that I missed when reviewing earlier.
Re: [PATCH v1 00/24] New fast registration API
On 10/7/2015 12:20 PM, Christoph Hellwig wrote:
> On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:
>> The issue is that the device requires the MR page array to have
>> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
>> page array allocation to be non-coherent I didn't take care of
>> alignment.
>
> Just curious: why did you switch away from the coherent dma allocations
> anyway? Seems like the page lists are mapped as long as they are
> allocated so the coherent allocator would seem like a nice fit.

Bart suggested that having to sync once for the entire page list might
perform better than coherent memory. I'll settle either way since using
non-coherent memory might cause higher-order allocations due to
alignment, so it's not free-of-charge.

Sagi.
Re: [PATCH v1 00/24] New fast registration API
On Tue, Oct 06, 2015 at 11:37:40AM +0300, Sagi Grimberg wrote:
> The issue is that the device requires the MR page array to have
> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
> page array allocation to be non-coherent I didn't take care of
> alignment.

Just curious: why did you switch away from the coherent dma allocations
anyway? Seems like the page lists are mapped as long as they are
allocated so the coherent allocator would seem like a nice fit.
Re: [PATCH v1 00/24] New fast registration API
On 10/6/2015 9:49 PM, Bart Van Assche wrote:
> On 10/06/2015 01:37 AM, Sagi Grimberg wrote:
>> I see now the error you are referring to.
>>
>> The issue is that the device requires the MR page array to have
>> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
>> page array allocation to be non-coherent I didn't take care of
>> alignment.
>>
>> Taking care of this alignment may result in a higher order allocation
>> as we'd need to add (alignment - 1) to the allocation size.
>>
>> e.g. a 512 pages on mlx4 will become:
>> 512 * 8 + 0x40 - 1 = 4159
>>
>> I'm leaning towards this approach. Any preference?
>>
>> I think this patch should take care of mlx4:
>> [ ... ]
>
> Hello Sagi,
>
> Thanks for the patch. But since the patch included in the previous
> e-mail mapped a memory range that could be outside the bounds of the
> allocated memory I have been testing the patch below:

Thanks! I'll correct the patches.

Can I take it as your Tested-by on srp?

Cheers,
Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 10/06/2015 01:37 AM, Sagi Grimberg wrote:
> I see now the error you are referring to.
>
> The issue is that the device requires the MR page array to have
> an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
> page array allocation to be non-coherent I didn't take care of
> alignment.
>
> Taking care of this alignment may result in a higher order allocation
> as we'd need to add (alignment - 1) to the allocation size.
>
> e.g. a 512 pages on mlx4 will become:
> 512 * 8 + 0x40 - 1 = 4159
>
> I'm leaning towards this approach. Any preference?
>
> I think this patch should take care of mlx4:
> [ ... ]

Hello Sagi,

Thanks for the patch. But since the patch included in the previous
e-mail mapped a memory range that could be outside the bounds of the
allocated memory I have been testing the patch below:

---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  3 +++
 drivers/infiniband/hw/mlx4/mr.c      | 19 ++++++++++++-------
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index de6eab3..864d595 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -129,6 +129,8 @@ struct mlx4_ib_cq {
 	struct list_head	recv_qp_list;
 };
 
+#define MLX4_MR_PAGES_ALIGN 0x40
+
 struct mlx4_ib_mr {
 	struct ib_mr		ibmr;
 	__be64			*pages;
@@ -137,6 +139,7 @@ struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem		*umem;
+	void			*pages_alloc;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index fa01f75..8121c1c 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -277,12 +277,17 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 		      int max_pages)
 {
 	int size = max_pages * sizeof(u64);
+	int add_size;
 	int ret;
 
-	mr->pages = kzalloc(size, GFP_KERNEL);
-	if (!mr->pages)
+	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+
+	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
+	if (!mr->pages_alloc)
 		return -ENOMEM;
 
+	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
+
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
 				      size, DMA_TO_DEVICE);
 
@@ -293,20 +298,20 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 	return 0;
 
 err:
-	kfree(mr->pages);
+	kfree(mr->pages_alloc);
 	return ret;
 }
 
 static void
 mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 {
-	struct ib_device *device = mr->ibmr.device;
-	int size = mr->max_pages * sizeof(u64);
-
 	if (mr->pages) {
+		struct ib_device *device = mr->ibmr.device;
+		int size = mr->max_pages * sizeof(u64);
+
 		dma_unmap_single(device->dma_device, mr->page_map,
 				 size, DMA_TO_DEVICE);
-		kfree(mr->pages);
+		kfree(mr->pages_alloc);
 		mr->pages = NULL;
 	}
 }
-- 
2.1.4
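The over-allocate-then-align idea in the patch above can be sketched in
plain userspace C. This is only an illustration, not the driver code:
align_up() stands in for the kernel's PTR_ALIGN(), ASSUMED_MINALIGN is an
assumed value for ARCH_KMALLOC_MINALIGN (which depends on the
architecture), and window_fits() is a hypothetical helper that checks
whether a mapped range stays inside the allocation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins; the real kernel values depend on the platform. */
#define MR_PAGES_ALIGN   0x40u /* mlx4 alignment quoted in the thread */
#define ASSUMED_MINALIGN 8u    /* assumed allocator alignment guarantee */

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Extra bytes to allocate so that an aligned payload window always fits. */
static size_t pad_bytes(size_t align, size_t minalign)
{
	return align > minalign ? align - minalign : 0;
}

/*
 * Return 1 if [aligned start, aligned start + map_len) lies inside the
 * allocation [base, base + alloc_len), and 0 otherwise.
 */
static int window_fits(uintptr_t base, size_t alloc_len,
		       size_t map_len, uintptr_t align)
{
	uintptr_t start = align_up(base, align);

	return start + map_len <= base + alloc_len;
}
```

With a worst-case base address that is only 8-byte aligned, mapping just
the payload size from the aligned pointer fits exactly, whereas mapping
the payload plus the padding from the same pointer runs past the end of
the allocation. That appears to be the out-of-bounds range Bart refers to.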
Re: [PATCH v1 00/24] New fast registration API
On 10/2/2015 6:37 PM, Bart Van Assche wrote:
> On 10/01/2015 11:14 PM, Sagi Grimberg wrote:
>> Would you mind sending me your .config?
>
> Hello Sagi,

Hi Bart,

> I just sent this .config file to you off-list.

I see now the error you are referring to.

The issue is that the device requires the MR page array to have
an alignment (0x40 for mlx4 and 0x400 for mlx5). When I modified the
page array allocation to be non-coherent I didn't take care of
alignment.

Taking care of this alignment may result in a higher order allocation
as we'd need to add (alignment - 1) to the allocation size.

e.g. a 512 pages on mlx4 will become:
512 * 8 + 0x40 - 1 = 4159

I'm leaning towards this approach. Any preference?

I think this patch should take care of mlx4:

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index de6eab3..4c69247 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -129,6 +129,8 @@ struct mlx4_ib_cq {
 	struct list_head	recv_qp_list;
 };
 
+#define MLX4_MR_PAGES_ALIGN 0x40
+
 struct mlx4_ib_mr {
 	struct ib_mr		ibmr;
 	__be64			*pages;
@@ -137,6 +139,7 @@ struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem		*umem;
+	void			*pages_alloc;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index fa01f75..d3f8175 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -279,10 +279,14 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 	int size = max_pages * sizeof(u64);
 	int ret;
 
-	mr->pages = kzalloc(size, GFP_KERNEL);
-	if (!mr->pages)
+	size += max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+
+	mr->pages_alloc = kzalloc(size, GFP_KERNEL);
+	if (!mr->pages_alloc)
 		return -ENOMEM;
 
+	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
+
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
 				      size, DMA_TO_DEVICE);
 
@@ -293,20 +297,22 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 	return 0;
 
 err:
-	kfree(mr->pages);
+	kfree(mr->pages_alloc);
 	return ret;
 }
 
 static void
 mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 {
-	struct ib_device *device = mr->ibmr.device;
-	int size = mr->max_pages * sizeof(u64);
-
 	if (mr->pages) {
+		struct ib_device *device = mr->ibmr.device;
+		int size = mr->max_pages * sizeof(u64);
+
+		size += max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
+
 		dma_unmap_single(device->dma_device, mr->page_map,
 				 size, DMA_TO_DEVICE);
-		kfree(mr->pages);
+		kfree(mr->pages_alloc);
 		mr->pages = NULL;
 	}
 }
--

Sagi.
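The size arithmetic in this message ("512 * 8 + 0x40 - 1 = 4159") and the
remark about higher-order allocations can be checked with a small
userspace sketch. The helper names are hypothetical, and pow2_bucket()
assumes the allocator rounds large requests up to a power-of-two size
class, which is a simplification of what kmalloc() does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed for nents 64-bit page entries plus worst-case padding. */
static size_t padded_size(size_t nents, size_t align)
{
	return nents * sizeof(uint64_t) + align - 1;
}

/* Smallest power of two >= n (assumed model of allocator size classes). */
static size_t pow2_bucket(size_t n)
{
	size_t bucket = 1;

	while (bucket < n)
		bucket <<= 1;
	return bucket;
}
```

Under that assumption a 512-entry list without padding fits a 4096-byte
bucket, while the padded 4159-byte request lands in the 8192-byte one,
which is the cost Sagi describes.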
Re: [PATCH v1 00/24] New fast registration API
On 10/01/2015 11:14 PM, Sagi Grimberg wrote:
> Would you mind sending me your .config?

Hello Sagi,

I just sent this .config file to you off-list.

Bart.
Re: [PATCH v1 00/24] New fast registration API
On 10/01/2015 10:53 AM, Bart Van Assche wrote:
> On 10/01/2015 12:16 AM, Sagi Grimberg wrote:
>> I wonder what is the difference between our test environments?
>> I can't look into this if I'm not able to reproduce.
>
> Hello Sagi,
>
> At the target side I see "Sep 30 12:56:06 ibdev1 kernel: [178664.300296]
> ib_srpt: RDMA t 5 for idx 0 failed with status 10." (status 10
> corresponds to IB_WC_REM_ACCESS_ERR). I will try to determine the root
> cause.

(replying to my own e-mail)

Hello Sagi,

To determine which side is causing this issue I captured the traffic
between initiator and target with the MLNX_OFED ibdump tool (the dump has
been attached to this e-mail). As one can see in that capture the target
driver used exactly the same virtual address and length that were
specified in the SRP_CMD request. To me this means that v1 of this patch
series introduces a regression at the initiator side - either in the SRP
initiator driver or in the mlx4 driver.

The only difference between our test setups that could be relevant is
that in my tests several kernel debugging options were enabled at the
initiator side (including SLUB_DEBUG_ON=y). As one can see in the
attached capture the buffer allocated at the initiator side for the SCSI
INQUIRY request was not aligned on a page boundary.

Bart.

[Attachment: sniffer.pcap (application/vnd.tcpdump.pcap)]
Re: [PATCH v1 00/24] New fast registration API
On 10/01/2015 12:16 AM, Sagi Grimberg wrote:
>>>> Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
>>>> I placed that in branch reg_api.5. Would you mind running reg_api.5
>>>> and see if this issue persists (I would be surprised because I haven't
>>>> seen any sign of it)?
>>>
>>> Sorry but I still see these messages with the reg_api.5 branch.
>>>
>>> [root@ib-ini linux-kernel]# git show HEAD | grep ^commit
>>> commit 3b5b34777d3cd606433f0aca51e3885323648e07
>>> [root@ib-ini linux-kernel]# uname -a
>>> Linux ib-ini 4.2.0-rc6-debug+ #1 SMP Wed Sep 30 11:38:36 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> I will try to run a bisect.
>>
>> (replying to my own e-mail)
>>
>> Apparently this behavior got introduced through the patch "IB/srp:
>> Convert to new registration API" (commit
>> ad66cbace5ca8c60673bedf35e5027868b0dd2d7). Without that patch SRP I/O
>> works fine. With that patch I see receive failures being reported. The
>> SRP initiator was loaded on my setup with the following kernel driver
>> options:
>>
>> # cat /etc/modprobe.d/ib_srp.conf
>> options ib_srp cmd_sg_entries=255 prefer_fr=1 register_always=1
>
> Strange. I don't see that.
>
> options ib_srp prefer_fr=1 register_always=1 are set by default.
>
> When I try to connect srp initiator against upstream srpt with
> cmd_sg_entries=255 I get CM reject on iu max size:
>
> kernel: scsi host17: ib_srp: REJ received
> kernel: scsi host17: ib_srp: SRP_LOGIN_REJ: requested max_it_iu_len too large
> kernel: scsi host17: ib_srp: Connection 0/8 failed
> kernel: scsi host17: ib_srp: Sending CM DREQ failed
>
> When I connect with cmd_sg_entries=128 I successfully connect:
>
> kernel: scsi host18: SRP.T10:F452140300117400
> kernel: scsi 18:0:0:0: Direct-Access LIO-ORG RAMDISK-MCP 4.0 PQ: 0 ANSI: 5
> kernel: scsi host18: ib_srp: new target: id_ext f452140300117400 ioc_guid f452140300117400 pkey service_id f452140300117400 sgid fe80::::f452:1403:0011:7411 dgid fe80::::f452:1403:0011:7401
> kernel: sd 18:0:0:0: [sdy] 20480 512-byte logical blocks: (10.4 MB/10.0 MiB)
> kernel: sd 18:0:0:0: [sdy] Write Protect is off
> kernel: sd 18:0:0:0: [sdy] Mode Sense: 43 00 00 08
> kernel: sd 18:0:0:0: [sdy] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
> kernel: sd 18:0:0:0: [sdy] Attached SCSI disk
>
> I wonder what is the difference between our test environments?
> I can't look into this if I'm not able to reproduce.

Hello Sagi,

At the target side I see "Sep 30 12:56:06 ibdev1 kernel: [178664.300296]
ib_srpt: RDMA t 5 for idx 0 failed with status 10." (status 10
corresponds to IB_WC_REM_ACCESS_ERR). I will try to determine the root
cause.

Bart.
Re: [PATCH v1 00/24] New fast registration API
>>> Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
>>> I placed that in branch reg_api.5. Would you mind running reg_api.5
>>> and see if this issue persists (I would be surprised because I haven't
>>> seen any sign of it)?
>>
>> Sorry but I still see these messages with the reg_api.5 branch.
>>
>> [root@ib-ini linux-kernel]# git show HEAD | grep ^commit
>> commit 3b5b34777d3cd606433f0aca51e3885323648e07
>> [root@ib-ini linux-kernel]# uname -a
>> Linux ib-ini 4.2.0-rc6-debug+ #1 SMP Wed Sep 30 11:38:36 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>>
>> I will try to run a bisect.
>
> (replying to my own e-mail)
>
> Apparently this behavior got introduced through the patch "IB/srp:
> Convert to new registration API" (commit
> ad66cbace5ca8c60673bedf35e5027868b0dd2d7). Without that patch SRP I/O
> works fine. With that patch I see receive failures being reported. The
> SRP initiator was loaded on my setup with the following kernel driver
> options:
>
> # cat /etc/modprobe.d/ib_srp.conf
> options ib_srp cmd_sg_entries=255 prefer_fr=1 register_always=1

Strange. I don't see that.

options ib_srp prefer_fr=1 register_always=1 are set by default.

When I try to connect srp initiator against upstream srpt with
cmd_sg_entries=255 I get CM reject on iu max size:

kernel: scsi host17: ib_srp: REJ received
kernel: scsi host17: ib_srp: SRP_LOGIN_REJ: requested max_it_iu_len too large
kernel: scsi host17: ib_srp: Connection 0/8 failed
kernel: scsi host17: ib_srp: Sending CM DREQ failed

When I connect with cmd_sg_entries=128 I successfully connect:

kernel: scsi host18: SRP.T10:F452140300117400
kernel: scsi 18:0:0:0: Direct-Access LIO-ORG RAMDISK-MCP 4.0 PQ: 0 ANSI: 5
kernel: scsi host18: ib_srp: new target: id_ext f452140300117400 ioc_guid f452140300117400 pkey service_id f452140300117400 sgid fe80::::f452:1403:0011:7411 dgid fe80::::f452:1403:0011:7401
kernel: sd 18:0:0:0: [sdy] 20480 512-byte logical blocks: (10.4 MB/10.0 MiB)
kernel: sd 18:0:0:0: [sdy] Write Protect is off
kernel: sd 18:0:0:0: [sdy] Mode Sense: 43 00 00 08
kernel: sd 18:0:0:0: [sdy] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
kernel: sd 18:0:0:0: [sdy] Attached SCSI disk

I wonder what is the difference between our test environments?
I can't look into this if I'm not able to reproduce.

Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 09/30/2015 11:59 AM, Bart Van Assche wrote:
> On 09/29/2015 01:58 PM, Sagi Grimberg wrote:
>> On 9/29/2015 10:03 PM, Bart Van Assche wrote:
>>> On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
>>>> - Converted SRP initiator and RDS iwarp ULPs to the new API
>>>
>>> How has the converted SRP initiator driver been tested ?
>>> With the kernel tree that is available on branch reg_api.4
>>
>> That's odd.
>>
>> Although I haven't formally submitted reg_api.4 yet, I did test ib_srp
>> initiator against upstream srpt over CX3 (mlx4) and CX4 (mlx5). I ran
>> connect, disconnect, stress IO of all block sizes and some unaligned
>> block-IO and SG_IO test utilities. It all seems to pass for me.
>>
>> Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
>> I placed that in branch reg_api.5. Would you mind running reg_api.5
>> and see if this issue persists (I would be surprised because I haven't
>> seen any sign of it)?
>
> Sorry but I still see these messages with the reg_api.5 branch.
>
> [root@ib-ini linux-kernel]# git show HEAD | grep ^commit
> commit 3b5b34777d3cd606433f0aca51e3885323648e07
> [root@ib-ini linux-kernel]# uname -a
> Linux ib-ini 4.2.0-rc6-debug+ #1 SMP Wed Sep 30 11:38:36 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>
> I will try to run a bisect.

(replying to my own e-mail)

Apparently this behavior got introduced through the patch "IB/srp:
Convert to new registration API" (commit
ad66cbace5ca8c60673bedf35e5027868b0dd2d7). Without that patch SRP I/O
works fine. With that patch I see receive failures being reported. The
SRP initiator was loaded on my setup with the following kernel driver
options:

# cat /etc/modprobe.d/ib_srp.conf
options ib_srp cmd_sg_entries=255 prefer_fr=1 register_always=1

Bart.
Re: [PATCH v1 00/24] New fast registration API
On 09/29/2015 01:58 PM, Sagi Grimberg wrote:
> On 9/29/2015 10:03 PM, Bart Van Assche wrote:
>> On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
>>> - Converted SRP initiator and RDS iwarp ULPs to the new API
>>
>> How has the converted SRP initiator driver been tested ?
>> With the kernel tree that is available on branch reg_api.4
>
> That's odd.
>
> Although I haven't formally submitted reg_api.4 yet, I did test ib_srp
> initiator against upstream srpt over CX3 (mlx4) and CX4 (mlx5). I ran
> connect, disconnect, stress IO of all block sizes and some unaligned
> block-IO and SG_IO test utilities. It all seems to pass for me.
>
> Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
> I placed that in branch reg_api.5. Would you mind running reg_api.5
> and see if this issue persists (I would be surprised because I haven't
> seen any sign of it)?

Sorry but I still see these messages with the reg_api.5 branch.

[root@ib-ini linux-kernel]# git show HEAD | grep ^commit
commit 3b5b34777d3cd606433f0aca51e3885323648e07
[root@ib-ini linux-kernel]# uname -a
Linux ib-ini 4.2.0-rc6-debug+ #1 SMP Wed Sep 30 11:38:36 PDT 2015 x86_64 x86_64 x86_64 GNU/Linux

I will try to run a bisect.

Bart.
Re: [PATCH v1 00/24] New fast registration API
Hi Bart,

> How has the converted SRP initiator driver been tested ?
> With the kernel tree that is available on branch reg_api.4

That's odd.

Although I haven't formally submitted reg_api.4 yet, I did test ib_srp
initiator against upstream srpt over CX3 (mlx4) and CX4 (mlx5). I ran
connect, disconnect, stress IO of all block sizes and some unaligned
block-IO and SG_IO test utilities. It all seems to pass for me.

Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
I placed that in branch reg_api.5. Would you mind running reg_api.5
and see if this issue persists (I would be surprised because I haven't
seen any sign of it)?

I'm waiting for your input before submitting v2 of this series.

Thanks,
Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 9/29/2015 10:03 PM, Bart Van Assche wrote:
> On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
>> - Converted SRP initiator and RDS iwarp ULPs to the new API
>
> Hello Sagi,

Hi Bart,

> How has the converted SRP initiator driver been tested ?
> With the kernel tree that is available on branch reg_api.4

That's odd.

Although I haven't formally submitted reg_api.4 yet, I did test ib_srp
initiator against upstream srpt over CX3 (mlx4) and CX4 (mlx5). I ran
connect, disconnect, stress IO of all block sizes and some unaligned
block-IO and SG_IO test utilities. It all seems to pass for me.

Just this morning (my morning) I tested the v2 set on iser, srp, nfs.
I placed that in branch reg_api.5. Would you mind running reg_api.5
and see if this issue persists (I would be surprised because I haven't
seen any sign of it)?

Thanks,
Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
> - Converted SRP initiator and RDS iwarp ULPs to the new API

Hello Sagi,

How has the converted SRP initiator driver been tested ? With the kernel
tree that is available on branch reg_api.4
(427def03e9fa9801efbb27f6c3c6bf7fc0d012e1) I see on the initiator system
that login fails and that the following message is logged:

Sep 29 12:01:05 ion-dev-ib-ini kernel: scsi host72: ib_srp: failed receive status WR flushed (5) for iu 88045bb80930

Thanks,

Bart.
Re: [PATCH v1 00/24] New fast registration API
On Thu, Sep 24, 2015 at 09:53:29AM +0300, Sagi Grimberg wrote:
> Thanks Christoph,
>
> should I take it as your "Tested-by: " on ib_core + mlx4 changes?

Yes. And an Acked-by: for the whole series.
Re: [PATCH v1 00/24] New fast registration API
On 9/23/2015 12:22 AM, Bart Van Assche wrote:
> On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
>> came from Bart Van Assache which pointed out that some applications
>
> Most people appreciate it if their name is spelled correctly :-)

Sorry about that :)
Re: [PATCH v1 00/24] New fast registration API
On 9/20/2015 1:45 AM, Christoph Hellwig wrote:
> Hi Sagi,
>
> I've converted the driver I'm developing to your API and it works
> great. I think this is an important step towards making the RDMA more
> usable!

Thanks Christoph,

should I take it as your "Tested-by: " on ib_core + mlx4 changes?
Re: [PATCH v1 00/24] New fast registration API
On 09/17/2015 02:42 AM, Sagi Grimberg wrote:
> came from Bart Van Assache which pointed out that some applications

Most people appreciate it if their name is spelled correctly :-)

Bart.
Re: [PATCH v1 00/24] New fast registration API
On 9/22/2015 12:56 AM, Sagi Grimberg wrote:
> On 9/22/2015 10:19 AM, Sagi Grimberg wrote:
>>> As mentioned earlier, I have a WIP RDS fastreg branch [3] which is
>>> functional (at least I can RDMA messages across nodes ;-)).
>>
>> Nice!
>>
>>> So merging [2] and [3], I created [4] and applied a delta change
>>> based on your other patches. I saw ib_post_send failure with my HCA
>>> driver returning '-EINVAL'. I didn't debug it further but at least
>>> opcode and num_sge were set correctly so I shouldn't have seen it.
>>> So I did memset() on reg_wr which seems to have helped to fix the
>>> ib_post_send() failure.
>>
>> Yep - that was my fault. When converting the ULPs I optimized by
>> removing the memset but I forgot to set reg_wr.wr.next = NULL when
>> the ULP needed. This caused the driver to read a second bogus work
>> request. Steve just reported this as well so I'll fix that in v2.

Ahh, right. There can be chain of wr.

>>> But I got into remote access errors which tells me that I have
>>> messed up setup(rkey, sge setup or access flags)
>>
>> One thing that pops is that in the old API the MR was registered
>> with iova_start = 0 (which is probably what was sent to the peer),
>> but the new API the iova is implicitly sg_dma_address(&sg[0]).
>>
>> The registered MR holds these attributes in:
>> mr->rkey
>> mr->iova
>> mr->length
>>
>> These should be passed to a peer to perform rdma.

right.

> ohh, I just read the RDS 3.1 specification (for the first time..) and
> I noticed that RDS 3.1 header extension contains only a 32bit offset
> parameter. Why is that anyway? why not 64bit so it can be a valid
> mapped address? Also the code doesn't use it at all and always passes
> 0 (which is buggy if sg[0] has an offset from a page).
>
> This won't work with the proposed API as the iova is 64bit (as all
> other existing RDMA protocols use 64bit addresses).
>
> In any event, I'd much rather to add ib_map_mr_sg_zbva() just for RDS
> to use instead of polluting the API with an iova argument, but I think
> that the RDS spec can be updated to use 64bit offsets and align to all
> other RDMA protocols (it has enough space in h_exthdr which is 128bit).

RDS assumes it's an offset and hence it has been used as 32 bit. I need
to look through this carefully though because all the existing
application use this header format. There is also RDMA read/write byte
information sent as part of the header(Not in upstream code yet) so the
space might be less. But point taken. Will look into it.

> I was thinking of:
>
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index e7e0251..61fcab4 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -3033,6 +3033,21 @@ int ib_map_mr_sg(struct ib_mr *mr,
>  		 unsigned int sg_nents,
>  		 unsigned int page_size);
>  
> +static inline int
> +ib_map_mr_sg_zbva(struct ib_mr *mr,
> +		  struct scatterlist *sg,
> +		  unsigned int sg_nents,
> +		  unsigned int page_size)
> +{
> +	int rc;
> +
> +	rc = ib_map_mr_sg(mr, sg, sg_nents, page_size);
> +	if (likely(!rc))
> +		mr->iova &= ((u64)page_size - 1);
> +
> +	return rc;
> +}
> +
>  int ib_sg_to_pages(struct ib_mr *mr,
>  		   struct scatterlist *sgl,
>  		   unsigned int sg_nents,
> --
>
> Thoughts?
>
> Santosh, can you use that one instead and let us know if
> it resolves your issue?

Unfortunately this change still doesn't fix the issue.

> I think you should make sure to correctly construct the
> h_exthdr with:
>
> rds_rdma_make_cookie(mr->rkey, (32)mr->iova)

Will look into it. Thanks for suggestion.

Regards,
Santosh
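The effect of the proposed ib_map_mr_sg_zbva() mask can be illustrated
with a standalone sketch. All names below are hypothetical userspace
stand-ins, and the cookie layout (rkey in the low 32 bits, offset in the
high 32 bits) is an assumption made for illustration rather than
something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the offset within the first page, as the proposed mask does. */
static uint64_t zbva_offset(uint64_t iova, uint64_t page_size)
{
	return iova & (page_size - 1);
}

/* Assumed layout: rkey in bits 0..31, 32-bit offset in bits 32..63. */
static uint64_t make_cookie(uint32_t rkey, uint32_t offset)
{
	return (uint64_t)rkey | ((uint64_t)offset << 32);
}

static uint32_t cookie_rkey(uint64_t cookie)
{
	return (uint32_t)cookie;
}

static uint32_t cookie_offset(uint64_t cookie)
{
	return (uint32_t)(cookie >> 32);
}
```

After masking, the value is smaller than the page size, so it fits the
32-bit offset field without loss. A full 64-bit DMA address would be
truncated by the same cast, which is the mismatch with the RDS header
that Sagi points out.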
Re: [PATCH v1 00/24] New fast registration API
On 9/22/2015 10:19 AM, Sagi Grimberg wrote: As mentioned earlier, I have a WIP RDS fastreg branch [3] which is functional (at least I can RDMA messages across nodes ;-)). Nice! So merging [2] and [3], I created [4] and applied a delta change based on your other patches. I saw ib_post_send failure with my HCA driver returning '-EINVAL'. I didn't debug it further but at least opcode and num_sge were set correctly so I shouldn't have seen it. So I did memset() on reg_wr which seems to have helped to fix the ib_post_send() failure. Yep - that was my fault. When converting the ULPs I optimized by removing the memset but I forgot to set reg_wr.wr.next = NULL when the ULP needed. This caused the driver to read a second bogus work request. Steve just reported this as well so I'll fix that in v2. But I got into remote access errors which tells me that I have messed up setup(rkey, sge setup or access flags) One thing that pops is that in the old API the MR was registered with iova_start = 0 (which is probably what was sent to the peer), but the new API the iova is implicitly sg_dma_address(&sg[0]). The registered MR holds these attributes in: mr->rkey mr->iova mr->length These should be passed to a peer to perform rdma. Hope this helps, Sagi. ohh, I just read the RDS 3.1 specification (for the first time..) and I noticed that RDS 3.1 header extension contains only a 32bit offset parameter. Why is that anyway? why not 64bit so it can be a valid mapped address? Also the code doesn't use it at all and always passes 0 (which is buggy if sg[0] has an offset from a page). This won't work with the proposed API as the iova is 64bit (as all other existing RDMA protocols use 64bit addresses). In any event, I'd much rather to add ib_map_mr_sg_zbva() just for RDS to use instead of polluting the API with an iova argument, but I think that the RDS spec can be updated to use 64bit offsets and align to all other RDMA protocols (it has enough space in h_exthdr which is 128bit). 
I was thinking of:

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e7e0251..61fcab4 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -3033,6 +3033,21 @@ int ib_map_mr_sg(struct ib_mr *mr,
                  unsigned int sg_nents,
                  unsigned int page_size);
 
+static inline int
+ib_map_mr_sg_zbva(struct ib_mr *mr,
+                  struct scatterlist *sg,
+                  unsigned int sg_nents,
+                  unsigned int page_size)
+{
+       int rc;
+
+       rc = ib_map_mr_sg(mr, sg, sg_nents, page_size);
+       if (likely(!rc))
+               mr->iova &= ((u64)page_size - 1);
+
+       return rc;
+}
+
 int ib_sg_to_pages(struct ib_mr *mr,
                    struct scatterlist *sgl,
                    unsigned int sg_nents,
--

Thoughts? Santosh, can you use that one instead and let us know if it resolves your issue?

I think you should make sure to correctly construct the h_exthdr with:

rds_rdma_make_cookie(mr->rkey, (u32)mr->iova)
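To make the zero-based VA idea above concrete, here is a small user-space sketch in C. It is not the kernel code: the demo_* names are hypothetical, and the cookie layout (rkey in the low 32 bits, offset in the high 32 bits) is an assumption made for illustration that should be checked against the RDS sources before relying on it. It only shows how a full 64bit iova reduces to an in-page offset that fits the 32bit field discussed in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a full 64-bit I/O address to its offset inside the first page,
 * which is what the proposed zbva helper leaves in mr->iova. */
static inline uint64_t demo_zbva_offset(uint64_t iova, unsigned int page_size)
{
    /* The mask trick only works for power-of-two page sizes. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return iova & ((uint64_t)page_size - 1);
}

/* Pack an rkey and a 32-bit offset into one 64-bit cookie.
 * ASSUMPTION: rkey in the low half, offset in the high half. */
static inline uint64_t demo_make_cookie(uint32_t rkey, uint32_t offset)
{
    return (uint64_t)rkey | ((uint64_t)offset << 32);
}

/* Accessors a peer would use to recover both halves. */
static inline uint32_t demo_cookie_key(uint64_t cookie)
{
    return (uint32_t)cookie;
}

static inline uint32_t demo_cookie_offset(uint64_t cookie)
{
    return (uint32_t)(cookie >> 32);
}
```

With a 4096-byte page, an iova of 0x7f001234a10 reduces to offset 0xa10, which survives the cast to 32 bits. A full DMA address generally would not, which is why passing the unreduced iova through a 32bit field cannot work.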
Re: [PATCH v1 00/24] New fast registration API
As mentioned earlier, I have a WIP RDS fastreg branch [3] which is functional (at least I can RDMA messages across nodes ;-)).

Nice!

So merging [2] and [3], I created [4] and applied a delta change based on your other patches. I saw ib_post_send failure with my HCA driver returning '-EINVAL'. I didn't debug it further but at least opcode and num_sge were set correctly so I shouldn't have seen it. So I did memset() on reg_wr which seems to have helped to fix the ib_post_send() failure.

Yep - that was my fault. When converting the ULPs I optimized by removing the memset, but I forgot to set reg_wr.wr.next = NULL where the ULP needed it. This caused the driver to read a second, bogus work request. Steve just reported this as well so I'll fix that in v2.

But I got into remote access errors which tells me that I have messed up setup(rkey, sge setup or access flags)

One thing that pops out is that in the old API the MR was registered with iova_start = 0 (which is probably what was sent to the peer), but in the new API the iova is implicitly sg_dma_address(&sg[0]). The registered MR holds these attributes in:

mr->rkey
mr->iova
mr->length

These should be passed to a peer to perform rdma.

Hope this helps,
Sagi.
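The wr.next problem described above can be sketched in user-space C. The structs below are hypothetical, heavily simplified stand-ins for the kernel's work request types (only the embedded next pointer matters here), and demo_count_chain() merely imitates the way a provider walks a posted chain. The point is that a work request living on the stack must either be zeroed or have next terminated explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a send work request. */
struct demo_send_wr {
    struct demo_send_wr *next;   /* chain link, NULL terminates */
    int opcode;
    int num_sge;
};

/* Hypothetical stand-in for a registration WR embedding the generic WR. */
struct demo_reg_wr {
    struct demo_send_wr wr;
    unsigned int key;
};

/* Walk the chain the way a post_send loop would, counting requests. */
static inline int demo_count_chain(const struct demo_send_wr *wr)
{
    int n = 0;

    for (; wr != NULL; wr = wr->next)
        n++;
    return n;
}

/* Option 1: zero the whole structure, then fill in what is needed. */
static inline void demo_init_memset(struct demo_reg_wr *reg, unsigned int key)
{
    assert(reg != NULL);
    memset(reg, 0, sizeof(*reg));
    reg->wr.opcode = 1;
    reg->key = key;
}

/* Option 2: skip the memset, but then every field the consumer reads,
 * including wr.next, has to be set by hand. */
static inline void demo_init_explicit(struct demo_reg_wr *reg, unsigned int key)
{
    assert(reg != NULL);
    reg->wr.next = NULL;
    reg->wr.opcode = 1;
    reg->wr.num_sge = 0;
    reg->key = key;
}
```

Either initializer leaves a chain of length one. Dropping the memset without adding the explicit next = NULL leaves whatever was on the stack in the link field, so the consumer follows it into a bogus second request.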
Re: [PATCH v1 00/24] New fast registration API
Hi Sagi,

On 9/20/15 2:36 AM, Sagi Grimberg wrote:

Hi Santosh,

Nice to see this consolidation happening. I too don't have access to iWARP hardware for RDS test but will use this series and convert our WIP IB fastreg code and see how it goes.

I'm very pleased to hear about this WIP. Please feel free to share anything you have (code and questions/dilemmas) with the list. Also, if you have more suggestions on how we can do better from your PoV we'd love to hear about it.

So as promised, I tried to test your series. Although your github branch [1] 'reg_api.3' mostly has 4.3-rc1 contents, it isn't based off 4.3-rc1, so I just cherry-picked the patches and created 'rdma/sagi/reg_api.3_cherrypick' [2]. I had a conflict with the iser patch so I just dropped that one.

As mentioned earlier, I have a WIP RDS fastreg branch [3] which is functional (at least I can RDMA messages across nodes ;-)). So merging [2] and [3], I created [4] and applied a delta change based on your other patches. I saw an ib_post_send failure with my HCA driver returning '-EINVAL'. I didn't debug it further, but at least opcode and num_sge were set correctly so I shouldn't have seen it. So I did memset() on reg_wr, which seems to have helped to fix the ib_post_send() failure.

But I got remote access errors, which tells me that I have messed up the setup (rkey, sge setup or access flags) or am missing some other patch(es) in my test tree [4]. The delta patch is the top commit on [4]. Please let me know if you spot something which I missed.

Regards,
Santosh

[1] https://github.com/sagigrimberg/linux/tree/reg_api.3
[2] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/log/?h=rdma/sagi/reg_api.3_cherrypick
[3] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/log/?h=net/rds/4.3-fr-wip
[4] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/commit/?h=test/reg_api.3/rds
Re: [PATCH v1 00/24] New fast registration API
Hi Santosh,

Nice to see this consolidation happening. I too don't have access to iWARP hardware for RDS test but will use this series and convert our WIP IB fastreg code and see how it goes.

I'm very pleased to hear about this WIP. Please feel free to share anything you have (code and questions/dilemmas) with the list. Also, if you have more suggestions on how we can do better from your PoV we'd love to hear about it.

Cheers,
Sagi.
Re: [PATCH v1 00/24] New fast registration API
On 9/17/15 5:42 AM, Sagi Grimberg wrote: Hi all, As discussed on the linux-rdma list, there is plenty of room for improvement in our memory registration APIs. We keep finding ULPs that are duplicating code, sometimes use wrong strategies and mis-use our current API. As a first step, this patch set replaces the fast registration API to accept a kernel common struct scatterlist and takes care of the page vector construction in the core layer with hooks for the drivers HW specific assignments. This allows to remove a common code duplication as it was done in each and every ULP driver. The changes from v0 (WIP) are: - Rebased on top of 4.3-rc1 + Christoph's ib_send_wr conversion patches - Allow the ULP to pass page_size argument to ib_map_mr_sg in order to have it work better in some specific workloads. This suggestion came from Bart Van Assache which pointed out that some applications might use page sizes significantly smaller than the system PAGE_SIZE of specific architectures - Fixed some logical bugs in ib_sg_to_pages - Added a set_page function pointer for drivers to pass to ib_sg_to_pages so some drivers (e.g mlx4, mlx5, nes) can avoid keeping a second page vector and/or re-iterate on the page vector in order to perform HW specific assignments (big/little endian conversion, extra flags) - Converted SRP initiator and RDS iwarp ULPs to the new API - Removed fast registration code from hfi1 driver (as it isn't supported anyway). I assume that the correct place to get the support back would be in a shared SW library (hfi1, qib, rxe). - Updated the change logs So far my tests covered: - ULPs: * iser initiator * iser target * xprtrdma * svcrdma - Drivers: * mlx4 * mlx5 * Steve Wise was kind enough to run NFS client/server over cxgb4 and I have yet to receive any negative feedback from him. I don't have access to other HW devices (qib, nes) nor iwarp devices so RDS is compile tested only. Nice to see this consolidaton happening. 
I too don't have access to iWARP hardware for RDS test but will use this series and convert our WIP IB fastreg code and see how it goes.

Regards,
Santosh
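The division of labour described in the quoted cover letter (the core walks the scatterlist, the driver only supplies a set_page hook) can be sketched in user-space C. Everything here is a hypothetical simplification: struct demo_sg, demo_sg_to_pages(), the "present" flag bit and the return convention are all invented for the example and do not mirror the real ib_sg_to_pages() signature. It also shows why a caller-chosen page_size matters: the same buffer yields more, smaller pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PAGES    16
#define DEMO_PAGE_PRESENT 1ULL   /* hypothetical HW-specific flag bit */

/* Hypothetical scatterlist entry: one DMA-contiguous segment. */
struct demo_sg {
    uint64_t dma_addr;
    uint32_t dma_len;
};

/* Driver hook: store one page address, return non-zero when full. */
typedef int (*demo_set_page_fn)(void *ctx, uint64_t page_addr);

/* Core helper: split the segments into page_size pages and hand each page
 * address to set_page. Returns the page count, or -1 if the list has a gap
 * (a middle boundary that is not page aligned) or the hook refuses a page. */
static inline int demo_sg_to_pages(const struct demo_sg *sg, size_t nents,
                                   unsigned int page_size,
                                   demo_set_page_fn set_page, void *ctx)
{
    uint64_t low = (uint64_t)page_size - 1;
    int npages = 0;

    assert(page_size != 0 && (page_size & low) == 0);

    for (size_t i = 0; i < nents; i++) {
        uint64_t start = sg[i].dma_addr;
        uint64_t end = start + sg[i].dma_len;

        if (sg[i].dma_len == 0)
            return -1;
        /* Only the first segment may start mid-page ... */
        if (i > 0 && (start & low))
            return -1;
        /* ... and only the last one may end mid-page. */
        if (i + 1 < nents && (end & low))
            return -1;

        for (uint64_t page = start & ~low; page < end; page += page_size) {
            if (set_page(ctx, page))
                return -1;
            npages++;
        }
    }
    return npages;
}

/* Example driver side: a fixed page vector plus a HW-specific flag. */
struct demo_page_list {
    uint64_t pages[DEMO_MAX_PAGES];
    size_t npages;
};

static inline int demo_set_page(void *ctx, uint64_t page_addr)
{
    struct demo_page_list *pl = ctx;

    if (pl->npages == DEMO_MAX_PAGES)
        return -1;
    pl->pages[pl->npages++] = page_addr | DEMO_PAGE_PRESENT;
    return 0;
}
```

Because the hook writes entries directly in the device's own format, the driver does not need a second page vector or a second pass, which is the motivation given for set_page in the cover letter.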
Re: [PATCH v1 00/24] New fast registration API
Hi Sagi,

I've converted the driver I'm developing to your API and it works great. I think this is an important step towards making RDMA more usable!
[PATCH v1 00/24] New fast registration API
Hi all,

As discussed on the linux-rdma list, there is plenty of room for improvement in our memory registration APIs. We keep finding ULPs that duplicate code, sometimes use wrong strategies and misuse our current API.

As a first step, this patch set replaces the fast registration API to accept a kernel common struct scatterlist and takes care of the page vector construction in the core layer, with hooks for the drivers' HW specific assignments. This allows us to remove code duplication that existed in each and every ULP driver.

The changes from v0 (WIP) are:
- Rebased on top of 4.3-rc1 + Christoph's ib_send_wr conversion patches
- Allow the ULP to pass a page_size argument to ib_map_mr_sg in order to have it work better in some specific workloads. This suggestion came from Bart Van Assche, who pointed out that some applications might use page sizes significantly smaller than the system PAGE_SIZE of specific architectures
- Fixed some logical bugs in ib_sg_to_pages
- Added a set_page function pointer for drivers to pass to ib_sg_to_pages so some drivers (e.g. mlx4, mlx5, nes) can avoid keeping a second page vector and/or re-iterating on the page vector in order to perform HW specific assignments (big/little endian conversion, extra flags)
- Converted SRP initiator and RDS iwarp ULPs to the new API
- Removed fast registration code from hfi1 driver (as it isn't supported anyway). I assume that the correct place to get the support back would be in a shared SW library (hfi1, qib, rxe).
- Updated the change logs

So far my tests covered:
- ULPs:
  * iser initiator
  * iser target
  * xprtrdma
  * svcrdma
- Drivers:
  * mlx4
  * mlx5
  * Steve Wise was kind enough to run NFS client/server over cxgb4 and I have yet to receive any negative feedback from him.

I don't have access to other HW devices (qib, nes) nor iwarp devices so RDS is compile tested only.

I'm targeting this to 4.4 so I'll appreciate more feedback and a bigger testing coverage.
The code is available at: https://github.com/sagigrimberg/linux/tree/reg_api.3

Sagi Grimberg (24):
  IB/core: Introduce new fast registration API
  IB/mlx5: Remove dead fmr code
  IB/mlx5: Support the new memory registration API
  IB/mlx4: Support the new memory registration API
  RDMA/ocrdma: Support the new memory registration API
  RDMA/cxgb3: Support the new memory registration API
  iw_cxgb4: Support the new memory registration API
  IB/qib: Support the new memory registration API
  RDMA/nes: Support the new memory registration API
  IB/iser: Port to new fast registration API
  iser-target: Port to new memory registration API
  xprtrdma: Port to new memory registration API
  svcrdma: Port to new memory registration API
  RDS/IW: Convert to new memory registration API
  IB/srp: Convert to new memory registration API
  IB/mlx5: Remove old FRWR API support
  IB/mlx4: Remove old FRWR API support
  RDMA/ocrdma: Remove old FRWR API
  RDMA/cxgb3: Remove old FRWR API
  iw_cxgb4: Remove old FRWR API
  IB/qib: Remove old FRWR API
  RDMA/nes: Remove old FRWR API
  IB/hfi1: Remove Old fast registraion API support
  IB/core: Remove old fast registration API

 drivers/infiniband/core/verbs.c             | 132 ---
 drivers/infiniband/hw/cxgb3/iwch_cq.c       |   2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c |  39 +++--
 drivers/infiniband/hw/cxgb3/iwch_provider.h |   2 +
 drivers/infiniband/hw/cxgb3/iwch_qp.c       |  37 +++--
 drivers/infiniband/hw/cxgb4/cq.c            |   2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h      |  25 +--
 drivers/infiniband/hw/cxgb4/mem.c           |  61 +++
 drivers/infiniband/hw/cxgb4/provider.c      |   3 +-
 drivers/infiniband/hw/cxgb4/qp.c            |  46 +++---
 drivers/infiniband/hw/mlx4/cq.c             |   2 +-
 drivers/infiniband/hw/mlx4/main.c           |   3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h        |  22 +--
 drivers/infiniband/hw/mlx4/mr.c             | 120 --
 drivers/infiniband/hw/mlx4/qp.c             |  30 ++--
 drivers/infiniband/hw/mlx5/cq.c             |   4 +-
 drivers/infiniband/hw/mlx5/main.c           |   3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h        |  47 +-
 drivers/infiniband/hw/mlx5/mr.c             | 107 +++-
 drivers/infiniband/hw/mlx5/qp.c             | 140
 drivers/infiniband/hw/nes/nes_hw.h          |   6 -
 drivers/infiniband/hw/nes/nes_verbs.c       | 161 +++---
 drivers/infiniband/hw/nes/nes_verbs.h       |   4 +
 drivers/infiniband/hw/ocrdma/ocrdma.h       |   2 +
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   3 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 151 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   7 +-
 drivers/infiniband/hw/qib/qib_keys.c        |  40 ++---
 drivers/infiniband/hw/qib/qib_mr.c          |  46 +++---
 drivers/infiniband/hw/qib/qib_verbs.c       |  13 +-
 drivers/infiniband/hw/qib/qib_verbs.h