Re: [PATCH V6 1/9] RDMA/iser: Limit sg tablesize and max_sectors to device fastreg max depth
On 7/24/2015 10:14 PM, Jason Gunthorpe wrote:
> On Fri, Jul 24, 2015 at 01:40:17PM -0500, Steve Wise wrote:
>>> Huh. How does this relate to the max_page_list_len argument:
>>>
>>>   struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len)
>>>
>>> Shouldn't max_fast_reg_page_list_len be checked during the above?
>>>
>>> Ie does this still make sense:
>>>
>>>   drivers/infiniband/ulp/iser/iser_verbs.c:
>>>     desc->data_mr = ib_alloc_fast_reg_mr(pd, ISCSI_ISER_SG_TABLESIZE + 1);
>>>
>>> ?
>>>
>>> The only ULP that checks this is SRP, so basically, all our ULPs are
>>> probably quietly broken? cxgb3 has a limit of 10 (!?!?!!)
>>
>> Yea seems like some drivers need to enforce this in
>> ib_alloc_fast_reg_mr() as well as ib_alloc_fast_reg_page_list(), and
>> ULPs need to not exceed the device max.
>
> Great, Sagi, can you incorporate that in your series so that
> ib_alloc_mr's max_entries is checked against
> max_fast_reg_page_list_len and EINVAL's if it is too great?

Yes. I'll take care of that.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
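[Editorial sketch] The check requested in this message could look roughly
like the following. This is a minimal userspace illustration, not the
kernel implementation: struct mock_device_attr and mock_check_mr_entries()
are hypothetical names, and only the max_fast_reg_page_list_len field
mentioned in the thread is modelled.

```c
#include <errno.h>

/* Hypothetical stand-in for the device attributes; only the limit
 * discussed in the thread is modelled. */
struct mock_device_attr {
	unsigned int max_fast_reg_page_list_len;
};

/* Returns 0 if an MR with max_entries page-list entries fits within the
 * device limit, -EINVAL if the request is too large. */
static int mock_check_mr_entries(const struct mock_device_attr *attr,
				 unsigned int max_entries)
{
	if (max_entries > attr->max_fast_reg_page_list_len)
		return -EINVAL;
	return 0;
}
```

With a device that advertises a limit of 10 (the cxgb3 figure quoted
above), a request for 10 entries passes and a request for 11 is rejected
with -EINVAL instead of being accepted silently.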
Re: [PATCH V6 4/9] svcrdma: Use max_sge_rd for destination read depths
On Sun, Jul 26, 2015 at 12:58:59PM +0300, Sagi Grimberg wrote:
>> With the above patch change, we have no more users of the recently
>> created rdma_cap_read_multi_sge(). Should I add a patch to remove it?
>
> Yes please.

And in the long run this is another argument for killing the
system-wide REMOTE_WRITE phys MR and requiring memory registrations for
iWarp.
Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API
On 7/23/2015 10:08 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:07:56PM +0300, Sagi Grimberg wrote:
>> On 7/22/2015 10:05 PM, Jason Gunthorpe wrote:
>>
>> The reason I named max_entries is because it might not be pages but
>> real SG elements. It stands for maximum registration entries.
>>
>> Do you have a better name?
>
> I wouldn't try and be both..
>
> Use 'max_num_sg' and document that no aggregate scatterlist with
> length larger than 'max_num_sg*PAGE_SIZE' or with more entries than
> max_num_sg can be submitted?
>
> Maybe document with ARB_SG that it is not length limited?

OK, I can do that.
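[Editorial sketch] The two limits proposed for 'max_num_sg' can be read as
a pair of checks on a scatterlist: an entry count and an aggregate length.
The following is a minimal illustration under stated assumptions:
mock_sg_entry, MOCK_PAGE_SIZE (fixed at 4096) and mock_check_sg_limits()
are hypothetical names, and the ARB_SG exception mentioned above is not
modelled.

```c
#include <errno.h>
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u	/* assumed page size for the illustration */

/* Hypothetical scatterlist entry; only the length matters here. */
struct mock_sg_entry {
	size_t length;
};

/* Returns 0 if the list has at most max_num_sg entries and an aggregate
 * length of at most max_num_sg * MOCK_PAGE_SIZE bytes, else -EINVAL. */
static int mock_check_sg_limits(const struct mock_sg_entry *sg,
				size_t nents, size_t max_num_sg)
{
	size_t total = 0;
	size_t i;

	if (nents > max_num_sg)
		return -EINVAL;

	for (i = 0; i < nents; i++)
		total += sg[i].length;

	if (total > max_num_sg * MOCK_PAGE_SIZE)
		return -EINVAL;

	return 0;
}
```

A single 12K entry is rejected for max_num_sg = 2 even though the entry
count is within bounds, which is the length restriction the proposed
documentation would spell out.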
Re: [PATCH V6 4/9] svcrdma: Use max_sge_rd for destination read depths
> @@ -1059,6 +1062,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>  		ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
>  		       route.addr.dst_addr)->sin_port),
>  		newxprt->sc_max_sge,
> +		newxprt->sc_max_sge_rd,
>  		newxprt->sc_sq_depth,
>  		newxprt->sc_max_requests,
>  		newxprt->sc_ord);
>
> With the above patch change, we have no more users of the recently
> created rdma_cap_read_multi_sge(). Should I add a patch to remove it?

Yes please.
Re: [PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd-local_dma_lkey
>> If we want security by default then I propose not only to change the
>> default value of register_always from false into true but also to
>> change the default value of prefer_fr from false into true such that
>> fast registration becomes the default instead of FMR.
>
> Yes, I was frowning at that stuff too..
>
> We are trying to get rid of FMR, so nothing should prefer it over
> FRWR...
>
> Sagi, perhaps that belongs in your MR unification series?

I don't see how this fits in.
Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API
On 7/23/2015 9:51 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 07:47:14PM +0300, Sagi Grimberg wrote:
>>> So we force ULPs to think about what they are doing properly, and we
>>> get a chance to actually force lkey to be local use only for IB.
>>
>> The lkey/rkey decision is passed in the fastreg post_send().
>
> That is too late to check the access flags.

Why? the access permissions are kept in the mr context?

> Sure, one could do if (key == mr->lkey) .. check lkey flags in the
> post, but that seems silly considering we want the post inlined..

Why should we check the lkey/rkey access flags in the post?

>> I can move it to the post interface if it makes more sense.
>> the access is kind of out of place in the mapping routine anyway...
>
> All the dma routines have an access equivalent during map, I don't
> think it is out of place..
>
> To my mind, the map is the point where the MR should crystallize into
> an rkey or lkey MR, not at the post.

I'm not sure I understand why the lkey/rkey should be set at the map
routine. To me, it seems more natural to map_mr_sg and then either
register the lkey or the rkey.

It's easy enough to move the key arg to ib_map_mr_sg, but I don't see a
good reason why at the moment.
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/24/2015 7:18 PM, Steve Wise wrote:
> This is in preparation for adding new FRMR-only IO handlers
> for devices that support FRMR and not PI.

Steve,

I've given this some thought and I think we should avoid splitting
logic from PI and iWARP. The reason (other than code duplication) is
that currently the iser target supports only up to 1MB IOs. I have some
code (not done yet) to support larger IOs by using multiple
registrations per IO (with or without PI).

With a little tweaking I think we can get iwarp to fit in too...

So, do you mind if I take a crack at it?
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On Sun, Jul 26, 2015 at 01:08:16PM +0300, Sagi Grimberg wrote:
> I've given this some thought and I think we should avoid splitting
> logic from PI and iWARP. The reason (other than code duplication) is
> that currently the iser target support only up to 1MB IOs. I have some
> code (not done yet) to support larger IOs by using multiple
> registrations per IO (with or without PI).

Just curious: How is this going to work with iSER only having a single
rkey/offset/len field?
Re: [PATCH] mlx5: Expose correct page_size_cap in device attributes
On 7/24/2015 12:48 AM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 05:41:38PM -0400, Doug Ledford wrote:
>> I assume this prevents the driver from working at all on certain
>> arches (like ppc with 64k page size)?
>
> Nothing uses page_size_cap correctly, so it has no impact.
>
> Sagi, that is a good point, your generic code for the cleanup series
> really should check that PAGE_SIZE is in page_size_cap and at least
> fail the mr allocation if it isn't...

Yea, that's doable...
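[Editorial sketch] Assuming page_size_cap is a bitmask in which each set
bit marks a supported page size, the check suggested above reduces to
testing whether the PAGE_SIZE bit is set. The function name
mock_check_page_size() is hypothetical and the code is a userspace
illustration only.

```c
#include <errno.h>
#include <stdint.h>

/* Returns 0 if page_size is a power of two whose bit is set in
 * page_size_cap, -EINVAL otherwise. Assumes page_size_cap is a bitmask
 * of supported page sizes. */
static int mock_check_page_size(uint64_t page_size_cap, uint64_t page_size)
{
	/* Reject zero and anything that is not a single power of two. */
	if (page_size == 0 || (page_size & (page_size - 1)))
		return -EINVAL;

	if (!(page_size_cap & page_size))
		return -EINVAL;

	return 0;
}
```

A device that only advertises 4K (page_size_cap = 0x1000) would then fail
MR allocation on a 64K-page system, which is the ppc case Doug raises.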
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/26/2015 1:43 PM, Christoph Hellwig wrote:
> On Sun, Jul 26, 2015 at 01:08:16PM +0300, Sagi Grimberg wrote:
>> I've given this some thought and I think we should avoid splitting
>> logic from PI and iWARP. The reason (other than code duplication) is
>> that currently the iser target support only up to 1MB IOs. I have some
>> code (not done yet) to support larger IOs by using multiple
>> registrations per IO (with or without PI).
>
> Just curious: How is this going to work with iSER only having a single
> rkey/offset/len field?

Good question,

On the wire iser sends a single rkey, but the target is allowed to
transfer the data however it wants to.

Say that the local target HCA supports only 32 pages (128K bytes for 4K
pages) registration and the initiator sent:
rkey=0x1234
address=0x
length=512K

The target would allocate a 512K buffer and:
register offset 0-128K to lkey=0x1
register offset 128K-256K to lkey=0x2
register offset 256K-384K to lkey=0x3
register offset 384K-512K to lkey=0x4

then constructs sg_list as:
sg_list[0] = {addr=buf, length=128K, lkey=0x1}
sg_list[1] = {addr=buf+128K, length=128K, lkey=0x2}
sg_list[2] = {addr=buf+256K, length=128K, lkey=0x3}
sg_list[3] = {addr=buf+384K, length=128K, lkey=0x4}

Then set rdma_read wr with:
rdma_r_wr.sg_list=sg_list
rdma_r_wr.rdma.addr=0x
rdma_r_wr.rdma.rkey=0x1234

post_send(rdma_r_wr);

Ideally, the post contains a chain of all 4 registrations and the
rdma_read (and an opportunistic good scsi response).
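[Editorial sketch] The splitting step in the example above can be
illustrated as a small helper that carves one buffer into per-registration
chunks. Everything here is hypothetical: struct mock_sge and
mock_split_regions() are made-up names, and the lkeys are placeholders
numbered consecutively from first_lkey, whereas real lkeys would come from
each registration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter/gather element mirroring the fields in the
 * example: address, length and local key. */
struct mock_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Splits [buf, buf + total_len) into chunks of at most max_reg_len
 * bytes. Returns the number of entries written, or 0 if max_reg_len is
 * zero or more than max_sge entries would be needed. */
static size_t mock_split_regions(uint64_t buf, uint32_t total_len,
				 uint32_t max_reg_len, uint32_t first_lkey,
				 struct mock_sge *sg_list, size_t max_sge)
{
	size_t n = 0;
	uint32_t off = 0;

	if (max_reg_len == 0)
		return 0;

	while (off < total_len) {
		uint32_t chunk = total_len - off;

		if (chunk > max_reg_len)
			chunk = max_reg_len;
		if (n == max_sge)
			return 0;	/* caller's sg_list is too small */

		sg_list[n].addr = buf + off;
		sg_list[n].length = chunk;
		sg_list[n].lkey = first_lkey + (uint32_t)n; /* placeholder key */
		n++;
		off += chunk;
	}

	return n;
}
```

For a 512K buffer and a 128K registration limit this yields four entries
at offsets 0, 128K, 256K and 384K, matching the sg_list in the message.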
Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API
> I would like to see the kdoc for ib_map_mr_sg explain exactly what is
> required of the caller, maybe just hoist this bit from the
> ib_sg_to_pages

I'll add the kdoc.
Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API
On 7/23/2015 8:55 PM, Jason Gunthorpe wrote:
> On Thu, Jul 23, 2015 at 01:15:16PM +0300, Sagi Grimberg wrote:
>>> I was hoping we'd move the DMA flush and translate into here and
>>> make it mandatory. Is there any reason not to do that?
>>
>> The reason I didn't add it in was so the ULPs can make sure they meet
>> the restrictions of ib_map_mr_sg(). Allow SRP to iterate on his
>> SG list set partials and iSER to detect gaps (they need to dma map
>> for that).
>
> The ULP can always get the sg list's virtual address to check for
> gaps. Page aligned gaps are always OK.

I guess I can pull DMA mapping in there, but we will need an opposite
routine ib_umap_mr_sg() since it'll be weird if the ULP will do dma
unmap without doing the map...

> BTW, the logic in ib_sg_to_pages should be checking that directly, as
> coded, it won't work with swiotlb:
>
> // Only the first SG entry can start unaligned
> if (i && page_addr != dma_addr)
>     return EINVAL;
> // Only the last SG entry can end unaligned
> if ((page_addr + dma_len) & PAGE_MASK != end_dma_addr)
>     if (!is_last)
>         return EINVAL;
>
> Don't use sg->offset after dma mapping.
>
> The biggest problem with checking the virtual address is
> swiotlb. However, if swiotlb is used this API is basically broken as
> swiotlb downgrades everything to a 2k alignment, which means we only
> ever get 1 s/g entry.

Can you explain what do you mean by downgrades everything to a 2k
alignment? If the ULP is responsible for a PAGE_SIZE alignment then
how would this get out of alignment with swiotlb?
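[Editorial sketch] The rule in the quoted pseudocode (only the first SG
entry may start unaligned, only the last may end unaligned) can be written
as a standalone check over DMA addresses. This is an illustration under
assumptions, not the ib_sg_to_pages code: struct mock_dma_seg,
mock_check_sg_alignment() and the fixed 4K MOCK_PAGE_SIZE are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_PAGE_SIZE 4096ULL			/* assumed page size */
#define MOCK_PAGE_MASK (~(MOCK_PAGE_SIZE - 1))

/* Hypothetical DMA-mapped segment: bus address and length only. */
struct mock_dma_seg {
	uint64_t dma_addr;
	uint32_t dma_len;
};

/* Returns 0 if the segments can be described by one page list without
 * gaps inside a page, -EINVAL otherwise. */
static int mock_check_sg_alignment(const struct mock_dma_seg *sg,
				   size_t nents)
{
	size_t i;

	for (i = 0; i < nents; i++) {
		uint64_t start = sg[i].dma_addr;
		uint64_t end = start + sg[i].dma_len;
		int is_last = (i == nents - 1);

		/* Only the first entry can start unaligned. */
		if (i && (start & ~MOCK_PAGE_MASK))
			return -EINVAL;

		/* Only the last entry can end unaligned. */
		if (!is_last && (end & ~MOCK_PAGE_MASK))
			return -EINVAL;
	}

	return 0;
}
```

The check works on the DMA address and length alone, in line with the
advice above not to use sg->offset after DMA mapping.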
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
>> Ideally, the post contains a chain of all 4 registrations and the
>> rdma_read (and an opportunistic good scsi response).
>
> Just to be clear: This example is for IB only, correct? IW would
> require rkeys with REMOTE_WRITE and 4 read wrs.

My assumption is that it would depend on max_sge_rd.

IB only? iWARP by definition isn't capable of doing rdma_read to
more than one scatter? Anyway, we'll need to calculate the number
of RDMA_READs.

> And you're ignoring invalidation wrs (or read-with-inv) in the
> example...

Yes, didn't want to inflate the example too much...
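[Editorial sketch] If the number of RDMA_READ work requests depends on
max_sge_rd as assumed in this message, it is the number of local SG
entries divided by max_sge_rd, rounded up. The helper name
mock_num_rdma_reads() is hypothetical.

```c
/* Returns how many RDMA_READ work requests are needed to cover nr_sge
 * local scatter entries when each read can target at most max_sge_rd of
 * them. Returns 0 when max_sge_rd is 0 (no valid answer). */
static unsigned int mock_num_rdma_reads(unsigned int nr_sge,
					unsigned int max_sge_rd)
{
	if (max_sge_rd == 0)
		return 0;

	return (nr_sge + max_sge_rd - 1) / max_sge_rd;	/* round up */
}
```

For the four-entry sg_list from the earlier example, a device with
max_sge_rd = 1 would need four reads, while a device with max_sge_rd >= 4
could use a single read.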
[PATCH v2 12/13] IB/cma: Share ib_cm_ids between rdma_cm_ids
Use ib_cm_insert_listen to create listening IB CM IDs or share existing
ones if needed. When given a request on a specific CM ID, the code now
matches the request to the RDMA CM ID based on the request parameters,
so it no longer needs to rely on the ib_cm's private data matching
capabilities.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/cma.c | 60 ---
 1 file changed, 5 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 1c43b58a8eb2..ca547ff2bb95 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1765,42 +1765,6 @@ __be64 rdma_get_service_id(struct rdma_cm_id *id, struct sockaddr *addr)
 }
 EXPORT_SYMBOL(rdma_get_service_id);
 
-static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr,
-				 struct ib_cm_compare_data *compare)
-{
-	struct cma_hdr *cma_data, *cma_mask;
-	__be32 ip4_addr;
-	struct in6_addr ip6_addr;
-
-	memset(compare, 0, sizeof *compare);
-	cma_data = (void *) compare->data;
-	cma_mask = (void *) compare->mask;
-
-	switch (addr->sa_family) {
-	case AF_INET:
-		ip4_addr = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
-		cma_set_ip_ver(cma_data, 4);
-		cma_set_ip_ver(cma_mask, 0xF);
-		if (!cma_any_addr(addr)) {
-			cma_data->dst_addr.ip4.addr = ip4_addr;
-			cma_mask->dst_addr.ip4.addr = htonl(~0);
-		}
-		break;
-	case AF_INET6:
-		ip6_addr = ((struct sockaddr_in6 *) addr)->sin6_addr;
-		cma_set_ip_ver(cma_data, 6);
-		cma_set_ip_ver(cma_mask, 0xF);
-		if (!cma_any_addr(addr)) {
-			cma_data->dst_addr.ip6 = ip6_addr;
-			memset(&cma_mask->dst_addr.ip6, 0xFF,
-			       sizeof cma_mask->dst_addr.ip6);
-		}
-		break;
-	default:
-		break;
-	}
-}
-
 static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 {
 	struct rdma_id_private *id_priv = iw_id->context;
@@ -1954,33 +1918,19 @@ out:
 
 static int cma_ib_listen(struct rdma_id_private *id_priv)
 {
-	struct ib_cm_compare_data compare_data;
 	struct sockaddr *addr;
 	struct ib_cm_id	*id;
 	__be64 svc_id;
-	int ret;
 
-	id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv);
+	addr = cma_src_addr(id_priv);
+	svc_id = rdma_get_service_id(&id_priv->id, addr);
+	id = ib_cm_insert_listen(id_priv->id.device, cma_req_handler, svc_id,
+				 0);
 	if (IS_ERR(id))
 		return PTR_ERR(id);
-
 	id_priv->cm_id.ib = id;
 
-	addr = cma_src_addr(id_priv);
-	svc_id = rdma_get_service_id(&id_priv->id, addr);
-	if (cma_any_addr(addr) && !id_priv->afonly)
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL);
-	else {
-		cma_set_compare_data(id_priv->id.ps, addr, &compare_data);
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data);
-	}
-
-	if (ret) {
-		ib_destroy_cm_id(id_priv->cm_id.ib);
-		id_priv->cm_id.ib = NULL;
-	}
-
-	return ret;
+	return 0;
 }
 
 static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog)
-- 
1.7.11.2
[PATCH v2 09/13] IB/cma: Add net_dev and private data checks to RDMA CM
Instead of relying on a the ib_cm module to check an incoming CM request's private data header, add these checks to the RDMA CM module. This allows a following patch to to clean up the ib_cm interface and remove the code that looks into the private headers. It will also allow supporting namespaces in RDMA CM by making these checks namespace aware later on. Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/cma.c | 184 +- 1 file changed, 181 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index f2d799209412..ed3d63ad94ac 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -300,7 +300,7 @@ static enum rdma_cm_state cma_exch(struct rdma_id_private *id_priv, return old; } -static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) +static inline u8 cma_get_ip_ver(const struct cma_hdr *hdr) { return hdr-ip_version 4; } @@ -1016,7 +1016,7 @@ static int cma_save_ip_info(struct sockaddr *src_addr, cma_save_ip6_info(src_addr, dst_addr, hdr, port); break; default: - return -EINVAL; + return -EAFNOSUPPORT; } return 0; @@ -1040,6 +1040,181 @@ static int cma_save_net_info(struct sockaddr *src_addr, return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); } +struct cma_req_info { + struct ib_device *device; + int port; + const union ib_gid *local_gid; + __be64 service_id; + u16 pkey; +}; + +static int cma_save_req_info(const struct ib_cm_event *ib_event, +struct cma_req_info *req) +{ + const struct ib_cm_req_event_param *req_param = + ib_event-param.req_rcvd; + const struct ib_cm_sidr_req_event_param *sidr_param = + ib_event-param.sidr_req_rcvd; + + switch (ib_event-event) { + case IB_CM_REQ_RECEIVED: + req-device = req_param-listen_id-device; + req-port = req_param-port; + req-local_gid = req_param-primary_path-sgid; + req-service_id = req_param-primary_path-service_id; + req-pkey = req_param-bth_pkey; + break; + case IB_CM_SIDR_REQ_RECEIVED: + req-device = 
sidr_param-listen_id-device; + req-port = sidr_param-port; + req-local_gid = NULL; + req-service_id = sidr_param-service_id; + req-pkey = sidr_param-bth_pkey; + break; + default: + return -EINVAL; + } + + return 0; +} + +static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, + const struct cma_req_info *req) +{ + struct sockaddr_storage listen_addr_storage; + struct sockaddr *listen_addr = (struct sockaddr *)listen_addr_storage; + struct net_device *net_dev; + int err; + + err = cma_save_ip_info(listen_addr, NULL, ib_event, req-service_id); + if (err) + return ERR_PTR(err); + + net_dev = ib_get_net_dev_by_params(req-device, req-port, req-pkey, + req-local_gid, listen_addr); + if (!net_dev) + return ERR_PTR(-ENODEV); + + return net_dev; +} + +static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id) +{ + return (be64_to_cpu(service_id) 16) 0x; +} + +static bool cma_match_private_data(struct rdma_id_private *id_priv, + const struct cma_hdr *hdr) +{ + struct sockaddr *addr = cma_src_addr(id_priv); + __be32 ip4_addr; + struct in6_addr ip6_addr; + + if (cma_any_addr(addr) !id_priv-afonly) + return true; + + switch (addr-sa_family) { + case AF_INET: + ip4_addr = ((struct sockaddr_in *)addr)-sin_addr.s_addr; + if (cma_get_ip_ver(hdr) != 4) + return false; + if (!cma_any_addr(addr) + hdr-dst_addr.ip4.addr != ip4_addr) + return false; + break; + case AF_INET6: + ip6_addr = ((struct sockaddr_in6 *)addr)-sin6_addr; + if (cma_get_ip_ver(hdr) != 6) + return false; + if (!cma_any_addr(addr) + memcmp(hdr-dst_addr.ip6, ip6_addr, sizeof(ip6_addr))) + return false; + break; + case AF_IB: + return true; + default: + return false; + } + + return true; +} + +static bool cma_match_net_dev(const struct rdma_id_private *id_priv, + const struct net_device *net_dev) +{ + const struct rdma_addr *addr = id_priv-id.route.addr; + + if (!net_dev) + /*
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/26/2015 12:40 PM, Sagi Grimberg wrote:
>>> Ideally, the post contains a chain of all 4 registrations and the
>>> rdma_read (and an opportunistic good scsi response).
>>
>> Just to be clear: This example is for IB only, correct? IW would
>> require rkeys with REMOTE_WRITE and 4 read wrs.
>
> My assumption is that it would depend on max_sge_rd.

yea.

> IB only? iWARP by definition isn't capable of doing rdma_read to
> more than one scatter? Anyway, we'll need to calculate the number
> of RDMA_READs.

The wire protocol limits the destination to a single stag/to/len (aka
rkey/addr/len). Devices/fw/sw could implement some magic to support a
single stag/to/len that maps to a scatter gather list of
stags/tos/lens.

>> And you're ignoring invalidation wrs (or read-with-inv) in the
>> example...
>
> Yes, didn't want to inflate the example too much...
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/26/2015 5:08 AM, Sagi Grimberg wrote:
> On 7/24/2015 7:18 PM, Steve Wise wrote:
>> This is in preparation for adding new FRMR-only IO handlers
>> for devices that support FRMR and not PI.
>
> Steve,
>
> I've given this some thought and I think we should avoid splitting
> logic from PI and iWARP. The reason (other than code duplication) is
> that currently the iser target support only up to 1MB IOs. I have some
> code (not done yet) to support larger IOs by using multiple
> registrations per IO (with or without PI).
>
> With a little tweaking I think we can get iwarp to fit in too...
>
> So, do you mind if I take a crack at it?

Sure, go ahead. Let me know how I can help. Certainly I can test it
for you. I'm very keen to get this in for 4.3 if possible...
[PATCH v2 13/13] IB/cm: Remove compare_data checks
Now that there are no ib_cm clients using the compare_data feature for matching IB CM requests' private data, remove the compare_data parameter of ib_cm_listen and remove the code implementing the feature. Signed-off-by: Haggai Eran hagg...@mellanox.com --- drivers/infiniband/core/cm.c| 109 ++-- drivers/infiniband/core/ucm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +- include/rdma/ib_cm.h| 14 +--- 5 files changed, 23 insertions(+), 107 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index a05c17b336aa..73803a55edd6 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -222,7 +222,6 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; - struct ib_cm_compare_data *compare_data; void *private_data; __be64 tid; @@ -443,40 +442,6 @@ static struct cm_id_private * cm_acquire_id(__be32 local_id, __be32 remote_id) return cm_id_priv; } -static void cm_mask_copy(u32 *dst, const u32 *src, const u32 *mask) -{ - int i; - - for (i = 0; i IB_CM_COMPARE_SIZE; i++) - dst[i] = src[i] mask[i]; -} - -static int cm_compare_data(struct ib_cm_compare_data *src_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - u32 dst[IB_CM_COMPARE_SIZE]; - - if (!src_data || !dst_data) - return 0; - - cm_mask_copy(src, src_data-data, dst_data-mask); - cm_mask_copy(dst, dst_data-data, src_data-mask); - return memcmp(src, dst, sizeof(src)); -} - -static int cm_compare_private_data(u32 *private_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - - if (!dst_data) - return 0; - - cm_mask_copy(src, private_data, dst_data-mask); - return memcmp(src, dst_data-data, sizeof(src)); -} - /* * Trivial helpers to strip endian annotation and compare; the * endianness doesn't actually matter since we just need a stable @@ -509,18 +474,14 @@ static struct cm_id_private * 
cm_insert_listen(struct cm_id_private *cm_id_priv) struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv-id.service_id; __be64 service_mask = cm_id_priv-id.service_mask; - int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); - data_cmp = cm_compare_data(cm_id_priv-compare_data, - cur_cm_id_priv-compare_data); if ((cur_cm_id_priv-id.service_mask service_id) == (service_mask cur_cm_id_priv-id.service_id) - (cm_id_priv-id.device == cur_cm_id_priv-id.device) - !data_cmp) + (cm_id_priv-id.device == cur_cm_id_priv-id.device)) return cur_cm_id_priv; if (cm_id_priv-id.device cur_cm_id_priv-id.device) @@ -531,8 +492,6 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) link = (*link)-rb_left; else if (be64_gt(service_id, cur_cm_id_priv-id.service_id)) link = (*link)-rb_right; - else if (data_cmp 0) - link = (*link)-rb_left; else link = (*link)-rb_right; } @@ -542,20 +501,16 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id, -u32 *private_data) +__be64 service_id) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; - int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); - data_cmp = cm_compare_private_data(private_data, - cm_id_priv-compare_data); if ((cm_id_priv-id.service_mask service_id) == cm_id_priv-id.service_id - (cm_id_priv-id.device == device) !data_cmp) + (cm_id_priv-id.device == device)) return cm_id_priv; if (device cm_id_priv-id.device) @@ -566,8 +521,6 @@ static struct cm_id_private * cm_find_listen(struct ib_device *device, node = node-rb_left; else if (be64_gt(service_id,
[PATCH v2 10/13] IB/cma: Validate routing of incoming requests
Pass incoming request parameters through the relevant IPv4/IPv6 routing
tables and make sure the network stack is configured to handle such
requests.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/cma.c | 95 +--
 1 file changed, 92 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index ed3d63ad94ac..42f412fde064 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -46,6 +46,8 @@
 
 #include <net/tcp.h>
 #include <net/ipv6.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
 
 #include <rdma/rdma_cm.h>
 #include <rdma/rdma_cm_ib.h>
@@ -1078,15 +1080,97 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event,
 	return 0;
 }
 
+static bool validate_ipv4_net_dev(struct net_device *net_dev,
+				  const struct sockaddr_in *dst_addr,
+				  const struct sockaddr_in *src_addr)
+{
+	__be32 daddr = dst_addr->sin_addr.s_addr,
+	       saddr = src_addr->sin_addr.s_addr;
+	struct fib_result res;
+	struct flowi4 fl4;
+	int err;
+	bool ret;
+
+	if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr) ||
+	    ipv4_is_lbcast(daddr) || ipv4_is_zeronet(saddr) ||
+	    ipv4_is_zeronet(daddr) || ipv4_is_loopback(daddr) ||
+	    ipv4_is_loopback(saddr))
+		return false;
+
+	memset(&fl4, 0, sizeof(fl4));
+	fl4.flowi4_iif = net_dev->ifindex;
+	fl4.daddr = daddr;
+	fl4.saddr = saddr;
+
+	rcu_read_lock();
+	err = fib_lookup(dev_net(net_dev), &fl4, &res, 0);
+	if (err)
+		return false;
+
+	ret = FIB_RES_DEV(res) == net_dev;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static bool validate_ipv6_net_dev(struct net_device *net_dev,
+				  const struct sockaddr_in6 *dst_addr,
+				  const struct sockaddr_in6 *src_addr)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+	const int strict = ipv6_addr_type(&dst_addr->sin6_addr) &
+			   IPV6_ADDR_LINKLOCAL;
+	struct rt6_info *rt = rt6_lookup(dev_net(net_dev), &dst_addr->sin6_addr,
+					 &src_addr->sin6_addr, net_dev->ifindex,
+					 strict);
+	bool ret;
+
+	if (!rt)
+		return false;
+
+	ret = rt->rt6i_idev->dev == net_dev;
+	ip6_rt_put(rt);
+
+	return ret;
+#else
+	return false;
+#endif
+}
+
+static bool validate_net_dev(struct net_device *net_dev,
+			     const struct sockaddr *daddr,
+			     const struct sockaddr *saddr)
+{
+	const struct sockaddr_in *daddr4 = (const struct sockaddr_in *)daddr;
+	const struct sockaddr_in *saddr4 = (const struct sockaddr_in *)saddr;
+	const struct sockaddr_in6 *daddr6 = (const struct sockaddr_in6 *)daddr;
+	const struct sockaddr_in6 *saddr6 = (const struct sockaddr_in6 *)saddr;
+
+	switch (daddr->sa_family) {
+	case AF_INET:
+		return saddr->sa_family == AF_INET &&
+		       validate_ipv4_net_dev(net_dev, daddr4, saddr4);
+
+	case AF_INET6:
+		return saddr->sa_family == AF_INET6 &&
+		       validate_ipv6_net_dev(net_dev, daddr6, saddr6);
+
+	default:
+		return false;
+	}
+}
+
 static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
 					  const struct cma_req_info *req)
 {
-	struct sockaddr_storage listen_addr_storage;
-	struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage;
+	struct sockaddr_storage listen_addr_storage, src_addr_storage;
+	struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage,
+			*src_addr = (struct sockaddr *)&src_addr_storage;
 	struct net_device *net_dev;
 	int err;
 
-	err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id);
+	err = cma_save_ip_info(listen_addr, src_addr, ib_event,
+			       req->service_id);
 	if (err)
 		return ERR_PTR(err);
 
@@ -1095,6 +1179,11 @@ static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
 	if (!net_dev)
 		return ERR_PTR(-ENODEV);
 
+	if (!validate_net_dev(net_dev, listen_addr, src_addr)) {
+		dev_put(net_dev);
+		return ERR_PTR(-EHOSTUNREACH);
+	}
+
 	return net_dev;
 }
 
-- 
1.7.11.2
[PATCH v2 01/13] IB/core: lock client data with lists_rwsem
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove(), without that lock.

Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.

Since ib_unregister_client() now marks the client data context, there is no
need for remove() to search the context again, so pass the client data
directly to remove() callbacks.

Cc: Jason Gunthorpe <jguntho...@obsidianresearch.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/cache.c           |  2 +-
 drivers/infiniband/core/cm.c              |  7 ++--
 drivers/infiniband/core/cma.c             |  7 ++--
 drivers/infiniband/core/device.c          | 53 +--
 drivers/infiniband/core/mad.c             |  2 +-
 drivers/infiniband/core/multicast.c       |  7 ++--
 drivers/infiniband/core/sa_query.c        |  6 ++--
 drivers/infiniband/core/ucm.c             |  6 ++--
 drivers/infiniband/core/user_mad.c        |  6 ++--
 drivers/infiniband/core/uverbs_main.c     |  6 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  7 ++--
 drivers/infiniband/ulp/srp/ib_srp.c       |  6 ++--
 drivers/infiniband/ulp/srpt/ib_srpt.c     |  5 ++-
 include/rdma/ib_verbs.h                   |  4 ++-
 net/rds/ib.c                              |  5 ++-
 net/rds/iw.c                              |  5 ++-
 16 files changed, 82 insertions(+), 52 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 871da832d016..c93af66cc091 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -394,7 +394,7 @@ err:
 	kfree(device->cache.lmc_cache);
 }
 
-static void ib_cache_cleanup_one(struct ib_device *device)
+static void ib_cache_cleanup_one(struct ib_device *device, void *client_data)
 {
 	int p;
 
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 3a972ebf3c0d..82d5c4362aa8 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -58,7 +58,7 @@ MODULE_DESCRIPTION("InfiniBand CM");
 MODULE_LICENSE("Dual BSD/GPL");
 
 static void cm_add_one(struct ib_device *device);
-static void cm_remove_one(struct ib_device *device);
+static void cm_remove_one(struct ib_device *device, void *client_data);
 
 static struct ib_client cm_client = {
 	.name = "cm",
@@ -3886,9 +3886,9 @@ free:
 	kfree(cm_dev);
 }
 
-static void cm_remove_one(struct ib_device *ib_device)
+static void cm_remove_one(struct ib_device *ib_device, void *client_data)
 {
-	struct cm_device *cm_dev;
+	struct cm_device *cm_dev = client_data;
 	struct cm_port *port;
 	struct ib_port_modify port_modify = {
 		.clr_port_cap_mask = IB_PORT_CM_SUP
@@ -3896,7 +3896,6 @@ static void cm_remove_one(struct ib_device *ib_device)
 	unsigned long flags;
 	int i;
 
-	cm_dev = ib_get_client_data(ib_device, &cm_client);
 	if (!cm_dev)
 		return;
 
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 143ded2bbe7c..6b6cdfa5d231 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -94,7 +94,7 @@ const char *rdma_event_msg(enum rdma_cm_event_type event)
 EXPORT_SYMBOL(rdma_event_msg);
 
 static void cma_add_one(struct ib_device *device);
-static void cma_remove_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device, void *client_data);
 
 static struct ib_client cma_client = {
 	.name = "cma",
@@ -3551,11 +3551,10 @@ static void cma_process_remove(struct cma_device *cma_dev)
 	wait_for_completion(&cma_dev->comp);
 }
 
-static void cma_remove_one(struct ib_device *device)
+static void cma_remove_one(struct ib_device *device, void *client_data)
 {
-	struct cma_device *cma_dev;
+	struct cma_device *cma_dev = client_data;
 
-	cma_dev = ib_get_client_data(device, &cma_client);
 	if (!cma_dev)
 		return;
 
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index f08d438205ed..623d8e191ced 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -50,6 +50,9 @@ struct ib_client_data {
 	struct list_head list;
 	struct ib_client *client;
 	void *data;
+	/* The device or client is going down. Do not call client or device
+	 * callbacks other than remove(). */
+	bool going_down;
 };
 
 struct workqueue_struct *ib_wq;
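The teardown order described in the commit message (flag the context first, then hand the client its own data) can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: struct client, struct client_data, unregister_all() and demo_remove() are hypothetical stand-ins, and all locking is left out (the patch sets going_down with lists_rwsem held for write).

```c
#include <stdbool.h>
#include <stddef.h>

struct device;			/* opaque stand-in for struct ib_device */

struct client {
	const char *name;
	/* remove() receives the per-device data directly. */
	void (*remove)(struct device *dev, void *client_data);
};

struct client_data {
	struct client_data *next;
	struct client *client;
	void *data;
	bool going_down;	/* set once teardown has started */
};

/*
 * Tear down every client context attached to a device.
 * Locking is omitted in this sketch.
 */
static int unregister_all(struct device *dev, struct client_data *head)
{
	int removed = 0;

	/* Pass 1: flag every context so other callbacks skip it. */
	for (struct client_data *c = head; c; c = c->next)
		c->going_down = true;

	/* Pass 2: hand each client its own data; no lookup is needed. */
	for (struct client_data *c = head; c; c = c->next) {
		if (c->client->remove) {
			c->client->remove(dev, c->data);
			removed++;
		}
	}
	return removed;
}

/* Example remove() callback; NULL client_data means nothing was set up. */
static void demo_remove(struct device *dev, void *client_data)
{
	int *state = client_data;

	(void)dev;
	if (!state)
		return;
	*state = 0;		/* stand-in for releasing resources */
}
```

The NULL check in demo_remove() mirrors the "if (!cm_dev) return;" test that the converted callbacks keep after the lookup is dropped.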
[PATCH v2 06/13] IB/cma: Refactor RDMA IP CM private-data parsing code
When receiving a connection request, rdma_cm needs to associate the request
with a network device, in order to disambiguate requests. To do this, it
needs to know the request's destination IP. For this the module needs to
allow getting this information from the private data in the request packet,
instead of relying on the information already being in the listening RDMA
CM ID.

When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.

Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/core/cma.c | 170 ++
 1 file changed, 105 insertions(+), 65 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 6b6cdfa5d231..cf5c48b0b7d5 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -870,107 +870,138 @@ static inline int cma_any_port(struct sockaddr *addr)
 	return !cma_port(addr);
 }
 
-static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
+static void cma_save_ib_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct rdma_cm_id *listen_id,
 			     struct ib_sa_path_rec *path)
 {
 	struct sockaddr_ib *listen_ib, *ib;
 
 	listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr;
-	ib = (struct sockaddr_ib *) &id->route.addr.src_addr;
-	ib->sib_family = listen_ib->sib_family;
-	if (path) {
-		ib->sib_pkey = path->pkey;
-		ib->sib_flowinfo = path->flow_label;
-		memcpy(&ib->sib_addr, &path->sgid, 16);
-	} else {
-		ib->sib_pkey = listen_ib->sib_pkey;
-		ib->sib_flowinfo = listen_ib->sib_flowinfo;
-		ib->sib_addr = listen_ib->sib_addr;
-	}
-	ib->sib_sid = listen_ib->sib_sid;
-	ib->sib_sid_mask = cpu_to_be64(0xULL);
-	ib->sib_scope_id = listen_ib->sib_scope_id;
-
-	if (path) {
-		ib = (struct sockaddr_ib *) &id->route.addr.dst_addr;
-		ib->sib_family = listen_ib->sib_family;
-		ib->sib_pkey = path->pkey;
-		ib->sib_flowinfo = path->flow_label;
-		memcpy(&ib->sib_addr, &path->dgid, 16);
+	if (src_addr) {
+		ib = (struct sockaddr_ib *)src_addr;
+		ib->sib_family = AF_IB;
+		if (path) {
+			ib->sib_pkey = path->pkey;
+			ib->sib_flowinfo = path->flow_label;
+			memcpy(&ib->sib_addr, &path->sgid, 16);
+			ib->sib_sid = path->service_id;
+			ib->sib_scope_id = 0;
+		} else {
+			ib->sib_pkey = listen_ib->sib_pkey;
+			ib->sib_flowinfo = listen_ib->sib_flowinfo;
+			ib->sib_addr = listen_ib->sib_addr;
+			ib->sib_sid = listen_ib->sib_sid;
+			ib->sib_scope_id = listen_ib->sib_scope_id;
+		}
+		ib->sib_sid_mask = cpu_to_be64(0xULL);
+	}
+	if (dst_addr) {
+		ib = (struct sockaddr_ib *)dst_addr;
+		ib->sib_family = AF_IB;
+		if (path) {
+			ib->sib_pkey = path->pkey;
+			ib->sib_flowinfo = path->flow_label;
+			memcpy(&ib->sib_addr, &path->dgid, 16);
+		}
 	}
 }
 
-static __be16 ss_get_port(const struct sockaddr_storage *ss)
-{
-	if (ss->ss_family == AF_INET)
-		return ((struct sockaddr_in *)ss)->sin_port;
-	else if (ss->ss_family == AF_INET6)
-		return ((struct sockaddr_in6 *)ss)->sin6_port;
-	BUG();
-}
-
-static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
-			      struct cma_hdr *hdr)
+static void cma_save_ip4_info(struct sockaddr *src_addr,
+			      struct sockaddr *dst_addr,
+			      struct cma_hdr *hdr,
+			      __be16 local_port)
 {
 	struct sockaddr_in *ip4;
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.src_addr;
-	ip4->sin_family = AF_INET;
-	ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
-	ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr);
+	if (src_addr) {
+		ip4 = (struct sockaddr_in *)src_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
+		ip4->sin_port = local_port;
+	}
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr;
-	ip4->sin_family = AF_INET;
-	ip4->sin_addr.s_addr =
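Reading the port from the requested service ID, as the commit message describes, comes down to a couple of shifts and masks. The sketch below assumes the RDMA IP CM service ID keeps the port in its low 16 bits and the port space in the next 16 bits, and that the value has already been converted to host byte order (the kernel helpers work on __be64). The demo_* names and the DEMO_PS_TCP value are illustrative assumptions, not the kernel's definitions.

```c
#include <stdint.h>

/* Assumed port space value for TCP-style RDMA CM connections. */
#define DEMO_PS_TCP 0x0106u

/* Low 16 bits: the port the request was sent to. */
static uint16_t demo_port_from_service_id(uint64_t service_id)
{
	return (uint16_t)(service_id & 0xffffu);
}

/* Bits 16..31: the port space. */
static uint16_t demo_ps_from_service_id(uint64_t service_id)
{
	return (uint16_t)((service_id >> 16) & 0xffffu);
}

/* Inverse helper, used here only to build a test value. */
static uint64_t demo_make_service_id(uint16_t ps, uint16_t port)
{
	return ((uint64_t)ps << 16) | port;
}
```

For example, demo_make_service_id(DEMO_PS_TCP, 2049) yields 0x01060801, from which both helpers recover their field.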
[PATCH v2 04/13] IB/cm: Expose service ID in request events
Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.

Acked-by: Sean Hefty <sean.he...@intel.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/cm.c | 3 +++
 include/rdma/ib_cm.h         | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 82d5c4362aa8..93e9e2f34fc6 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1268,6 +1268,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 	primary_path->packet_life_time =
 		cm_req_get_primary_local_ack_timeout(req_msg);
 	primary_path->packet_life_time -= (primary_path->packet_life_time > 0);
+	primary_path->service_id = req_msg->service_id;
 
 	if (req_msg->alt_local_lid) {
 		memset(alt_path, 0, sizeof *alt_path);
@@ -1289,6 +1290,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 		alt_path->packet_life_time =
 			cm_req_get_alt_local_ack_timeout(req_msg);
 		alt_path->packet_life_time -= (alt_path->packet_life_time > 0);
+		alt_path->service_id = req_msg->service_id;
 	}
 }
 
@@ -2992,6 +2994,7 @@ static void cm_format_sidr_req_event(struct cm_work *work,
 	param = &work->cm_event.param.sidr_req_rcvd;
 	param->pkey = __be16_to_cpu(sidr_req_msg->pkey);
 	param->listen_id = listen_id;
+	param->service_id = sidr_req_msg->service_id;
 	param->port = work->port->port_num;
 	work->cm_event.private_data = &sidr_req_msg->private_data;
 }
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 39ed2d2fbd51..1b567bbc3ad4 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -223,6 +223,7 @@ struct ib_cm_apr_event_param {
 
 struct ib_cm_sidr_req_event_param {
 	struct ib_cm_id *listen_id;
+	__be64 service_id;
 	u8 port;
 	u16 pkey;
 };
-- 
1.7.11.2
Re: [PATCH v3 06/15] xprtrdma: Clean up rpcrdma_ia_open()
Jason has patches that provide a local_dma_lkey in the PD that is always
available. Do you need this cleanup for the next merge window? If not, it
might be worth postponing it to avoid merge conflicts, especially as I
assume the NFS changes will go in through Trond.

On Mon, Jul 20, 2015 at 03:03:20PM -0400, Chuck Lever wrote:
> Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
> is different for each registration method, to the .ro_open functions.
>
> This is refactoring only. No behavior change is expected.
>
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> Tested-by: Devesh Sharma <devesh.sha...@avagotech.com>
> ---
>  net/sunrpc/xprtrdma/fmr_ops.c      | 19 +++
>  net/sunrpc/xprtrdma/frwr_ops.c     |  5 +++
>  net/sunrpc/xprtrdma/physical_ops.c | 25 ++-
>  net/sunrpc/xprtrdma/verbs.c        | 60 +++-
>  net/sunrpc/xprtrdma/xprt_rdma.h    |  3 +-
>  5 files changed, 67 insertions(+), 45 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index f1e8daf..cb25c89 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -39,6 +39,25 @@ static int
>  fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>  	    struct rpcrdma_create_data_internal *cdata)
>  {
> +	struct ib_device_attr *devattr = &ia->ri_devattr;
> +	struct ib_mr *mr;
> +
> +	/* Obtain an lkey to use for the regbufs, which are
> +	 * protected from remote access.
> +	 */
> +	if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
> +		ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> +	} else {
> +		mr = ib_get_dma_mr(ia->ri_pd, IB_ACCESS_LOCAL_WRITE);
> +		if (IS_ERR(mr)) {
> +			pr_err("%s: ib_get_dma_mr for failed with %lX\n",
> +			       __func__, PTR_ERR(mr));
> +			return -ENOMEM;
> +		}
> +		ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
> +		ia->ri_dma_mr = mr;
> +	}
> +
>  	return 0;
>  }
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 04ea914..63f282e 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -189,6 +189,11 @@ frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>  	struct ib_device_attr *devattr = &ia->ri_devattr;
>  	int depth, delta;
>
> +	/* Obtain an lkey to use for the regbufs, which are
> +	 * protected from remote access.
> +	 */
> +	ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> +
>  	ia->ri_max_frmr_depth =
>  			min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
>  			      devattr->max_fast_reg_page_list_len);
> diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
> index 41985d0..72cf8b1 100644
> --- a/net/sunrpc/xprtrdma/physical_ops.c
> +++ b/net/sunrpc/xprtrdma/physical_ops.c
> @@ -23,6 +23,29 @@ static int
>  physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>  		 struct rpcrdma_create_data_internal *cdata)
>  {
> +	struct ib_device_attr *devattr = &ia->ri_devattr;
> +	struct ib_mr *mr;
> +
> +	/* Obtain an rkey to use for RPC data payloads.
> +	 */
> +	mr = ib_get_dma_mr(ia->ri_pd,
> +			   IB_ACCESS_LOCAL_WRITE |
> +			   IB_ACCESS_REMOTE_WRITE |
> +			   IB_ACCESS_REMOTE_READ);
> +	if (IS_ERR(mr)) {
> +		pr_err("%s: ib_get_dma_mr for failed with %lX\n",
> +		       __func__, PTR_ERR(mr));
> +		return -ENOMEM;
> +	}
> +	ia->ri_dma_mr = mr;
> +
> +	/* Obtain an lkey to use for regbufs.
> +	 */
> +	if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> +		ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> +	else
> +		ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
> +
>  	return 0;
>  }
>
> @@ -51,7 +74,7 @@ physical_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>  	struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>
>  	rpcrdma_map_one(ia->ri_device, seg, rpcrdma_data_dir(writing));
> -	seg->mr_rkey = ia->ri_bind_mem->rkey;
> +	seg->mr_rkey = ia->ri_dma_mr->rkey;
>  	seg->mr_base = seg->mr_dma;
>  	seg->mr_nsegs = 1;
>  	return 1;
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index da184f9..8516d98 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -493,9 +493,11 @@ rpcrdma_clean_cq(struct ib_cq *cq)
>  int
>  rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
>  {
> -	int rc, mem_priv;
>  	struct rpcrdma_ia *ia = &xprt->rx_ia;
>  	struct ib_device_attr *devattr = &ia->ri_devattr;
> +	int rc;
> +
> +	ia->ri_dma_mr = NULL;
>
>  	ia->ri_id = rpcrdma_create_id(xprt, ia, addr);
>  	if (IS_ERR(ia->ri_id)) {
> @@ -519,11 +521,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
>  		goto out3;
>  	}
Re: [PATCH V6 9/9] isert: Support iWARP transports using FRMRs
On 7/24/2015 10:24 PM, Jason Gunthorpe wrote:
> On Fri, Jul 24, 2015 at 01:48:09PM -0500, Steve Wise wrote:
>>> The use of FRWR for RDMA READ should be iWarp specific, IB shouldn't
>>> pay that overhead. I am expecting to see a cap_rdma_read_rkey or
>>> something in here ?
>>
>> Ok. But cap_rdma_read_rkey() doesn't really describe the requirement.
>> The requirement is rkey + REMOTE_WRITE. So it is more like
>> rdma_cap_read_requires_remote_write() which is ugly and too long (but
>> descriptive)...
>
> I don't care much what name you pick, just jam something like this in
> the description

I think we can just do

	if (signature || iwarp)
		use fastreg
	else
		use local_dma_lkey.

> If set then RDMA_READ must be performed by mapping the local
> buffers through a rkey MR with ACCESS_REMOTE_WRITE enabled.
> The rkey of this MR should be passed in as the sg_lists's lkey for
> IB_WR_RDMA_READ_WITH_INV.

I think this would be an incremental patch and not as part of iwarp
support.

Question though, wouldn't it be better to do a single RDMA_READ to say
4 registered keys rather than RDMA_READ_WITH_INV for each?
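The selection rule proposed in the reply, if (signature || iwarp) use fastreg else use local_dma_lkey, fits in one small helper. This is only a sketch of that decision; the enum and function names are made up for illustration and do not come from isert.

```c
#include <stdbool.h>

/* Hypothetical labels for the two ways of mapping RDMA READ buffers. */
enum demo_read_map {
	DEMO_MAP_LOCAL_DMA_LKEY,	/* no registration needed */
	DEMO_MAP_FASTREG		/* register an MR for the transfer */
};

/*
 * Pick the mapping for the local buffers of an RDMA READ.
 * Signature (PI) offload and iWARP both need a registered MR;
 * plain IB can use the local DMA lkey and skip that overhead.
 */
static enum demo_read_map demo_pick_read_mapping(bool signature, bool iwarp)
{
	if (signature || iwarp)
		return DEMO_MAP_FASTREG;
	return DEMO_MAP_LOCAL_DMA_LKEY;
}
```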
[PATCH v2 11/13] IB/cma: Use found net_dev for passive connections
When receiving a new connection in cma_req_handler, we actually already
know the net_dev that is used for the connection's creation. Instead of
calling cma_translate_addr to resolve the new connection id's source
address, just use the net_dev that was found.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
---
 drivers/infiniband/core/cma.c | 74 +++
 1 file changed, 47 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 42f412fde064..1c43b58a8eb2 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1273,33 +1273,31 @@ static struct rdma_id_private *cma_find_listener(
 }
 
 static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
-						 struct ib_cm_event *ib_event)
+						 struct ib_cm_event *ib_event,
+						 struct net_device **net_dev)
 {
 	struct cma_req_info req;
 	struct rdma_bind_list *bind_list;
 	struct rdma_id_private *id_priv;
-	struct net_device *net_dev;
 	int err;
 
 	err = cma_save_req_info(ib_event, &req);
 	if (err)
 		return ERR_PTR(err);
 
-	net_dev = cma_get_net_dev(ib_event, &req);
-	if (IS_ERR(net_dev)) {
-		if (PTR_ERR(net_dev) == -EAFNOSUPPORT) {
+	*net_dev = cma_get_net_dev(ib_event, &req);
+	if (IS_ERR(*net_dev)) {
+		if (PTR_ERR(*net_dev) == -EAFNOSUPPORT) {
 			/* Assuming the protocol is AF_IB */
-			net_dev = NULL;
+			*net_dev = NULL;
 		} else {
-			return ERR_PTR(PTR_ERR(net_dev));
+			return ERR_PTR(PTR_ERR(*net_dev));
 		}
 	}
 
 	bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
 				cma_port_from_service_id(req.service_id));
-	id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, net_dev);
-
-	dev_put(net_dev);
+	id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
 
 	return id_priv;
 }
@@ -1549,7 +1547,8 @@ out:
 }
 
 static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
-					       struct ib_cm_event *ib_event)
+					       struct ib_cm_event *ib_event,
+					       struct net_device *net_dev)
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
@@ -1581,14 +1580,15 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 	if (rt->num_paths == 2)
 		rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path;
 
-	if (cma_any_addr(cma_src_addr(id_priv))) {
-		rt->addr.dev_addr.dev_type = ARPHRD_INFINIBAND;
-		rdma_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
-		ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	} else {
-		ret = cma_translate_addr(cma_src_addr(id_priv), &rt->addr.dev_addr);
+	if (net_dev) {
+		ret = rdma_copy_addr(&rt->addr.dev_addr, net_dev, NULL);
 		if (ret)
 			goto err;
+	} else {
+		/* An AF_IB connection */
+		WARN_ON_ONCE(ss_family != AF_IB);
+
+		cma_translate_ib((struct sockaddr_ib *)cma_src_addr(id_priv), &rt->addr.dev_addr);
 	}
 	rdma_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
 
@@ -1601,7 +1601,8 @@ err:
 }
 
 static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
-					      struct ib_cm_event *ib_event)
+					      struct ib_cm_event *ib_event,
+					      struct net_device *net_dev)
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
@@ -1620,10 +1621,17 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 			      ib_event->param.sidr_req_rcvd.service_id))
 		goto err;
 
-	if (!cma_any_addr((struct sockaddr *) &id->route.addr.src_addr)) {
-		ret = cma_translate_addr(cma_src_addr(id_priv), &id->route.addr.dev_addr);
+	if (net_dev) {
+		ret = rdma_copy_addr(&id->route.addr.dev_addr, net_dev, NULL);
 		if (ret)
 			goto err;
+	} else {
+		/* An AF_IB connection */
+		WARN_ON_ONCE(ss_family != AF_IB);
+
+		if (!cma_any_addr(cma_src_addr(id_priv)))
+			cma_translate_ib((struct sockaddr_ib *)cma_src_addr(id_priv),
+					 &id->route.addr.dev_addr);
 	}
 
 	id_priv->state =
[PATCH v2 03/13] IB/ipoib: Return IPoIB devices matching connection parameters
From: Guy Shapiro <gu...@mellanox.com>

Implement the get_net_device_by_port_pkey_ip callback that returns network
device to ib_core according to connection parameters. Check the ipoib
device and iterate over all child devices to look for a match.

For each IPoIB device we iterate through all upper devices when searching
for a matching IP, in order to support bonding.

Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 229 +-
 1 file changed, 228 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cca1a0c91ec4..36536ce5a3e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -48,6 +48,9 @@
 
 #include <linux/jhash.h>
 #include <net/arp.h>
+#include <net/addrconf.h>
+#include <linux/inetdevice.h>
+#include <rdma/ib_cache.h>
 
 #define DRV_VERSION "1.0.0"
 
@@ -91,11 +94,16 @@ struct ib_sa_client ipoib_sa_client;
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device, void *client_data);
 static void ipoib_neigh_reclaim(struct rcu_head *rp);
+static struct net_device *ipoib_get_net_dev_by_params(
+		struct ib_device *dev, u8 port, u16 pkey,
+		const union ib_gid *gid, const struct sockaddr *addr,
+		void *client_data);
 
 static struct ib_client ipoib_client = {
 	.name = "ipoib",
 	.add = ipoib_add_one,
-	.remove = ipoib_remove_one
+	.remove = ipoib_remove_one,
+	.get_net_dev_by_params = ipoib_get_net_dev_by_params,
 };
 
 int ipoib_open(struct net_device *dev)
@@ -222,6 +230,225 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }
 
+/* Called with an RCU read lock taken */
+static bool ipoib_is_dev_match_addr_rcu(const struct sockaddr *addr,
+					struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+	struct in_device *in_dev;
+	struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+	struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
+	__be32 ret_addr;
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		in_dev = in_dev_get(dev);
+		if (!in_dev)
+			return false;
+
+		ret_addr = inet_confirm_addr(net, in_dev, 0,
+					     addr_in->sin_addr.s_addr,
+					     RT_SCOPE_HOST);
+		in_dev_put(in_dev);
+		if (ret_addr)
+			return true;
+
+		break;
+	case AF_INET6:
+		if (IS_ENABLED(CONFIG_IPV6) &&
+		    ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+			return true;
+
+		break;
+	}
+	return false;
+}
+
+/**
+ * Find the master net_device on top of the given net_device.
+ * @dev: base IPoIB net_device
+ *
+ * Returns the master net_device with a reference held, or the same net_device
+ * if no master exists.
+ */
+static struct net_device *ipoib_get_master_net_dev(struct net_device *dev)
+{
+	struct net_device *master;
+
+	rcu_read_lock();
+	master = netdev_master_upper_dev_get_rcu(dev);
+	if (master)
+		dev_hold(master);
+	rcu_read_unlock();
+
+	if (master)
+		return master;
+
+	dev_hold(dev);
+	return dev;
+}
+
+/**
+ * Find a net_device matching the given address, which is an upper device of
+ * the given net_device.
+ * @addr: IP address to look for.
+ * @dev: base IPoIB net_device
+ *
+ * If found, returns the net_device with a reference held. Otherwise return
+ * NULL.
+ */
+static struct net_device *ipoib_get_net_dev_match_addr(
+		const struct sockaddr *addr, struct net_device *dev)
+{
+	struct net_device *upper,
+			  *result = NULL;
+	struct list_head *iter;
+
+	rcu_read_lock();
+	if (ipoib_is_dev_match_addr_rcu(addr, dev)) {
+		dev_hold(dev);
+		result = dev;
+		goto out;
+	}
+
+	netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
+		if (ipoib_is_dev_match_addr_rcu(addr, upper)) {
+			dev_hold(upper);
+			result = upper;
+			break;
+		}
+	}
+out:
+	rcu_read_unlock();
+	return result;
+}
+
+/* returns the number of IPoIB netdevs on top a given ipoib device matching a
+ * pkey_index and address, if one exists.
+ *
+ * @found_net_dev: contains a matching net_device if the return value >= 1,
+ * with a reference held. */
+static int ipoib_match_gid_pkey_addr(struct
Re: [PATCH v3 04/15] xprtrdma: Don't fall back to PHYSICAL memory registration
On Mon, Jul 20, 2015 at 03:03:02PM -0400, Chuck Lever wrote:
> PHYSICAL memory registration uses a single rkey for all of the
> client's memory, thus is insecure. It is still useful in some cases
> for testing.
>
> Retain the ability to select PHYSICAL memory registration capability
> via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it
> if the HCA does not support FRWR or FMR.
>
> This means amso1100 no longer works out of the box with NFS/RDMA.
> When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before
> performing NFS/RDMA mounts.

Looks good,

Reviewed-by: Christoph Hellwig <h...@lst.de>
Re: [PATCH v3 01/15] xprtrdma: Make xprt_setup_rdma() agnostic to family of server address
On Mon, Jul 20, 2015 at 03:02:33PM -0400, Chuck Lever wrote:
> In particular, recognize when an IPv6 connection is bound.
>
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> Tested-by: Devesh Sharma <devesh.sha...@avagotech.com>

Looks good,

Reviewed-by: Christoph Hellwig <h...@lst.de>
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/26/2015 6:00 AM, Sagi Grimberg wrote:
> On 7/26/2015 1:43 PM, Christoph Hellwig wrote:
>> On Sun, Jul 26, 2015 at 01:08:16PM +0300, Sagi Grimberg wrote:
>>> I've given this some thought and I think we should avoid splitting
>>> logic from PI and iWARP. The reason (other than code duplication) is
>>> that currently the iser target support only up to 1MB IOs. I have some
>>> code (not done yet) to support larger IOs by using multiple
>>> registrations per IO (with or without PI).
>>
>> Just curious: How is this going to work with iSER only having a single
>> rkey/offset/len field?
>
> Good question,
>
> On the wire iser sends a single rkey, but the target is allowed to
> transfer the data however it wants to.
>
> Say that the local target HCA supports only 32 pages (128K bytes for 4K
> pages) registration and the initiator sent:
>
>   rkey=0x1234 address=0x length=512K
>
> The target would allocate a 512K buffer and:
>
>   register offset 0-128K to lkey=0x1
>   register offset 128K-256K to lkey=0x2
>   register offset 256K-384K to lkey=0x3
>   register offset 384K-512K to lkey=0x4
>
> then constructs sg_list as:
>
>   sg_list[0] = {addr=buf, length=128K, lkey=0x1}
>   sg_list[1] = {addr=buf+128K, length=128K, lkey=0x2}
>   sg_list[2] = {addr=buf+256K, length=128K, lkey=0x3}
>   sg_list[3] = {addr=buf+384K, length=128K, lkey=0x4}
>
> Then set rdma_read wr with:
>
>   rdma_r_wr.sg_list=sg_list
>   rdma_r_wr.rdma.addr=0x
>   rdma_r_wr.rdma.rkey=0x1234
>
>   post_send(rdma_r_wr);
>
> Ideally, the post contains a chain of all 4 registrations and the
> rdma_read (and an opportunistic good scsi response).

Just to be clear: This example is for IB only, correct? IW would require
rkeys with REMOTE_WRITE and 4 read wrs. And you're ignoring invalidation
wrs (or read-with-inv) in the example...
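The splitting step in the quoted example (one 512K buffer, registrations of at most 128K each, one sg_list entry per registration) can be sketched as a small helper. This is an illustrative userspace sketch only: struct demo_sge and demo_build_sg_list() are hypothetical, no verbs calls are made, and the lkeys are placeholder counters (1, 2, 3, ...) standing in for whatever a real registration would return.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scatter/gather entry, loosely modeled on the example. */
struct demo_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/*
 * Split [buf, buf + len) into chunks of at most max_reg_len bytes and
 * fill one entry per chunk. Returns the number of entries written, or
 * 0 if max_reg_len is 0 or sg[] is too small to hold them all.
 */
static size_t demo_build_sg_list(uint64_t buf, uint32_t len,
				 uint32_t max_reg_len,
				 struct demo_sge *sg, size_t max_sge)
{
	size_t n = 0;
	uint32_t off = 0;

	if (max_reg_len == 0)
		return 0;

	while (off < len) {
		uint32_t chunk = len - off;

		if (chunk > max_reg_len)
			chunk = max_reg_len;
		if (n == max_sge)
			return 0;

		sg[n].addr = buf + off;
		sg[n].length = chunk;
		sg[n].lkey = (uint32_t)(n + 1);	/* placeholder key */
		n++;
		off += chunk;
	}
	return n;
}
```

With len = 512K and max_reg_len = 128K this produces the four entries listed in the mail; a length that is not a multiple of max_reg_len simply gets a shorter final entry.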
Re: [PATCH v3 06/15] xprtrdma: Clean up rpcrdma_ia_open()
On Sun, Jul 26, 2015 at 02:21:23PM -0400, Chuck Lever wrote:
> No, this patch is not strictly needed in 4.3, but my read of
> Jason's series is that he does not touch xprtrdma. I don't believe
> there will be a merge conflict.
>
> The goal of this patch is to move xprtrdma forward so it will be
> straightforward to use pd->local_dma_lkey for RPC send and receive
> buffers. That's a change that can be added after both this patch
> and Jason's series is merged.
>
> I prefer keeping this patch separate, because that makes it simpler
> to review and test this refactor. I don't see a reason to delay it,
> but I can do that if it is needed.

You're right, Jason didn't touch xprtrdma. Sorry for the noise.
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On Sun, Jul 26, 2015 at 02:00:51PM +0300, Sagi Grimberg wrote:
> On the wire iser sends a single rkey, but the target is allowed to
> transfer the data however it wants to.

So you're trying to get above the limit of a single RDMA READ, not
above the limit for memory registration in the initiator? In that case
your explanation makes sense, that's just not what I expected to be the
limiting factor.
Re: [PATCH v3 06/15] xprtrdma: Clean up rpcrdma_ia_open()
Hi Christoph-

On Jul 26, 2015, at 12:53 PM, Christoph Hellwig <h...@infradead.org> wrote:
> Jason has patches that provide a local_dma_lkey in the PD that is always
> available. Do you need this cleanup for the next merge window? If not, it
> might be worth postponing it to avoid merge conflicts, especially as I
> assume the NFS changes will go in through Trond.

No, this patch is not strictly needed in 4.3, but my read of
Jason’s series is that he does not touch xprtrdma. I don’t believe
there will be a merge conflict.

The goal of this patch is to move xprtrdma forward so it will be
straightforward to use pd->local_dma_lkey for RPC send and receive
buffers. That’s a change that can be added after both this patch
and Jason’s series is merged.

I prefer keeping this patch separate, because that makes it simpler
to review and test this refactor. I don’t see a reason to delay it,
but I can do that if it is needed.

> On Mon, Jul 20, 2015 at 03:03:20PM -0400, Chuck Lever wrote:
>> Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
>> is different for each registration method, to the .ro_open functions.
>>
>> This is refactoring only. No behavior change is expected.
>>
>> [...]
[PATCH for-4.2] iw_cxgb4: gracefully handle unknown CQE status errors
c4iw_poll_cq_one() shouldn't fail the poll operation just because the CQE
status is unknown. Rather, it should map this to the fatal error status
and log the anomaly.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <haripra...@chelsio.com>
---
 drivers/infiniband/hw/cxgb4/cq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index c7aab48..92d5183 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -814,7 +814,7 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct ib_wc *wc)
 			printk(KERN_ERR MOD
 			       "Unexpected cqe_status 0x%x for QPID=0x%0x\n",
 			       CQE_STATUS(&cqe), CQE_QPID(&cqe));
-			ret = -EINVAL;
+			wc->status = IB_WC_FATAL_ERR;
 		}
 	}
 out:
-- 
2.3.4
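The behavior change can be illustrated outside the driver: an unrecognized hardware status is logged and reported to the consumer as a fatal completion, while the poll call itself still succeeds. The sketch below is a hypothetical reduction; the DEMO_* codes and demo_poll_one() are invented for illustration and are not the cxgb4 definitions.

```c
#include <stdio.h>

/* Hypothetical completion statuses reported to the consumer. */
enum demo_wc_status {
	DEMO_WC_SUCCESS,
	DEMO_WC_LOC_PROT_ERR,
	DEMO_WC_FATAL_ERR
};

/* Hypothetical raw hardware status codes. */
#define DEMO_HW_OK	0
#define DEMO_HW_PROT	1

/*
 * Translate one hardware status into a completion status.
 * Returns 0 even for unknown codes: the poll succeeded, it is the
 * completion that carries the error.
 */
static int demo_poll_one(int hw_status, enum demo_wc_status *status)
{
	switch (hw_status) {
	case DEMO_HW_OK:
		*status = DEMO_WC_SUCCESS;
		break;
	case DEMO_HW_PROT:
		*status = DEMO_WC_LOC_PROT_ERR;
		break;
	default:
		/* Log the anomaly, then surface it as a fatal completion. */
		fprintf(stderr, "unexpected hw status 0x%x\n",
			(unsigned int)hw_status);
		*status = DEMO_WC_FATAL_ERR;
		break;
	}
	return 0;
}
```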
[PATCH for-4.2] iw_cxgb4: set the default MPA version to 2
This enables ORD/IRD negotiation, and it's about time to enable it by
default.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
Signed-off-by: Hariprasad Shenai <haripra...@chelsio.com>
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 3ad8dc7..75144d9 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -115,11 +115,11 @@ module_param(ep_timeout_secs, int, 0644);
 MODULE_PARM_DESC(ep_timeout_secs, "CM Endpoint operation timeout "
 				   "in seconds (default=60)");
 
-static int mpa_rev = 1;
+static int mpa_rev = 2;
 module_param(mpa_rev, int, 0644);
 MODULE_PARM_DESC(mpa_rev, "MPA Revision, 0 supports amso1100, "
 		"1 is RFC0544 spec compliant, 2 is IETF MPA Peer Connect Draft"
-		" compliant (default=1)");
+		" compliant (default=2)");
 
 static int markers_enabled;
 module_param(markers_enabled, int, 0644);
-- 
2.3.4
[PATCH v2 02/13] IB/core: Find the network device matching connection parameters
From: Yotam Kenneth yota...@mellanox.com

In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement a
callback that returns the network device according to connection
parameters.

The IB device and port, together with the P_Key and the GID should be
enough to uniquely identify the ULP net device. However, in current
kernels there can be multiple IPoIB interfaces created with the same
GID. Furthermore, such configuration may be desirable to support
ipvlan-like configurations for RDMA CM with IPoIB. To resolve the
device in these cases the code will also take the IP address as an
additional input.

Cc: Jason Gunthorpe jguntho...@obsidianresearch.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com
---
 drivers/infiniband/core/device.c | 46
 include/rdma/ib_verbs.h          | 27 +++
 2 files changed, 73 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 623d8e191ced..124597732fe7 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>
 
 #include "core_priv.h"
@@ -781,6 +782,51 @@ int ib_find_pkey(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_find_pkey);
 
+/**
+ * ib_get_net_dev_by_params() - Return the appropriate net_dev
+ * for a received CM request
+ * @dev: An RDMA device on which the request has been received.
+ * @port: Port number on the RDMA device.
+ * @pkey: The Pkey the request came on.
+ * @gid: A GID that the net_dev uses to communicate.
+ * @addr: Contains the IP address that the request specified as its
+ * destination.
+ */
+struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
+					    u8 port,
+					    u16 pkey,
+					    const union ib_gid *gid,
+					    const struct sockaddr *addr)
+{
+	struct net_device *net_dev = NULL;
+	struct ib_client_data *context;
+
+	if (!rdma_protocol_ib(dev, port))
+		return NULL;
+
+	down_read(&lists_rwsem);
+
+	list_for_each_entry(context, &dev->client_data_list, list) {
+		struct ib_client *client = context->client;
+
+		if (context->going_down)
+			continue;
+
+		if (client->get_net_dev_by_params) {
+			net_dev = client->get_net_dev_by_params(dev, port, pkey,
+								gid, addr,
+								context->data);
+			if (net_dev)
+				break;
+		}
+	}
+
+	up_read(&lists_rwsem);
+
+	return net_dev;
+}
+EXPORT_SYMBOL(ib_get_net_dev_by_params);
+
 static int __init ib_core_init(void)
 {
 	int ret;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5b83e0c10d55..b04d2b4d1792 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -48,6 +48,7 @@
 #include <linux/rwsem.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
+#include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
 
 #include <linux/atomic.h>
@@ -1765,6 +1766,28 @@ struct ib_client {
 	void (*add)   (struct ib_device *);
 	void (*remove)(struct ib_device *, void *client_data);
 
+	/* Returns the net_dev belonging to this ib_client and matching the
+	 * given parameters.
+	 * @dev:	 An RDMA device that the net_dev use for communication.
+	 * @port:	 A physical port number on the RDMA device.
+	 * @pkey:	 P_Key that the net_dev uses if applicable.
+	 * @gid:	 A GID that the net_dev uses to communicate.
+	 * @addr:	 An IP address the net_dev is configured with.
+	 * @client_data: The device's client data set by ib_set_client_data().
+	 *
+	 * An ib_client that implements a net_dev on top of RDMA devices
+	 * (such as IP over IB) should implement this callback, allowing the
+	 * rdma_cm module to find the right net_dev for a given request.
+	 *
+	 * The caller is responsible for calling dev_put on the returned
+	 * netdev. */
+	struct net_device *(*get_net_dev_by_params)(
+			struct ib_device *dev,
+			u8 port,
+			u16 pkey,
+			const union ib_gid *gid,
+			const struct sockaddr *addr,
+
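For readers skimming the archive, the dispatch loop that ib_get_net_dev_by_params() performs can be modelled in plain userspace C. This is only a sketch under assumptions: struct netdev, struct client, find_net_dev() and match_port_one() are hypothetical stand-ins, there is no locking (the patch holds lists_rwsem for read), and the GID and address arguments are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct netdev {
	const char *name;
};

/* Hypothetical stand-in for an ib_client plus its per-device context. */
struct client {
	int going_down;		/* set while the client is being removed */
	struct netdev *(*get_net_dev)(int port, unsigned short pkey,
				      void *data);	/* optional callback */
	void *data;		/* per-client data handed to the callback */
};

/* Example callback: claims every request that arrives on port 1. */
static struct netdev *match_port_one(int port, unsigned short pkey,
				     void *data)
{
	(void)pkey;
	return port == 1 ? data : NULL;
}

/*
 * Ask each live client in turn and return the first net device found.
 * Clients that are going down or that have no callback are skipped.
 */
static struct netdev *find_net_dev(struct client *clients, size_t n,
				   int port, unsigned short pkey)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct client *c = &clients[i];
		struct netdev *nd;

		if (c->going_down || !c->get_net_dev)
			continue;

		nd = c->get_net_dev(port, pkey, c->data);
		if (nd)
			return nd;
	}

	return NULL;
}
```

The first non-NULL answer wins, so the order of the client list decides which ULP gets to claim a request when more than one could match.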
[PATCH v2 05/13] IB/cm: Share listening CM IDs
Enabling network namespaces for RDMA CM will allow processes on different
namespaces to listen on the same port. In order to leave namespace support
out of the CM layer, this requires that multiple RDMA CM IDs will be able
to share a single CM ID.

This patch adds infrastructure to retrieve an existing listening ib_cm_id,
based on its device and service ID, or create a new one if one does not
already exist. It also adds a reference count for such instances
(cm_id_private.listen_sharecount), and prevents cm_destroy_id from
destroying a CM if it is still shared.

See the relevant discussion [1].

[1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
    http://www.spinics.net/lists/netdev/msg328860.html

Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cm.c | 126 ---
 include/rdma/ib_cm.h         |   4 ++
 2 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 93e9e2f34fc6..bcad4cf8404e 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -213,6 +213,9 @@ struct cm_id_private {
 	spinlock_t lock;	/* Do not acquire inside cm.lock */
 	struct completion comp;
 	atomic_t refcount;
+	/* Number of clients sharing this ib_cm_id. Only valid for listeners.
+	 * Protected by the cm.lock spinlock. */
+	int listen_sharecount;
 
 	struct ib_mad_send_buf *msg;
 	struct cm_timewait_info *timewait_info;
@@ -859,9 +862,15 @@ retest:
 	spin_lock_irq(&cm_id_priv->lock);
 	switch (cm_id->state) {
 	case IB_CM_LISTEN:
-		cm_id->state = IB_CM_IDLE;
 		spin_unlock_irq(&cm_id_priv->lock);
+
 		spin_lock_irq(&cm.lock);
+		if (--cm_id_priv->listen_sharecount > 0) {
+			/* The id is still shared. */
+			cm_deref_id(cm_id_priv);
+			spin_unlock_irq(&cm.lock);
+			return;
+		}
 		rb_erase(&cm_id_priv->service_node, &cm.listen_service_table);
 		spin_unlock_irq(&cm.lock);
 		break;
@@ -941,11 +950,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id)
 }
 EXPORT_SYMBOL(ib_destroy_cm_id);
 
-int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
-		 struct ib_cm_compare_data *compare_data)
+/**
+ * __ib_cm_listen - Initiates listening on the specified service ID for
+ *   connection and service ID resolution requests.
+ * @cm_id: Connection identifier associated with the listen request.
+ * @service_id: Service identifier matched against incoming connection
+ *   and service ID resolution requests. The service ID should be specified
+ *   network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will
+ *   assign a service ID to the caller.
+ * @service_mask: Mask applied to service ID used to listen across a
+ *   range of service IDs. If set to 0, the service ID is matched
+ *   exactly. This parameter is ignored if %service_id is set to
+ *   IB_CM_ASSIGN_SERVICE_ID.
+ * @compare_data: This parameter is optional. It specifies data that must
+ *   appear in the private data of a connection request for the specified
+ *   listen request.
+ * @lock: If set, lock the cm.lock spin-lock when adding the id to the
+ *   listener tree. When false, the caller must already hold the spin-lock,
+ *   and compare_data must be NULL.
+ */
+static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id,
+			  __be64 service_mask,
+			  struct ib_cm_compare_data *compare_data,
+			  bool lock)
 {
 	struct cm_id_private *cm_id_priv, *cur_cm_id_priv;
-	unsigned long flags;
+	unsigned long flags = 0;
 	int ret = 0;
 
 	service_mask = service_mask ? service_mask : ~cpu_to_be64(0);
@@ -970,8 +1000,10 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 	}
 
 	cm_id->state = IB_CM_LISTEN;
+	if (lock)
+		spin_lock_irqsave(&cm.lock, flags);
 
-	spin_lock_irqsave(&cm.lock, flags);
+	++cm_id_priv->listen_sharecount;
 	if (service_id == IB_CM_ASSIGN_SERVICE_ID) {
 		cm_id->service_id = cpu_to_be64(cm.listen_service_id++);
 		cm_id->service_mask = ~cpu_to_be64(0);
@@ -980,18 +1012,100 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask,
 		cm_id->service_mask = service_mask;
 	}
 	cur_cm_id_priv = cm_insert_listen(cm_id_priv);
-	spin_unlock_irqrestore(&cm.lock, flags);
 	if (cur_cm_id_priv) {
 		cm_id->state = IB_CM_IDLE;
+		--cm_id_priv->listen_sharecount;
 		kfree(cm_id_priv->compare_data);
 		cm_id_priv->compare_data = NULL;
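The share-count idea described in this patch can be illustrated with a small userspace model. It is a sketch under assumptions, not the kernel code: struct listener, listener_get() and listener_put() are hypothetical names, a fixed array replaces the listen_service_table rb-tree, an int replaces the ib_device pointer, and no locking is shown (the patch updates listen_sharecount under cm.lock).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LISTENERS 8

/* Hypothetical model of a shared listening ID. */
struct listener {
	int in_use;
	int device;		/* stand-in for struct ib_device * */
	uint64_t service_id;
	int sharecount;		/* plays the role of listen_sharecount */
};

static struct listener table[MAX_LISTENERS];

/*
 * Return the listener for (device, service_id). An existing entry is
 * shared and its count bumped; otherwise a new entry is created with a
 * count of one. Returns NULL when the table is full.
 */
static struct listener *listener_get(int device, uint64_t service_id)
{
	struct listener *free_slot = NULL;
	size_t i;

	for (i = 0; i < MAX_LISTENERS; i++) {
		struct listener *l = &table[i];

		if (l->in_use && l->device == device &&
		    l->service_id == service_id) {
			l->sharecount++;
			return l;
		}
		if (!l->in_use && !free_slot)
			free_slot = l;
	}

	if (!free_slot)
		return NULL;

	free_slot->in_use = 1;
	free_slot->device = device;
	free_slot->service_id = service_id;
	free_slot->sharecount = 1;
	return free_slot;
}

/*
 * Drop one share. Returns 1 when this was the last user and the entry
 * was really released, 0 when other users still hold it.
 */
static int listener_put(struct listener *l)
{
	if (--l->sharecount > 0)
		return 0;

	l->in_use = 0;
	return 1;
}
```

The destroy path in the patch follows the same shape: decrement first, and only fall through to removing the entry from the table when the count reaches zero.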
[PATCH v2 08/13] IB/cm: Expose BTH P_Key in CM and SIDR request events
The rdma_cm module will later use the P_Key from the BTH to de-mux
requests. See discussion at:
http://www.spinics.net/lists/netdev/msg336067.html

Cc: Jason Gunthorpe jguntho...@obsidianresearch.com
Cc: Liran Liss lir...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cm.c | 20
 include/rdma/ib_cm.h         |  6 ++
 2 files changed, 26 insertions(+)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index bcad4cf8404e..a05c17b336aa 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1408,6 +1408,24 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 	}
 }
 
+static u16 cm_get_bth_pkey(struct cm_work *work)
+{
+	struct ib_device *ib_dev = work->port->cm_dev->ib_device;
+	u8 port_num = work->port->port_num;
+	u16 pkey_index = work->mad_recv_wc->wc->pkey_index;
+	u16 pkey;
+	int ret;
+
+	ret = ib_get_cached_pkey(ib_dev, port_num, pkey_index, &pkey);
+	if (ret) {
+		dev_warn_ratelimited(&ib_dev->dev, "ib_cm: Couldn't retrieve pkey for incoming request (port %d, pkey index %d). %d\n",
+				     port_num, pkey_index, ret);
+		return 0;
+	}
+
+	return pkey;
+}
+
 static void cm_format_req_event(struct cm_work *work,
 				struct cm_id_private *cm_id_priv,
 				struct ib_cm_id *listen_id)
@@ -1418,6 +1436,7 @@ static void cm_format_req_event(struct cm_work *work,
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
 	param = &work->cm_event.param.req_rcvd;
 	param->listen_id = listen_id;
+	param->bth_pkey = cm_get_bth_pkey(work);
 	param->port = cm_id_priv->av.port->port_num;
 	param->primary_path = &work->path[0];
 	if (req_msg->alt_local_lid)
@@ -3109,6 +3128,7 @@ static void cm_format_sidr_req_event(struct cm_work *work,
 	param->pkey = __be16_to_cpu(sidr_req_msg->pkey);
 	param->listen_id = listen_id;
 	param->service_id = sidr_req_msg->service_id;
+	param->bth_pkey = cm_get_bth_pkey(work);
 	param->port = work->port->port_num;
 	work->cm_event.private_data = &sidr_req_msg->private_data;
 }
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index f7fd22f10bae..5b54cf77862e 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -113,6 +113,10 @@ struct ib_cm_id;
 
 struct ib_cm_req_event_param {
 	struct ib_cm_id		*listen_id;
+
+	/* P_Key that was used by the GMP's BTH header */
+	u16			bth_pkey;
+
 	u8			port;
 
 	struct ib_sa_path_rec	*primary_path;
@@ -224,6 +228,8 @@ struct ib_cm_apr_event_param {
 struct ib_cm_sidr_req_event_param {
 	struct ib_cm_id		*listen_id;
 	__be64			service_id;
+	/* P_Key that was used by the GMP's BTH header */
+	u16			bth_pkey;
 	u8			port;
 	u16			pkey;
 };
-- 
1.7.11.2
[PATCH v2 07/13] IB/cma: Helper functions to access port space IDRs
Add helper functions to access the IDRs by port-space and port number.
Pass around the port-space enum in cma.c instead of using pointers to
port-space IDRs.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com
---
 drivers/infiniband/core/cma.c | 81 ---
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cf5c48b0b7d5..f2d799209412 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -113,6 +113,22 @@ static DEFINE_IDR(udp_ps);
 static DEFINE_IDR(ipoib_ps);
 static DEFINE_IDR(ib_ps);
 
+static struct idr *cma_idr(enum rdma_port_space ps)
+{
+	switch (ps) {
+	case RDMA_PS_TCP:
+		return &tcp_ps;
+	case RDMA_PS_UDP:
+		return &udp_ps;
+	case RDMA_PS_IPOIB:
+		return &ipoib_ps;
+	case RDMA_PS_IB:
+		return &ib_ps;
+	default:
+		return NULL;
+	}
+}
+
 struct cma_device {
 	struct list_head	list;
 	struct ib_device	*device;
@@ -122,11 +138,33 @@ struct cma_device {
 };
 
 struct rdma_bind_list {
-	struct idr		*ps;
+	enum rdma_port_space	ps;
 	struct hlist_head	owners;
 	unsigned short		port;
 };
 
+static int cma_ps_alloc(enum rdma_port_space ps,
+			struct rdma_bind_list *bind_list, int snum)
+{
+	struct idr *idr = cma_idr(ps);
+
+	return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
+}
+
+static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum)
+{
+	struct idr *idr = cma_idr(ps);
+
+	return idr_find(idr, snum);
+}
+
+static void cma_ps_remove(enum rdma_port_space ps, int snum)
+{
+	struct idr *idr = cma_idr(ps);
+
+	idr_remove(idr, snum);
+}
+
 enum {
 	CMA_OPTION_AFONLY,
 };
@@ -1069,7 +1107,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	hlist_del(&id_priv->node);
 	if (hlist_empty(&bind_list->owners)) {
-		idr_remove(bind_list->ps, bind_list->port);
+		cma_ps_remove(bind_list->ps, bind_list->port);
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
@@ -2365,8 +2403,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list,
 	hlist_add_head(&id_priv->node, &bind_list->owners);
 }
 
-static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
-			  unsigned short snum)
+static int cma_alloc_port(enum rdma_port_space ps,
+			  struct rdma_id_private *id_priv, unsigned short snum)
 {
 	struct rdma_bind_list *bind_list;
 	int ret;
@@ -2375,7 +2413,7 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
 	if (!bind_list)
 		return -ENOMEM;
 
-	ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+	ret = cma_ps_alloc(ps, bind_list, snum);
 	if (ret < 0)
 		goto err;
 
@@ -2388,7 +2426,8 @@ err:
 	return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
-static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_alloc_any_port(enum rdma_port_space ps,
+			      struct rdma_id_private *id_priv)
 {
 	static unsigned int last_used_port;
 	int low, high, remaining;
@@ -2399,7 +2438,7 @@ static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
 	rover = prandom_u32() % remaining + low;
 retry:
 	if (last_used_port != rover &&
-	    !idr_find(ps, (unsigned short) rover)) {
+	    !cma_ps_find(ps, (unsigned short)rover)) {
 		int ret = cma_alloc_port(ps, id_priv, rover);
 		/*
 		 * Remember previously used port number in order to avoid
@@ -2454,7 +2493,8 @@ static int cma_check_port(struct rdma_bind_list *bind_list,
 	return 0;
 }
 
-static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_use_port(enum rdma_port_space ps,
+			struct rdma_id_private *id_priv)
 {
 	struct rdma_bind_list *bind_list;
 	unsigned short snum;
@@ -2464,7 +2504,7 @@ static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
 	if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
 		return -EACCES;
 
-	bind_list = idr_find(ps, snum);
+	bind_list = cma_ps_find(ps, snum);
 	if (!bind_list) {
 		ret = cma_alloc_port(ps, id_priv, snum);
 	} else {
@@ -2487,25 +2527,24 @@ static int cma_bind_listen(struct rdma_id_private *id_priv)
 	return ret;
 }
 
-static struct idr *cma_select_inet_ps(struct rdma_id_private *id_priv)
+static enum rdma_port_space
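The refactoring in this patch (key every lookup by a port-space enum and resolve the backing table in one place) can be sketched in userspace C. Everything here is a hypothetical model: enum port_space, struct port_table and the ps_*() helpers are illustrative names, a fixed array stands in for struct idr, and unlike the patch's helpers the sketch checks for an unknown port space before touching the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for enum rdma_port_space. */
enum port_space { PS_TCP, PS_UDP, PS_IPOIB, PS_IB, PS_COUNT };

#define MAX_PORT 1024	/* kept small for the model */

/* Hypothetical stand-in for struct idr: one owner pointer per port. */
struct port_table {
	void *slots[MAX_PORT];
};

static struct port_table tables[PS_COUNT];

/* Single place that maps a port space to its table (like cma_idr()). */
static struct port_table *ps_table(enum port_space ps)
{
	if ((int)ps < 0 || ps >= PS_COUNT)
		return NULL;
	return &tables[ps];
}

/* Bind owner to exactly this port; returns the port, or -1 on failure. */
static int ps_alloc(enum port_space ps, void *owner, int port)
{
	struct port_table *t = ps_table(ps);

	if (!t || port < 0 || port >= MAX_PORT || t->slots[port])
		return -1;
	t->slots[port] = owner;
	return port;
}

/* Look up the owner bound to a port, or NULL if the port is free. */
static void *ps_find(enum port_space ps, int port)
{
	struct port_table *t = ps_table(ps);

	if (!t || port < 0 || port >= MAX_PORT)
		return NULL;
	return t->slots[port];
}

/* Release a port so that it can be bound again. */
static void ps_remove(enum port_space ps, int port)
{
	struct port_table *t = ps_table(ps);

	if (t && port >= 0 && port < MAX_PORT)
		t->slots[port] = NULL;
}
```

Callers only ever see the enum, which is what lets rdma_bind_list store a port space value instead of a pointer to a particular table.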
[PATCH v2 00/13] Demux IB CM requests in the rdma_cm module
Thanks everyone for the review comments. I've updated the patch set
accordingly. The changes are listed below. In addition to the changes
discussed on the list I've made sure AF_IB continues to work by retrieving
parameters from the listener ID when an AF_IB request is detected.

Changes from v1:
- Patch 1: mark ib_client_data as going down instead of removing all client
  contexts during de-registration.
- Patch 2:
  * move kdoc to the function definition
  * do not call get_net_dev_by_params() on devices/clients that are going
    down
  * pass client data directly to the callback
- Patch 3:
  * pass client data directly to callback
  * fix a lockdep warning in ipoib_match_gid_pkey_addr()
  * remove a debugging print left over
  * set a rate limit to the duplicated IP address warning
- Patch 5:
  * change atomic_dec(&id->refcount) to cm_deref_id()
  * always update listen_sharecount under the cm.lock spinlock
- Patch 6: handle AF_IB requests by getting parameters from the listener
- Patch 8: new patch to expose BTH P_Key from ib_cm to rdma_cm
- Patch 9:
  * get P_Key used for de-mux from the BTH
  * use -EAFNOSUPPORT in cma_save_ip_info to designate a possible AF_IB
    connection request
  * pass a NULL netdev for AF_IB requests
- Patch 11: handle AF_IB connections by filling connection information from
  the listener id instead of from the net_dev
- Patch 12: fix mention of the old ib_cm_id_create_and_listen function in
  the changelog entry.

Changes from v0:
- Added a patch to prevent a race between ib_unregister_device() and
  ib_get_net_dev_by_params().
- Removed the patch that exported a UD GMP packet's GID from the GRH, and
  related code.
- Patch 3:
  * Add _rcu suffix to ipoib_is_dev_match_addr().
  * Add helper function to get the master netdev for bonding support.
  * Scan for matching net devices in two phases: first without looking at
    the IP address, and then looking at the IP address only when the first
    phase did not find a unique net device.
- Patch 5:
  * Do not init listen_sharecount = 1 for non-listening ib_cm_ids.
  * Remove code that sets a CM ID's state to IB_CM_IDLE right before
    destruction.
  * Rename ib_cm_id_create_and_listen() to ib_cm_insert_listen().
  * Do not increase reference counts when failing to add a shared CM ID due
    to having a different handler callback.
- Patch 9: Clean IPv4 net_dev validation function.
- Added patch 10: new patch to use the found net_dev in IB/cma for
  eliminating unneeded calls to cma_translate_addr.
- Patch 12: Remove the lock argument to __ib_cm_listen().

The rdma_cm module relies today on the ib_cm module to demux incoming
requests based on their service ID and IP address. The ib_cm module is the
wrong place to perform this task, as it can also be used with services that
do not adhere to the RDMA IP CM service as defined in the IBA
specifications. It is forced to use an opaque private data struct and mask
to compare incoming requests against.

This series moves that demux task responsibility to the rdma_cm module. The
rdma_cm module can look into the private data attached to a CM request,
containing the IP addresses related to the request. It uses the details of
the request to find the net device associated with the request, and uses
that net device to find the correct listening rdma_cm_id.

The series applies against Doug's for-v4.2 tree with the patch adding a
rwsem to IB core [2] applied.

The series is structured as follows:

Patch 1 prevents a possible race between ib_client.remove() callbacks from
ib_unregister_device(), and ib_client callbacks that rely on the
lists_rwsem locked for read, such as ib_get_net_dev_by_params(). Both
callbacks may call ib_get_client_data(), and the patch makes sure that the
remove callback doesn't free the client data while it is being used by the
other callback.

Patches 2-3 add the ability to look up a network device according to the IB
device, port, P_Key, GID and IP address. They find the matching IPoIB
interfaces, and return a matching net_device if one exists.

Patches 4-5 make necessary changes in ib_cm to allow RDMA CM to get the
information it needs out of CM and SIDR requests, and share a single
ib_cm_id with multiple RDMA CM listeners.

Patches 6-7 do some preliminary refactoring to the rdma_cm module. They
allow extracting information out of incoming requests instead of retrieving
them from a listening CM ID, and add helper functions to access the port
space IDRs.

Finally, patches 8-12 change rdma_cm to demultiplex requests on its own,
and patch 13 cleans up the now unneeded code in ib_cm to compare against
the private data.

This series contains a subset of the RDMA CM namespaces patches [1]. The
changes from v4 of the relevant patches are:
- Patch 1
  * in addition to the IB device, port, P_Key and IP address, pass also the
    GID, to make future IPoIB devices with alias GIDs unique.
  * return the matching net_device instead of a network namespace.
- Patch 2: use
Re: [PATCH V6 6/9] isert: Rename IO functions to more descriptive names
On 7/26/2015 6:53 PM, Christoph Hellwig wrote: On Sun, Jul 26, 2015 at 02:00:51PM +0300, Sagi Grimberg wrote: On the wire iser sends a single rkey, but the target is allowed to transfer the data however it wants to. So you're trying to get above the limit of a single RDMA READ, not above the limit for memory registration in the initiator? Correct. In that case your explanation makes sense, that's just not what I expected to be the limiting factor. In the initiator case, there is no way to support transfer size that exceeds the device registration length capabilities (unless we start using higher-order atomic allocations which we won't).