Re: [PATCH] rdma cm + XRC
> I feel similarly about the XRC domain. Is there any real reason to
> expose it? What if we just defined a 1:1 relationship between PDs and
> XRC domains, or between XRC domains and XRC TGT QPs?
>
> Near as I can tell it serves the same purpose as the PD, to provide a
> form of security within a single process..

No, XRC domains can be shared between different processes -- that's kind of the point of XRC.

 - R.

--
Roland Dreier rola...@cisco.com || For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/index.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] rdma cm + XRC
> It seems the new API has too many constraints for XRC. There are a
> couple things that don't fit:
>
> - XRC needs a domain, which must be created before creating the QP,
>   but after we know the device to use. In addition, it also needs a
>   file descriptor. The application may want to use a different fd
>   depending on the device. Currently the domain can only be created in
>   the middle of rdma_create_ep().

This looks like a gap in the APIs. There's no easy way to associate the data returned by rdma_addrinfo with a specific ibv_device. Part of the issue is that rdma_addrinfo may not have an ai_src_addr. gurgle...

I agree with Jason that we can still change the newer calls. In this case, the problem isn't limited to XRC. The user will have issues just trying to specify the CQs that should be associated with the QP. Maybe the 'fix' here is to remove rdma_create_qp() from rdma_create_ep() -- which basically replaces that API with rdma_create_id2(**id, *res).

> - The server side of the connection also needs an SRQ. It's not
>   obvious whether it's the application or the rdma cm that should
>   create that SRQ. And that SRQ number must be given to the client
>   side, presumably in the private data.

The desired mapping of XRC to the librdmacm isn't clear to me. For example, after 'connecting', is two-way communication possible (setting up INI/TGT pairs on both nodes), or is a connection only one-way (setup local INI to remote TGT)? Also, as you point out, how are SRQ values exchanged? Does the private data carry one SRQ value, all SRQ values for remote processes, none?

- Sean
Re: [PATCH] rdma cm + XRC
On 08/11/2010 05:22 PM, Hefty, Sean wrote:
> > - The server side of the connection also needs an SRQ. It's not
> >   obvious whether it's the application or the rdma cm that should
> >   create that SRQ. And that SRQ number must be given to the client
> >   side, presumably in the private data.
>
> The desired mapping of XRC to the librdmacm isn't clear to me. For
> example, after 'connecting', is two-way communication possible
> (setting up INI/TGT pairs on both nodes), or is a connection only
> one-way (setup local INI to remote TGT)? Also, as you point out, how
> are SRQ values exchanged? Does the private data carry one SRQ value,
> all SRQ values for remote processes, none?

XRC is one way. There is a send QP (which looks like a regular RC QP) and a receive QP. Logically, the listen side would create an XRC receive QP and, for each incoming connection, create/attach an SRQ to it. But the initiator would have to know the SRQ number to be able to send data, and the easiest way to pass that information is in the CM REP private data.

I think one way to see it is that you connect a QP on one side to an SRQ on the other side. In the case of XRC, rdma_get_request() could create that SRQ instead of a QP.

Frank.
Re: [PATCH] rdma cm + XRC
On Wed, Aug 11, 2010 at 03:22:45PM -0700, Hefty, Sean wrote:
> > It seems the new API has too many constraints for XRC. There are a
> > couple things that don't fit:
> >
> > - XRC needs a domain, which must be created before creating the QP,
> >   but after we know the device to use. In addition, it also needs a
> >   file descriptor. The application may want to use a different fd
> >   depending on the device. Currently the domain can only be created
> >   in the middle of rdma_create_ep().
>
> This looks like a gap in the APIs. There's no easy way to associate
> the data returned by rdma_addrinfo with a specific ibv_device. Part of
> the issue is that rdma_addrinfo may not have an ai_src_addr. gurgle...

This is why I liked the notion of passing in the PD. This restricts getaddrinfo to doing something that is compatible with the PD, and when the rdma_cm_id is created and bound, it is bound to a device, selected by getaddrinfo or the kernel, that is compatible with the given PD.

[** I looked at this for a bit, and I couldn't convince myself the current implementation doesn't have this gap either. The rdma_cm_id is bound to a device based on IP addresses, but it can be bound without specifying a PD - so there really is no guarantee that the PD you want to use will be compatible with the device the kernel selects. I bet this means most apps using the RDMA CM will explode if you do something like an IPoIB bond across two HCAs..]

[The other view is that exporting per-device domains to userspace means the kernel has walked away from its role as HW resource virtualizer. Why can't a PD be global, with the kernel swapping it into HW as necessary? That would make much of this API mess instantly disappear.]

Ditto for XRC domains.

I think the flow works best for apps this way: generally apps are being written that can handle only one domain, so they should get the domain back from a getaddrinfo call with 0 (NULL) hints and then reuse that domain in all future calls for secondary connections.

> I agree with Jason that we can still change the newer calls.
> In this case, the problem isn't limited to XRC. The user will have
> issues just trying to specify the CQs that should be associated with
> the QP. Maybe the 'fix' here is to remove rdma_create_qp() from
> rdma_create_ep() -- which basically replaces that API with
> rdma_create_id2(**id, *res).

Maybe 3 functions, since you already have create_ep:

 create_id_ep - takes rdma_addrinfo, allocates PD/XRC domain, rdma_cm_id
 create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
 create_ep    - just calls both of the above

Very simplified (not sure on the names). The flow is then:

 // First QP
 hints = 0;
 rdma_getaddrinfo(.., hints, &res);
 rdma_create_id_ep(&id, res);
 // id->verbs, id->pd, id->xrcdomain are valid now
 rdma_create_qp_ep(id, res, attrs);

 // Second QP
 hints.pd = first_id->pd;
 hints.xrcdomain = first_id->xrcdomain;
 rdma_getaddrinfo(..., hints, &res);
 // res->pd/xrcdomain are == first_id's - no PD is allocated
 rdma_create_ep(&second_id, res, attrs);

How do you keep track of the lifetime of the pd though?

This also cleans up the confusing half-state of the rdma_cm_id with the legacy API, where id->verbs can be 0.

> > - The server side of the connection also needs an SRQ. It's not
> >   obvious whether it's the application or the rdma cm that should
> >   create that SRQ. And that SRQ number must be given to the client
> >   side, presumably in the private data.
>
> The desired mapping of XRC to the librdmacm isn't clear to me. For
> example, after 'connecting', is two-way communication possible
> (setting up INI/TGT pairs on both nodes), or is a connection only
> one-way (setup local INI to remote TGT)? Also, as you point out, how
> are SRQ values exchanged? Does the private data carry one SRQ value,
> all SRQ values for remote processes, none?

Well, I think the RDMA CM should do the minimum above what is defined for the CM protocol, so for XRC that is a unidirectional connect, and it only creates INI/TGT pairs. The required SRQ(s) will have to be set up by the user - I expect the typical use would be SRQs shared by multiple TGT QPs.
It looks to me like the main use model for this is peer-peer, so each side would establish its send half independently and message routing would be app specific. This means the CM initiator side should be the side that has the INI QP and the CM target side should be the side with the TGT QP - ?

Absent any standards, private data SRQ number exchange is protocol specific..

Jason
RE: [PATCH] rdma cm + XRC
> Maybe 3 functions, since you already have create_ep:
>
>  create_id_ep - takes rdma_addrinfo, allocates PD/XRC domain, rdma_cm_id
>  create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
>  create_ep    - just calls both of the above
>
> Very simplified (not sure on the names)

This is similar to what I was thinking, except I would just use the existing create_qp. I need to give adding PDs to rdma_addrinfo more thought. It seems that if AF_IB were ever accepted, then device specific addressing could be used, rather than relying on mapped values. As an alternative, we could define a function like:

 struct ibv_context *rdma_get_device(struct rdma_addrinfo *res);

Internally, this would just end up doing a lot of the same work that the create_id_ep mentioned above would do.

> How do you keep track of the lifetime of the pd though?

The librdmacm obtains the list of devices during initialization. It allocates a PD per device; these exist while the library is loaded. This could be optimized to release the PDs when nothing is left that references the devices, but that's not done now. If the user specifies the PD, it's up to them to track it.

> Well, I think the RDMA CM should do the minimum above what is defined
> for the CM protocol, so for XRC that is a unidirectional connect, and
> it only creates INI/TGT pairs. The required SRQ(s) will have to be set
> up by the user - I expect the typical use would be SRQs shared by
> multiple TGT QPs.
>
> It looks to me like the main use model for this is peer-peer, so each
> side would establish its send half independently and message routing
> would be app specific. This means the CM initiator side should be the
> side that has the INI QP and the CM target side should be the side
> with the TGT QP - ?

This is why I questioned what the desired behavior should be (from an API perspective). If the main usage model is peer-peer, then the librdmacm _could_ allocate and connect XRC INI and TGT QPs as pairs, so that bidirectional traffic was possible.
(For example, the rdma_cm could respond to a CM REQ with a CM REP and a CM REQ for the return path.) I'm not saying this wouldn't end up in an implementation mess. The easiest thing to do is just perform a unidirectional connect and leave the SRQs up to the user.

XRC just seems hideous to use from an application programmer's viewpoint, but it seems worth exploring whether an app could make use of it without significant changes from what they would do for RC QPs.

- Sean
Re: [PATCH] rdma cm + XRC
On Wed, Aug 11, 2010 at 05:04:00PM -0700, Hefty, Sean wrote:
> > Maybe 3 functions, since you already have create_ep:
> >
> >  create_id_ep - takes rdma_addrinfo, allocates PD/XRC domain, rdma_cm_id
> >  create_qp_ep - takes rdma_addrinfo, allocates QP, CQ, etc
> >  create_ep    - just calls both of the above
> >
> > Very simplified (not sure on the names)
>
> This is similar to what I was thinking, except I would just use the
> existing create_qp. I need to give adding PDs to rdma_addrinfo more
> thought. It seems that if AF_IB were ever accepted, then device
> specific addressing could be used, rather than relying on mapped
> values. As an alternative, we could define a function like:

Even so, the point of passing it into rdma_getaddrinfo is to restrict the device rdma_getaddrinfo selects; it doesn't matter that you can get back to the PD from the addrinfo if that PD doesn't have the resources you need attached to it.

Again, I'm thinking from the app perspective, where juggling multiple PDs isn't really done, and thus multiple connections on different HCAs are not supported by the app. This model should be supportable without introducing random failures when rdma_getaddrinfo returns things that use other devices.

>  struct ibv_context *rdma_get_device(struct rdma_addrinfo *res);
>
> Internally, this would just end up doing a lot of the same work that
> the create_id_ep mentioned above would do.

Indeed - so why bother? create_id_ep gets you the ID, bound to a device with a verbs handle. Follow-up calls can use the convention that 0 for the PD means 'use the global default PD'; otherwise an app can allocate a new PD, or find an existing one using the verbs handle provided.

> It looks to me like the main use model for this is peer-peer, so each
> side would establish its send half independently and message routing
> would be app specific. This means the CM initiator side should be the
> side that has the INI QP and the CM target side should be the side
> with the TGT QP - ?
> This is why I questioned what the desired behavior should be (from an
> API perspective). If the main usage model is peer-peer, then the
> librdmacm _could_ allocate and connect XRC INI and TGT QPs as pairs,
> so that bidirectional traffic was possible. (For example, the rdma_cm
> could respond to a CM REQ with a CM REP and a CM REQ for the return
> path.)

If the model is peer-peer, having half-duplex connections is ideal. Peer-peer would mean that either side could start setting things up at any time, and dealing with the inherent races is a huge, messy problem. Having each side control when its half of the connection starts up cleans things up tremendously.

> I'm not saying this wouldn't end up in an implementation mess. The
> easiest thing to do is just perform a unidirectional connect and leave
> the SRQs up to the user.
>
> XRC just seems hideous to use from an application programmer's
> viewpoint, but it seems worth exploring whether an app could make use
> of it without significant changes from what they would do for RC QPs.

Well.. I admit I can't think of many uses for XRC - but I can think of two reasonable approaches for non peer-peer apps:

- Create an RC QP and build the XRC QPs by exchanging messages within the RC QP - avoiding the CM protocol entirely. XRC would then be a secondary channel to the RC channel, probably for buffer size sorting purposes.

- Create the TGT QP on the initiator side, pass its info in the private message, and do a double QP setup. More complex - does the rdmacm provide hooks to do this?

Jason
RE: [PATCH] rdma cm + XRC
> Even so, the point of passing it into rdma_getaddrinfo is to restrict
> the device rdma_getaddrinfo selects; it doesn't matter that you can
> get back to the PD from the addrinfo if that PD doesn't have the
> resources you need attached to it.
>
> Again, I'm thinking from the app perspective, where juggling multiple
> PDs isn't really done, and thus multiple connections on different HCAs
> are not supported by the app. This model should be supportable without
> introducing random failures when rdma_getaddrinfo returns things that
> use other devices.

But why select the PD as the restriction, rather than just the device? If rdma_getaddrinfo calls into a service to obtain some of its information, then at some point an address must be sufficient. What about just passing in a device GUID as the ai_src_addr? Or, I guess we could add the device to rdma_getaddrinfo:

 rdma_getaddrinfo(struct ibv_context *verbs, char *node, char *service,
                  struct rdma_addrinfo *hints, struct rdma_addrinfo **res);

> - Create the TGT QP on the initiator side, pass its info in the
>   private message, and do a double QP setup. More complex - does the
>   rdmacm provide hooks to do this?

I thought about this as well. The librdmacm could steal more of the private data for this - just giving less to the user when connecting using XRC. Right now, the rdma_cm doesn't provide anything to help connect XRC QPs, so we can do what makes the most sense. I still don't know how the apps exchange SRQs, which is really part of the entire connection process...

- Sean
Re: [PATCH] rdma cm + XRC
On Wed, Aug 11, 2010 at 08:30:53PM -0700, Hefty, Sean wrote:
> But why select the PD as the restriction, rather than just the device?

Well.. that is ok too, but I feel the PD is more general; for instance, with soft iwarp/iboe there is no reason the PD is tied to a single device - devices are tied to netdevs, but the PD can be shared between them all. There are some missing APIs here and there that muddle this (ie there is no API to set the device a QP is associated with), but I'd rather not see more added :)

> If rdma_getaddrinfo calls into a service to obtain some of its
> information, then at some point an address must be sufficient. What
> about just passing in a device GUID as the ai_src_addr?

Again, the PD is more general.. There are lots of things rdma_getaddrinfo can support; mapping to a GUID seems restrictive - not sure how GUID mapping works with iwarp/etc when you consider vlans, for instance.

> Or, I guess we could add the device to rdma_getaddrinfo:

I'd rather see it in hints, but it isn't a bad idea to have something like that, even if it is in addition to the PD.

> > - Create the TGT QP on the initiator side, pass its info in the
> >   private message, and do a double QP setup. More complex - does the
> >   rdmacm provide hooks to do this?
>
> I thought about this as well. The librdmacm could steal more of the
> private data for this - just giving less to the user when connecting
> using XRC.
> Right now, the rdma_cm doesn't provide anything to help connect XRC
> QPs, so we can do what makes the most sense. I still don't know how
> the apps exchange SRQs, which is really part of the entire connection
> process...

Well, whatever it is, the whole private data thing has to be optional, because it is really protocol specific.

I assume the typical model for XRC would be demand-started peer-peer. So.. app A starts, looks for friends, finds none, goes to sleep. App B starts, finds A, sets up a channel from B->A and sends 'hello! Here are my SRQs..'. App A gets the message, builds a reply, looks at the destination of app B, finds no existing XRC send channel, so it sets one up, sends 'hello! here are my SRQs' along with the ack'ing reply to B's hello, and presto: bi-directional communication with peer-peer initiation.

I'm assuming a single SRQ is exchanged during the CM process, and the total number of SRQs exceeds the size of the CM private data - but I could also imagine packing them all into the private data if an app wanted. I guess the key thing is that only a single SRQ is necessary to start, and the target side switches to an initiator role to set up its side of the connection.

It is actually pretty good if peer-peer is your app design.. If you tried to do the same with duplex channels you'd end up with races creating duplicate send sides that you'd want to tear down to save resources.

Jason
Re: [PATCH] rdma cm + XRC
Hello Sean,

On 08/09/2010 03:53 PM, Hefty, Sean wrote:
> > This allows the rdma ucm to establish an XRC connection between two
> > nodes. Most of the changes are related to modify_qp, since the API
> > is different depending on whether the QP is on the send or receive
> > side. To create an XRC receive QP, cap.max_send_wr must be set to 0.
> > Conversely, to create the send XRC QP, that attribute must be
> > non-zero.
>
> I need to give XRC support to the librdmacm more thought, but here are
> at least the initial concerns:
>
> - XRC support upstream (kernel and user space) is still pending. (I
>   can start a librdmacm branch for XRC support.)
> - Changes are needed to the kernel rdma_cm. We could start submitting
>   patches against Roland's xrc branch for these.
> - Please update to the latest librdmacm tree. More specifically,
>   rdma_getaddrinfo should support XRC as well.

The general parameters would be the same as for RC. Should we create a new ai_flag? Or a new port space?

Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and the new APIs?

> In general, I'd like to find a way to add XRC support to the librdmacm
> that makes things as simple for the user as possible.

Besides the need to correctly set cap.max_send_wr, the user API is unchanged.

New patch attached.

diff --git a/include/rdma/rdma_cma.h b/include/rdma/rdma_cma.h
index d17ef88..d18685b 100644
--- a/include/rdma/rdma_cma.h
+++ b/include/rdma/rdma_cma.h
@@ -125,6 +125,8 @@ struct rdma_cm_id {
 	struct ibv_cq *send_cq;
 	struct ibv_comp_channel *recv_cq_channel;
 	struct ibv_cq *recv_cq;
+	struct ibv_xrc_domain *xrc_domain;
+	uint32_t xrc_rcv_qpn;
 };
 
 enum {
diff --git a/man/rdma_create_qp.3 b/man/rdma_create_qp.3
index 9d2de76..659e033 100644
--- a/man/rdma_create_qp.3
+++ b/man/rdma_create_qp.3
@@ -39,6 +39,10 @@
 a send or receive completion queue is not specified, then a CQ will be
 allocated by the rdma_cm for the QP, along with corresponding completion
 channels.  Completion channels and CQ data created by the rdma_cm are
 exposed to the user through the rdma_cm_id structure.
+.P
+To create an XRC receive QP, and in addition to the XRC QP type,
+ibv_qp_init_attr.cap.max_send_wr must be set to 0. Conversely, to
+create the XRC send QP, that attribute must be non-zero.
 .SH SEE ALSO
 rdma_bind_addr(3), rdma_resolve_addr(3), rdma_destroy_qp(3),
 ibv_create_qp(3), ibv_modify_qp(3)
diff --git a/src/cma.c b/src/cma.c
index a4fd574..b4eec77 100755
--- a/src/cma.c
+++ b/src/cma.c
@@ -948,12 +948,29 @@ static int rdma_init_qp_attr(struct rdma_cm_id *id, struct ibv_qp_attr *qp_attr,
 	return 0;
 }
 
+static int rdma_modify_qp(struct rdma_cm_id *id,
+			  struct ibv_qp_attr *qp_attr,
+			  int qp_attr_mask)
+{
+	int ret;
+
+	if (id->qp)
+		ret = ibv_modify_qp(id->qp, qp_attr, qp_attr_mask);
+	else if (id->xrc_domain)
+		ret = ibv_modify_xrc_rcv_qp(id->xrc_domain, id->xrc_rcv_qpn,
+					    qp_attr, qp_attr_mask);
+	else
+		ret = EINVAL;
+
+	return ret;
+}
+
 static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 {
 	struct ibv_qp_attr qp_attr;
 	int qp_attr_mask, ret;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return ERR(EINVAL);
 
 	/* Need to update QP attributes from default values. */
@@ -962,7 +979,7 @@ static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 	if (ret)
 		return ret;
 
-	ret = ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask);
+	ret = rdma_modify_qp(id, &qp_attr, qp_attr_mask);
 	if (ret)
 		return ERR(ret);
 
@@ -973,7 +990,7 @@ static int ucma_modify_qp_rtr(struct rdma_cm_id *id, uint8_t resp_res)
 	if (resp_res != RDMA_MAX_RESP_RES)
 		qp_attr.max_dest_rd_atomic = resp_res;
 
-	return rdma_seterrno(ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask));
+	return rdma_seterrno(rdma_modify_qp(id, &qp_attr, qp_attr_mask));
 }
 
 static int ucma_modify_qp_rts(struct rdma_cm_id *id, uint8_t init_depth)
@@ -988,29 +1005,29 @@ static int ucma_modify_qp_rts(struct rdma_cm_id *id, uint8_t init_depth)
 	if (init_depth != RDMA_MAX_INIT_DEPTH)
 		qp_attr.max_rd_atomic = init_depth;
 
-	return rdma_seterrno(ibv_modify_qp(id->qp, &qp_attr, qp_attr_mask));
+	return rdma_seterrno(rdma_modify_qp(id, &qp_attr, qp_attr_mask));
 }
 
 static int ucma_modify_qp_sqd(struct rdma_cm_id *id)
 {
 	struct ibv_qp_attr qp_attr;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return 0;
 
 	qp_attr.qp_state = IBV_QPS_SQD;
-	return rdma_seterrno(ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE));
+	return rdma_seterrno(rdma_modify_qp(id, &qp_attr, IBV_QP_STATE));
 }
 
 static int ucma_modify_qp_err(struct rdma_cm_id *id)
 {
 	struct ibv_qp_attr qp_attr;
 
-	if (!id->qp)
+	if (!id->qp && !id->xrc_domain)
 		return 0;
 
 	qp_attr.qp_state = IBV_QPS_ERR;
-	return rdma_seterrno(ibv_modify_qp(id->qp, &qp_attr, IBV_QP_STATE));
+	return rdma_seterrno(rdma_modify_qp(id, &qp_attr, IBV_QP_STATE));
 }
 
 static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t port_num,
@@ -1029,7 +1046,7 @@ static int ucma_find_pkey(struct cma_device *cma_dev, uint8_t
RE: [PATCH] rdma cm + XRC
> > - XRC support upstream (kernel and user space) is still pending. (I
> >   can start a librdmacm branch for XRC support.)
> > - Changes are needed to the kernel rdma_cm. We could start
> >   submitting patches against Roland's xrc branch for these.
> > - Please update to the latest librdmacm tree. More specifically,
> >   rdma_getaddrinfo should support XRC as well.
>
> The general parameters would be the same as for RC. Should we create a
> new ai_flag? Or a new port space?

There's an ai_qp_type field available. I think the RDMA TCP port space would work.

> Is it really necessary to support rdma_getaddrinfo, rdma_create_ep and
> the new APIs?

I think so, yes. At least XRC needs to be handled, even if some of the calls just fail as unsupported.

> In general, I'd like to find a way to add XRC support to the librdmacm
> that makes things as simple for the user as possible.
>
> Besides the need to correctly set cap.max_send_wr, the user API is
> unchanged.

I'm also concerned that the kernel must format the IB CM messages correctly when XRC is in use. The kernel currently formats the messages assuming that the QP is RC.

- Sean
Re: [PATCH] rdma cm + XRC
On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:
> > The general parameters would be the same as for RC. Should we create
> > a new ai_flag? Or a new port space?
>
> There's an ai_qp_type field available. I think the RDMA TCP port space
> would work.

Not sure the port space matters at all? Is there anything additional CM information for XRC other than requesting an XRC QP type? (XRCSRQ or something?)

> > Is it really necessary to support rdma_getaddrinfo, rdma_create_ep
> > and the new APIs?
>
> I think so, yes. At least XRC needs to be handled, even if some of the
> calls just fail as unsupported.

I'd like to see a strong rationale for leaving any of the new API unsupported for XRC - IMHO it should all be doable. The new API is supposed to be simplifying; we want people to use it..

Jason
RE: [PATCH] rdma cm + XRC
> > There's an ai_qp_type field available. I think the RDMA TCP port
> > space would work.
>
> Not sure the port space matters at all? Is there anything additional
> CM information for XRC other than requesting an XRC QP type? (XRCSRQ
> or something?)

It's nothing huge:

Modifications to Table 99:
* Responder Resources field, Values column: 0 for XRC
* Transport Service Type field, Description column: See Section 14.6.3.1 Transport Service Type.
* Retry Count field, Values column: 0 for XRC
* RNR Retry Count field, Values column: 0 for XRC
* SRQ field, Values column: 0 for XRC
* Change (reserved) field at position byte 51, bit 5 to:
  * Field: Extended Transport Type
  * Description: See Section 14.6.3.2 Extended Transport Type
  * Used for Purpose: C
  * Byte [Bit] Offset: 51 [5]
  * Length, Bits: 3

Modifications to Table 103:
* Initiator Depth field, Values column: 0 for XRC
* End-to-End Flow Control field, Values column: 0 for XRC
* SRQ field, Values column: 1 for XRC

But those fields should still be set correctly.

- Sean
Re: [PATCH] rdma cm + XRC
On 08/10/2010 12:14 PM, Jason Gunthorpe wrote:
> On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:
> > > The general parameters would be the same as for RC. Should we
> > > create a new ai_flag? Or a new port space?
> >
> > There's an ai_qp_type field available. I think the RDMA TCP port
> > space would work.
>
> Not sure the port space matters at all? Is there anything additional
> CM information for XRC other than requesting an XRC QP type? (XRCSRQ
> or something?)

Creating a send or a receive XRC QP uses a different API (ibv_create_qp vs ibv_create_xrc_rcv_qp), so I used the max_send_wr capability attribute to discriminate between the two cases. That's the only visible change to the API. On 7/30, I posted a patch to perftest rdma_bw to show how it's used.

> > Is it really necessary to support rdma_getaddrinfo, rdma_create_ep
> > and the new APIs?
>
> I think so, yes. At least XRC needs to be handled, even if some of the
> calls just fail as unsupported.
>
> I'd like to see a strong rationale for leaving any of the new API
> unsupported for XRC - IMHO it should all be doable. The new API is
> supposed to be simplifying; we want people to use it..

No rationale, besides that someone has to write some code :)

Regards,
Frank.
Re: [PATCH] rdma cm + XRC
On 08/10/2010 12:14 PM, Jason Gunthorpe wrote:
> On Tue, Aug 10, 2010 at 09:59:50AM -0700, Hefty, Sean wrote:
> > > The general parameters would be the same as for RC. Should we
> > > create a new ai_flag? Or a new port space?
> >
> > There's an ai_qp_type field available. I think the RDMA TCP port
> > space would work.
>
> Not sure the port space matters at all? Is there anything additional
> CM information for XRC other than requesting an XRC QP type? (XRCSRQ
> or something?)
>
> > > Is it really necessary to support rdma_getaddrinfo, rdma_create_ep
> > > and the new APIs?
> >
> > I think so, yes. At least XRC needs to be handled, even if some of
> > the calls just fail as unsupported.
>
> I'd like to see a strong rationale for leaving any of the new API
> unsupported for XRC - IMHO it should all be doable. The new API is
> supposed to be simplifying; we want people to use it..

It seems the new API has too many constraints for XRC. There are a couple things that don't fit:

- XRC needs a domain, which must be created before creating the QP, but after we know the device to use. In addition, it also needs a file descriptor. The application may want to use a different fd depending on the device. Currently the domain can only be created in the middle of rdma_create_ep().

- The server side of the connection also needs an SRQ. It's not obvious whether it's the application or the rdma cm that should create that SRQ. And that SRQ number must be given to the client side, presumably in the private data.

Frank.
Re: [PATCH] rdma cm + XRC
On Tue, Aug 10, 2010 at 04:05:42PM -0500, frank zago wrote:
> It seems the new API has too many constraints for XRC. There are a
> couple things that don't fit:

I'll try to take a more careful look at this later, but I just want to say that the new APIs are so new that we could still change them - not supporting XRC seems like an API design failing that might haunt us later? Keep in mind that rdma_getaddrinfo is the scheme to be used for CM address resolution scalability, so it seems natural that anyone using the RDMA CM and XRC would want to use both together.

> - XRC needs a domain, which must be created before creating the QP,
>   but after we know the device to use. In addition, it also needs a
>   file descriptor. The application may want to use a different fd
>   depending on the device. Currently the domain can only be created in
>   the middle of rdma_create_ep().

Well.. the XRC domain needs to be an input to create_ep just like the PD :( In looking at how this API turned out, maybe the PD should have been carried in the rdma_addrinfo? Certainly I would put the XRC domain in there.. Recall my original comments about the PD being used to restrict device selection in rdma_getaddrinfo.

Not sure what the FD is about (it's been a while since the libibverbs XRC patches were posted)?

> - The server side of the connection also needs an SRQ. It's not
>   obvious whether it's the application or the rdma cm that should
>   create that SRQ. And that SRQ number must be given to the client
>   side, presumably in the private data.

All rdma_create_ep could do is set up the XRC INI QP and XRC TGT QP; the XRC SRQ is associated with the XRC domain, so it has to be managed by the app. The private data is passed into rdma_connect, after create_ep, so there seems to be no problem there; the app can format up the SRQ number according to the CM private data protocol it is using..
Jason
RE: [PATCH] rdma cm + XRC
> Well.. the XRC domain needs to be an input to create_ep just like the PD :( In looking at how this API turned out, maybe the PD should have been carried in the rdma_addrinfo? Certainly I would put the XRC domain in there.. Recall my original comments about the PD being used to restrict device selection in rdma_getaddrinfo.

Personally, if I had it to do over, I don't know that I would have exposed the PD through the librdmacm at all. Does anyone ever allocate more than one per device?

I feel similarly about the XRC domain. Is there any real reason to expose it? What if we just defined a 1:1 relationship between PDs and XRC domains, or between XRC domains and XRC TGT QPs?

- Sean
Re: [PATCH] rdma cm + XRC
On Tue, Aug 10, 2010 at 04:18:57PM -0700, Hefty, Sean wrote:
>> Well.. the XRC domain needs to be an input to create_ep just like the PD :( In looking at how this API turned out, maybe the PD should have been carried in the rdma_addrinfo? Certainly I would put the XRC domain in there.. Recall my original comments about the PD being used to restrict device selection in rdma_getaddrinfo.
> Personally, if I had it to do over, I don't know that I would have exposed the PD through the librdmacm at all. Does anyone ever allocate more than one per device?

That isn't a half bad idea, I guess, and if it was in rdma_addrinfo it could be 0 = use global PD/XRC domain, and 99% of apps could just do that.. Maybe you should just go ahead and do that? It would not be hard to do and would even keep ABI (but not API) compatibility. Now would be the time :)

> I feel similarly about the XRC domain. Is there any real reason to expose it? What if we just defined a 1:1 relationship between PDs and XRC domains, or between XRC domains and XRC TGT QPs?

Near as I can tell it serves the same purpose as the PD: to provide a form of security within a single process.. I could imagine things like storage apps wanting to use them - MRs/etc need to be carefully compartmentalized if you do not entirely trust your peer.

Jason
RE: [PATCH] rdma cm + XRC
> This allows the rdma ucm to establish an XRC connection between two nodes. Most of the changes are related to modify_qp, since the API is different depending on whether the QP is on the send or receive side. To create an XRC receive QP, cap.max_send_wr must be set to 0. Conversely, to create the send XRC QP, that attribute must be non-zero.

I need to give XRC support in the librdmacm more thought, but here are at least the initial concerns:

- XRC support upstream (kernel and user space) is still pending. (I can start a librdmacm branch for XRC support.)
- Changes are needed to the kernel rdma_cm. We could start submitting patches against Roland's xrc branch for these.
- Please update to the latest librdmacm tree. More specifically, rdma_getaddrinfo should support XRC as well.

In general, I'd like to find a way to add XRC support to the librdmacm that makes things as simple for the user as possible.

- Sean
Re: [PATCH] rdma cm + XRC
Hello Frank,

Thank you for these patches! We are working on a user-mode IPC based on XRC; we will try these patches.

frank zago wrote:
> Hello,
> This allows the rdma ucm to establish an XRC connection between two nodes. Most of the changes are related to modify_qp, since the API is different depending on whether the QP is on the send or receive side. To create an XRC receive QP, cap.max_send_wr must be set to 0. Conversely, to create the send XRC QP, that attribute must be non-zero.
> Regards,
> Frank