On Wed, Aug 11, 2010 at 03:22:45PM -0700, Hefty, Sean wrote:
> > It seems the new API has too many constraints for XRC. There are a couple
> > things that don't fit:
> >
> > - XRC needs a domain, which must be created before creating the QP, but
> >   after we know the device to use. In addition it also needs a file
> >   descriptor. The application may want to use a different fd depending
> >   on the device. Currently the domain can only be created in the middle
> >   of rdma_create_ep().
>
> This looks like a gap in the APIs. There's no easy way to associate
> the data returned by rdma_addrinfo to a specific ibv_device. Part of
> the issue is that rdma_addrinfo may not have an ai_src_addr.
> gurgle...
This is why I liked the notion of passing in the PD. This restricts
getaddrinfo to doing something that is compatible with the PD, and when
the rdma_cm_id is created and bound, it is bound to a device, selected
by getaddrinfo or the kernel, that is compatible with the given PD.

[** I looked at this for a bit, and I couldn't convince myself the
current implementation doesn't have this gap either. The rdma_cm_id is
bound to a device based on IP addresses, but it can be bound without
specifying a PD - so there really is no guarantee that the PD you want
to use will be compatible with the device the kernel selects. I bet
this means most apps using the RDMA CM will explode if you do something
like bonding IPoIB across two HCAs.]

[The other view is that exporting per-device domains to userspace means
the kernel has walked away from its role as HW resource virtualizer.
Why can't a PD be global, with the kernel swapping it into HW as
necessary? That would make much of this API mess instantly disappear.]

Ditto for XRC domains.

I think the flow works best for apps this way: generally, apps are
being written that can handle only one domain, so they should get the
domain through an initial (zeroed-hints) call to getaddrinfo and then
reuse that domain in all future calls for secondary connections.

> I agree with Jason that we can still change the newer calls. In
> this case, the problem isn't limited to XRC. The user will have
> issues just trying to specify the CQs that should be associated with
> the QP. Maybe the 'fix' here is to remove rdma_create_qp() from
> rdma_create_ep() -- which basically replaces that API with
> rdma_create_id2(**id, *res).

Maybe 3 functions, since you already have create_ep:

 create_id_ep - takes rdma_addrinfo; allocates the PD/XRC domain and the rdma_cm_id
 create_qp_ep - takes rdma_addrinfo; allocates the QP, CQs, etc.
 create_ep    - just calls both of the above
Very simplified (not sure on the names), the flow is then:

 // First QP
 hints = 0;
 rdma_getaddrinfo(.., &hints, &res);
 rdma_create_id_ep(&id, &res);
 // id->verbs, id->pd, id->xrcdomain are valid now
 rdma_create_qp_ep(id, res, &attrs);

 // Second QP
 hints.pd = first_id->pd;
 hints.xrcdomain = first_id->xrcdomain;
 rdma_getaddrinfo(..., &hints, &res);
 // res->pd/xrcdomain are == first_id's, so no new PD is allocated
 rdma_create_ep(&second_id, &res, &attrs);

How do you keep track of the lifetime of the PD, though?

This also cleans up the confusing half-state of the rdma_cm_id with the
legacy API, where id->verbs can be 0.

> > - The server side of the connection also needs an SRQ. It's not
> >   obvious whether it's the application or the rdma cm that should
> >   create that SRQ. And that SRQ number must be given to the client
> >   side, presumably in the private data.
>
> The desired mapping of XRC to the librdmacm isn't clear to me. For
> example, after 'connecting' is two-way communication possible
> (setting up INI/TGT pairs on both nodes), or is a connection only
> one-way (setup local INI to remote TGT)? Also, as you point out,
> how are SRQ values exchanged? Does private data carry one SRQ
> value, all SRQ values for remote processes, none?

Well, I think the RDMA CM should do the minimum above what is defined
for the CM protocol, so for XRC that is a unidirectional connect and it
only creates INI/TGT pairs. The required SRQ(s) will have to be set up
by the user - I expect the typical use would be SRQs shared by multiple
TGT QPs.

It looks to me like the main use model for this is peer-to-peer, so
each side would establish its send half independently and message
routing would be app specific. This means the CM initiator side should
be the side that has the INI QP and the CM target side should be the
side with the TGT - ?

Absent any standards, private data SRQ number exchange is protocol
specific...
Jason