Roland: Steve and I came to the same conclusion on the airplane ride back to Austin. Whereas plain old TCP/IP selects a device at the bottom of the stack, RDMA transports must select the device at the top, because pre-connect resources must be allocated and these resources are associated with a particular device.
I think you've absolutely nailed the active side (by the way, I think the ib_at_route_by_ip service already performs the necessary routing function). The listen side, however, I think needs a little tweaking. It would be beneficial if the client could specify either an IP address and port to listen on (effectively selecting a particular device) or a wildcard (all RDMA devices). An NFS server is an example of the latter. This is trivial to do by providing an address to the listen call, where '0' represents a wildcard.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general@openib.org
> Subject: [openib-general] RDMA connection and address translation API
>
> At the OpenIB workshop on Monday, we had some discussion
> about a high-level transport-neutral API for connection
> handling. After giving the topic some more thought, I've
> come to the conclusion that neither the kDAPL API nor the new
> API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch
> what I believe is the correct API.
>
> The new API that we looked at was essentially the following
> (I'm recreating this from memory, so I apologize if I
> misrepresent it):
>
>     listen(local_ip_address, service_id, listen_callback)
>     connect(local_qp, remote_ip_address, qos, service_id,
>             private_data, connect_callback)
>
> We already discussed the problem with having the listen
> callback pass the consumer a remote source address -- doing
> this requires the connection handling module to do an ATS
> reverse lookup in the IB case, which the consumer might not
> want. I think there's agreement that the correct thing here
> is for the listen callback to pass a transport address to the
> consumer and provide a function that the consumer can call to
> perform an ATS reverse lookup if desired. This isn't a major
> problem and can be dealt with.
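To make the wildcard listen suggested at the top of this message concrete, here is a minimal user-space sketch in C of how incoming requests might be matched against listeners. Everything in it is an assumption for illustration only: the names rdma_listener, rdma_listen_matches, rdma_find_listener and RDMA_ADDR_ANY are hypothetical and not part of any existing OpenIB interface, addresses are IPv4 in host byte order, and preferring an exact address match over a wildcard is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: an address of 0 means "listen on all RDMA devices". */
#define RDMA_ADDR_ANY 0u

/* Hypothetical listener record; not an existing OpenIB structure. */
struct rdma_listener {
    uint32_t local_ip;    /* IPv4, host byte order; RDMA_ADDR_ANY = wildcard */
    uint64_t service_id;  /* service identifier the consumer listens on */
};

/* Return 1 if a request addressed to (dst_ip, service_id) matches l. */
static int rdma_listen_matches(const struct rdma_listener *l,
                               uint32_t dst_ip, uint64_t service_id)
{
    assert(l != NULL);
    if (l->service_id != service_id)
        return 0;
    return l->local_ip == RDMA_ADDR_ANY || l->local_ip == dst_ip;
}

/*
 * Pick the listener for an incoming request, preferring an exact address
 * match over a wildcard (an assumed policy). Returns NULL if none matches.
 */
static const struct rdma_listener *
rdma_find_listener(const struct rdma_listener *tbl, size_t n,
                   uint32_t dst_ip, uint64_t service_id)
{
    const struct rdma_listener *wild = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!rdma_listen_matches(&tbl[i], dst_ip, service_id))
            continue;
        if (tbl[i].local_ip == dst_ip)
            return &tbl[i];      /* exact match wins */
        if (wild == NULL)
            wild = &tbl[i];      /* remember the first wildcard */
    }
    return wild;
}
```

With a table holding one wildcard listener and one bound to a specific address on the same service, a request to the bound address reaches the bound listener, and a request to any other local address falls through to the wildcard, which is the behavior an NFS-style server would rely on.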
>
> However, there's another problem with trying to lump address
> translation and connection into a single "connect" call, and
> this problem looks fundamental and fatal to me. The connect
> call takes a QP pointer, but to create a QP the consumer
> needs to know which local device to use. However, the
> consumer doesn't know which device to use until the
> destination address has been resolved to a route, including a
> local interface.
>
> As far as I can tell, kDAPL punts on this and simply requires
> the consumer to handle the route lookup itself before calling
> dat_ep_connect(). It seems that current kDAPL consumers
> similarly punt on this issue: the iSER initiator and the
> NFS-RDMA client both just use a single device which is
> statically discovered at init time.
>
> It seems that the kDAPL connection model has a serious flaw,
> in that it pushes the complexity of route lookup into the
> consumer. Further, we have strong evidence that this routing
> code is hard to write and that consumers will just ignore
> this complexity and hard-code solutions that don't work under
> all configurations.
>
> With this in mind, I believe that the connection API needs to
> be something more like the following:
>
>     rdma_resolve_address():
>       inputs: dest IP address, qos, npaths,
>               done callback, opaque context
>       done callback params: status, local RDMA device,
>               RDMA transport address, context
>
> This function starts the process of resolving an IP address to
> an RDMA device and address. When the resolution is complete,
> the callback is called with a status. If the status is
> "success" then the callback also gets the device pointer and
> transport address (as well as the original context that the
> consumer passed in).
>
> The "RDMA transport address" type is a union containing
> transport-dependent data. In the IB case, it's all of the
> SGID, DGID, SLID, DLID, SL etc. that we know and love. In the
> iWARP case, it's the source IP, destination IP and QOS.
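For readers who want to see the shape of this, the following is a rough user-space sketch in C of the transport address union and the resolve callback described above. It is only an illustration under stated assumptions: every type and field name is hypothetical, the iWARP arm is IPv4-only, and the resolver is a stub that completes synchronously with a canned iWARP result, whereas the function described above would complete asynchronously after a real route lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; this is not an existing OpenIB API. */
enum rdma_transport_type { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

struct rdma_ib_addr {
    uint8_t  sgid[16], dgid[16];
    uint16_t slid, dlid;
    uint8_t  sl;
};

struct rdma_iwarp_addr {
    uint32_t src_ip, dst_ip;   /* IPv4 only in this sketch */
    uint8_t  qos;
};

/* Tagged union: consumers can pass it around without looking inside. */
struct rdma_transport_addr {
    enum rdma_transport_type type;
    union {
        struct rdma_ib_addr    ib;
        struct rdma_iwarp_addr iwarp;
    } u;
};

struct rdma_device { const char *name; };  /* stand-in for a real device */

typedef void (*rdma_resolve_done_t)(int status, struct rdma_device *dev,
                                    const struct rdma_transport_addr *addr,
                                    void *context);

/*
 * Stub resolver: calls the callback synchronously with a canned iWARP
 * result. A real one would complete asynchronously after a route lookup.
 */
static int rdma_resolve_address(uint32_t dest_ip, uint8_t qos, int npaths,
                                rdma_resolve_done_t done, void *context)
{
    static struct rdma_device fake_dev = { "fake0" };
    struct rdma_transport_addr addr;

    if (done == NULL || npaths < 1)
        return -EINVAL;

    memset(&addr, 0, sizeof(addr));   /* src_ip stays 0: no real lookup here */
    addr.type = RDMA_TRANSPORT_IWARP;
    addr.u.iwarp.dst_ip = dest_ip;
    addr.u.iwarp.qos = qos;
    done(0, &fake_dev, &addr, context);
    return 0;
}

/* Example consumer context and callback: copy out what resolution returned. */
struct resolve_result {
    int status;
    struct rdma_device *dev;
    struct rdma_transport_addr addr;
};

static void save_result(int status, struct rdma_device *dev,
                        const struct rdma_transport_addr *addr, void *context)
{
    struct resolve_result *r = context;

    assert(r != NULL);
    r->status = status;
    r->dev = dev;
    if (status == 0)
        r->addr = *addr;   /* only valid on success */
}
```

A consumer would call rdma_resolve_address(dest, qos, 1, save_result, &result) and, once the callback reports success, use result.dev to allocate its per-device resources and keep result.addr as an opaque value.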
>
> npaths can be either 1 or 2 in the IB case; if it's 2, then
> the resolver will try to find a primary and alternate path for
> APM. In the iWARP case, I guess npaths will always be 1, and
> I guess anyone who wants to use iWARP over multihomed SCTP
> will probably have to use some lower-level API.
>
> By the way, we may also have to have the option of passing in
> a local netdev so that we can handle link-local IPv6
> addresses. There may be other cases I haven't thought of yet.
> I just hope we can avoid going all the way to the horror of
> the getaddrinfo() API.
>
> I also hope we can agree to use IPoIB ARP to resolve the
> address in the IB case; having a flag or some other hack in
> the API to expose the option of ATS seems unacceptably ugly.
>
>     rdma_connect():
>       inputs: local QP, RDMA transport address, destination service,
>               private data, timeout, event callback, opaque context
>
> This function takes the resolved address and actually
> connects.
>
> I'm not sure how we want to abstract the IB service vs. iWARP
> TCP port number difference. I guess it's OK to have iWARP
> consumers stick their (16-bit) port number in a 64-bit
> parameter, even if it's not the prettiest API.
>
> To head off the knee-jerk objection: this API does NOT
> require any transport-specific code in consumers (unless a
> particular consumer WANTS to look inside the RDMA transport
> address). Code to connect would be as simple as:
>
>     rdma_resolve_address(...);
>     /* wait for resolution */
>     ib_create_qp(...); /* use device pointer we got from
>                           rdma_resolve_address() */
>     rdma_connect(...); /* pass transport address we got from
>                           rdma_resolve_address() */
>     /* wait for connection to finish... */
>
> The listen side is even simpler:
>
>     rdma_listen():
>       inputs: local service, event callback, consumer context
>
> Wait for connection requests and pass events to the consumer's
> callback. I'm not sure if/how we want to support binding to
> a particular IP address.
> The current IB CM in Linux doesn't
> support binding a listen to a single device or port, and even
> if it did it's not clear how to handle binding to one IP
> address when a port has more than one IP.
>
> I guess the event callback would receive a device pointer and
> the same RDMA transport address union I talked about above
> when discussing address resolution.
>
> It would be possible to have another function like
> rdma_getpeername() that takes the transport address and
> returns a source IP address. In the IB case this would do an
> ATS reverse lookup. However, I hate this idea. iSER already
> uses the CM private data to pass the source IP in the IB case,
> and I would much rather fix NFS/RDMA to do the same thing (so
> we can just kill ATS as an address resolution method).
>
>  - R.
> _______________________________________________
> openib-general mailing list
> openib-general@openib.org
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general