At the OpenIB workshop on Monday, we had some discussion about a high-level transport-neutral API for connection handling. After giving the topic some more thought, I've come to the conclusion that neither the kDAPL API nor the new API that was presented is usable. In this email, I'll try to detail my reasoning and sketch what I believe is the correct API.
The new API that we looked at was essentially the following (I'm recreating this from memory, so I apologize if I misrepresent it):

    listen(local_ip_address, service_id, listen_callback)

    connect(local_qp, remote_ip_address, qos, service_id,
            private_data, connect_callback)

We already discussed the problem with having the listen callback pass the consumer a remote source address -- doing this requires the connection handling module to do an ATS reverse lookup in the IB case, which the consumer might not want. I think there's agreement that the correct thing here is for the listen callback to pass a transport address to the consumer and provide a function that the consumer can call to perform an ATS reverse lookup if desired. This isn't a major problem and can be dealt with.

However, there's another problem with trying to lump address translation and connection into a single "connect" call, and this problem looks fundamental and fatal to me. The connect call takes a QP pointer, but to create a QP the consumer needs to know which local device to use. However, the consumer doesn't know which device to use until the destination address has been resolved to a route, including a local interface.

As far as I can tell, kDAPL punts on this and simply requires the consumer to handle the route lookup itself before calling dat_ep_connect(). It seems that current kDAPL consumers similarly punt on this issue: the iSER initiator and the NFS-RDMA client both just use a single device which is statically discovered at init time.

It seems that the kDAPL connection model has a serious flaw, in that it pushes the complexity of route lookup into the consumer. Further, we have strong evidence that this routing code is hard to write and that consumers will just ignore this complexity and hard-code solutions that don't work under all configurations.
With this in mind, I believe that the connection API needs to be something more like the following:

rdma_resolve_address():

    inputs: dest IP address, qos, npaths, done callback, opaque context

    done callback params: status, local RDMA device, RDMA transport
    address, context

    This function starts the process of resolving an IP address to an RDMA device and address. When the resolution is complete, the callback is called with a status. If the status is "success" then the callback also gets the device pointer and transport address (as well as the original context that the consumer passed in).

    The "RDMA transport address" type is a union containing transport-dependent data. In the IB case, it's all of the SGID, DGID, SLID, DLID, SL etc. that we know and love. In the iWARP case, it's the source IP, destination IP and QOS.

    npaths can be either 1 or 2 in the IB case; if it's 2, then the resolver will try to find a primary and alternate path for APM. In the iWARP case, I guess npaths will always be 1, and I guess anyone who wants to use iWARP over multihomed SCTP will probably have to use some lower-level API.

    By the way, we may also have to have the option of passing in a local netdev so that we can handle link-local IPv6 addresses. There may be other cases I haven't thought of yet. I just hope we can avoid going all the way to the horror of the getaddrinfo() API.

    I also hope we can agree to use IPoIB ARP to resolve the address in the IB case; having a flag or some other hack in the API to expose the option of ATS seems unacceptably ugly.

rdma_connect():

    inputs: local QP, RDMA transport address, destination service,
    private data, timeout, event callback, opaque context

    This function takes the resolved address and actually connects.

    I'm not sure how we want to abstract the IB service vs. iWARP TCP port number difference. I guess it's OK to have iWARP consumers stick their (16-bit) port number in a 64-bit parameter, even if it's not the prettiest API.
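As a rough illustration of the rdma_resolve_address() step described above, here is a minimal user-space mock in C. Everything in it is an assumption made for the sake of the sketch: the struct and field names, the tag-plus-union layout of the transport address, the use of a 32-bit IPv4 destination, and the mock_dev device are all hypothetical, and the mock invokes the done callback synchronously where a real resolver would complete asynchronously.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names below are hypothetical; none of this is an existing kernel API. */

struct rdma_device { const char *name; };

enum rdma_transport { RDMA_TRANSPORT_IB, RDMA_TRANSPORT_IWARP };

/* IB case: the path fields the email lists (SGID, DGID, SLID, DLID, SL). */
struct rdma_ib_path {
    uint8_t  sgid[16];
    uint8_t  dgid[16];
    uint16_t slid;
    uint16_t dlid;
    uint8_t  sl;
};

/* iWARP case: source IP, destination IP and QOS (IPv4 assumed here). */
struct rdma_iwarp_addr {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint8_t  qos;
};

/* Transport-dependent union, wrapped with a tag so it can be inspected. */
struct rdma_transport_addr {
    enum rdma_transport transport;
    int npaths;                        /* 1, or 2 for primary + alternate */
    union {
        struct rdma_ib_path    ib[2];  /* [0] primary, [1] alternate (APM) */
        struct rdma_iwarp_addr iwarp;
    } u;
};

/* Done callback: status, local device, transport address, consumer context. */
typedef void (*rdma_resolve_done)(int status, struct rdma_device *dev,
                                  const struct rdma_transport_addr *addr,
                                  void *context);

/*
 * Mock resolver: pretends every destination is reachable over IB through a
 * single fake device and fills in made-up LIDs. Returns 0 if the request was
 * accepted, -1 if npaths is out of range (callback is not invoked then).
 */
int rdma_resolve_address(uint32_t dest_ip, uint8_t qos, int npaths,
                         rdma_resolve_done done, void *context)
{
    static struct rdma_device mock_dev = { "mock0" };
    struct rdma_transport_addr addr;
    int i;

    assert(done != NULL);
    if (npaths < 1 || npaths > 2)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.transport = RDMA_TRANSPORT_IB;
    addr.npaths = npaths;
    for (i = 0; i < npaths; i++) {
        addr.u.ib[i].slid = 1;                            /* fake local LID */
        addr.u.ib[i].dlid = (uint16_t)(dest_ip & 0xffff); /* fake remote LID */
        addr.u.ib[i].sl = qos;
    }

    done(0, &mock_dev, &addr, context);  /* synchronous only in this mock */
    return 0;
}

/* Example consumer: record what the resolver handed back. */
struct resolve_result {
    int called;
    int status;
    struct rdma_device *dev;
    struct rdma_transport_addr addr;
};

void save_result(int status, struct rdma_device *dev,
                 const struct rdma_transport_addr *addr, void *context)
{
    struct resolve_result *r = context;

    r->called = 1;
    r->status = status;
    r->dev = dev;
    if (status == 0)
        r->addr = *addr;  /* copy: addr is only valid during the callback */
}
```

A consumer following this sketch would create its QP on the device saved in resolve_result and hand the saved transport address to rdma_connect() without ever looking inside the union.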
To head off the knee-jerk objection: this API does NOT require any transport-specific code in consumers (unless a particular consumer WANTS to look inside the RDMA transport address). Code to connect would be as simple as:

    rdma_resolve_address(...);
    /* wait for resolution */
    ib_create_qp(...); /* use device pointer we got from rdma_resolve_address() */
    rdma_connect(...); /* pass transport address we got from rdma_resolve_address() */
    /* wait for connection to finish... */

The listen side is even simpler:

rdma_listen():

    inputs: local service, event callback, consumer context

    Wait for connection requests and pass events to the consumer's callback.

    I'm not sure if/how we want to support binding to a particular IP address. The current IB CM in Linux doesn't support binding a listen to a single device or port, and even if it did it's not clear how to handle binding to one IP address when a port has more than one IP.

    I guess the event callback would receive a device pointer and the same RDMA transport address union I talked about above when discussing address resolution.

It would be possible to have another function like rdma_getpeername() that takes the transport address and returns a source IP address. In the IB case this would do an ATS reverse lookup. However, I hate this idea. iSER already uses the CM private data to pass the source IP in the IB case, and I would much rather fix NFS/RDMA to do the same thing (so we can just kill ATS as an address resolution method).

 - R.